QCon New York June 15-19, 2020 | The Service Mesh: It's About Traffic

This presentation is now available to view on InfoQ.com

What You’ll Learn

Hear about introducing service mesh at Twitter and some of the challenges encountered.
Find out about some of the pitfalls of adopting a service mesh architecture.
Discover how Buoyant and Linkerd can help with adopting a service mesh.

Abstract

The "cloud native" ecosystem---largely centered around Kubernetes and microservice architectures---has taken the industry by storm. As teams rush into this brave new world, they quickly find that there's a need for a new set of instrumentation and tooling primitives. This can be overwhelmingly complex, and without a disciplined, incremental approach, these migrations can be doomed to fail.

In 2016, we launched the Linkerd project, a service mesh hosted by the Cloud Native Computing Foundation, to give operators control over the traffic between their microservices. Since then, dozens of organizations have adopted Linkerd as a component of their cloud native architecture; and we've learned a ton about the pitfalls and pratfalls of adopting a service mesh.

In this talk, we'll discuss:

How the service mesh feature set developed organically at early cloud native companies like Twitter
The lessons we've learned helping dozens of organizations get to production with Linkerd
How we've applied these lessons to tackle complexity with Linkerd

Question:

Tell me a bit about you, Buoyant and Linkerd.

Answer:

Before Buoyant I worked at large Internet companies, Yahoo and Twitter, and Twitter is where I learned the most about what we're doing at Buoyant and with Linkerd. Twitter was working on a framework called Finagle, and we were building a microservice architecture. At Twitter we didn't have those terms or the awareness that that's what we were doing. We called it SOA, but we were starting to build that architecture, and in our minds Finagle was the first service mesh: there was control plane functionality that we operated with ZooKeeper and various other components, and a very smart data plane that could leverage it for application owners.

As we dealt with incidents at Twitter, a lot of our lessons and reliability concerns were implemented in that data path, in Finagle. So when I was leaving Twitter, we wanted to have that same data plane value. We knew that if microservices were going to be successful, we had to have that smart data plane that could handle those concerns, but we didn't expect the world to start writing Finagle, which is a Scala/JVM library: great if you love it, but if you're not in that ecosystem it's not that great. At the same time, Go was coming out along with a proliferation of other new languages, Node was getting attractive, and we were seeing a lot of shops not being able to move everything onto one platform. We thought we had to pull this out into a separate component in the infrastructure, with Linkerd as a proxy to start. That started back in the Mesos days, where we had a very similar set of concerns but a slightly different operating model, so a JVM per host was a reasonable pattern when we started. Then Kubernetes came out and we realized that we could have a much lighter-weight data plane. So we spent about a year and a half to two years rewriting Linkerd as v2. Linkerd 2 is a very Kubernetes-focused system. Linkerd 1 was a very generic system. You could basically build arbitrarily complex service meshes, and lots of people wanted us to bridge together Mesos and Kubernetes and Docker or Nomad and do all these very complex things.

And that was great for the set of users who were building these platforms. Fast forward a few years and you see Kubernetes become the king, the de facto scheduler. Kubernetes has a very different focus from what we were doing back then. Kubernetes has a simple set of abstractions that you can get started with and operate, plus the building blocks. Kubernetes is a platform that we want to build into. We don't want to try to abstract all these different types of platforms. With Linkerd our goal is to make it Kubernetes-native so it just fits right in. There's a lot less learning, configuration, and building your own glue layers, and we can get you the value proposition of an observable, debuggable, reliable, secure data plane for any language with minimal cost. That's been the focus: getting as close to Kubernetes and into that ecosystem as we can.

Question:

Tell me about the use cases that you're going to be talking about. Getting to Kubernetes. Is that what's going to be here?

Answer:

One, I want to talk about the lessons from Linkerd 1 that we applied to Linkerd 2: the complexity of getting started, and focusing on the value proposition that we think needs to be there. That is about Linkerd 2, of course, but a service mesh shouldn't be another layer of complexity that you have to add on to your system afterwards. It has to make your Kubernetes experience better, easier to debug, and easier to use. I want to focus on our philosophy of providing value out of the box without configuration, without having a whole bunch of knobs that you have to go learn.

Question:

One of the gotchas with a service mesh is that people just expect, oh, I've got a service mesh, I'm fully observable, I can totally run this thing with just what comes out of the box. Are you going to touch on that story?

Answer:

Yeah. Service mesh is new and there has been so much marketing in that space, and we're a part of that. Google obviously is part of it as well, but the category is not driven by use case at the moment. I have microservices. I have Kubernetes. How do I run them? What we've seen is that organizations go into that thinking, I'm just collecting a set of tools and I'm going to figure out how to use them. They struggle with complexity because the tools are not necessarily designed to work together and they're adopting 10 things at once. So not only are they adopting Kubernetes and a service mesh, they're adopting new CI/CD practices and new frameworks in their code, and they're boiling the ocean. What we've seen, and the lesson we want to impart, is that we should use an incremental approach. You can't boil the ocean to do a migration. It has to be tactical, so introducing a service mesh should be one step. It's not, here's a whole bunch of primitives you have to learn. We'll give you a specific value prop, which may be mTLS. That's the real key thing some people want to solve with the service mesh. So, yeah, you get mTLS and you get observability as well. And then you can start to layer in more from there, but the story we want to tell is how you can get lost in this complexity and how there are some solutions for that.

Question:

When you talk about getting observability like Finagle gave you, do you go into what's required?

Answer:

Yeah. Traffic is another important detail. Again, because people are familiar with APM (application performance monitoring) solutions, they look at the service mesh and think, oh, this will be APM for microservices. That is true, but the service mesh provides a different set of data: what it tells you about is the traffic between the services. We don't do any configuration on the data plane. We just detect protocols transparently; if it's a protocol we know, like HTTP, we can get you rich data that's tied to the Kubernetes system, and we can leverage the Kubernetes data model and let you do Prometheus queries out of the box. So it is about that same value prop you get from Finagle, like, oh great, I get these dashboards and metrics for free. That batteries-included feeling. But it can be an incremental thing that you pull into your stack; it does not have to be a big adoption.

Question:

Are the stories that you're going to talk about stories of adopting a service mesh, the lessons of moving on to Kubernetes, or are they stories specific to Linkerd?

Answer:

They're not stories necessarily specific to Linkerd. I talk to lots of other companies. Twitter right now is in the process of adopting Kubernetes, and they have the same problems. Whether you're using Linkerd or just trying to move Twitter onto Kubernetes, you're dealing with the same sort of organizational friction. There are no green fields, so how do you move people, your engineers, onto this platform in a way that is incremental and controlled?

Question:

So the reference implementation is Linkerd's path to Kubernetes but the problems you're facing are common. You may talk specifically about some things that hit Linkerd 2 but it's more about the reference implementation for solving that problem.

Answer:

Yeah, the story is about how you get your organization onto Kubernetes. That's the bottom line. That's what we're all doing here. And how we do microservices. We will use Linkerd as the lens through which we look at those problems.

Question:

Who is the core audience that you're talking to?

Answer:

The main persona we're talking to is the people bringing in Kubernetes, who are responsible for that platform and view. They run the infrastructure for the engineering teams. I'm happy if developers are coming and curious about this, but we see it more as a DevOps problem.

Speaker: Oliver Gould

Co-Founder & CTO @BuoyantIO

Oliver Gould is the creator of Linkerd and the CTO of Buoyant, where he leads open source development efforts. Prior to joining Buoyant, he was a staff infrastructure engineer at Twitter, where he was the technical lead of the Observability, Traffic, and Configuration & Coordination teams.