Presentation: Chick-fil-A: Milking the Most Out of 1000's of K8s Clusters

Track: Container and Orchestration Platforms in Action

Location: Broadway Ballroom South, 6th fl.

Duration: 10:35am - 11:25am


Slides: Download Slides

Level: Intermediate

Persona: Architect, Developer, DevOps Engineer

This presentation is now available to view on InfoQ.com


What You’ll Learn

  • Hear about an unconventional use case of Kubernetes that involves a large number of clusters rather than a large number of containers.

  • Learn what Chick-fil-A saw as the compelling reason to shift from Swarm to Kubernetes.

  • Learn about the new tools Chick-fil-A developed to address the needs of their use case.

Abstract

Last year, I shared how the Internet of Things and Edge Compute are providing a new platform for Chick-fil-A to transform its in-restaurant operations, from the customer experience to kitchen automation. At that time, we were running Docker Swarm to manage our container-based microservices at the Edge. We have since shifted to running Kubernetes.

The nature of our business requires an interesting scale. While most companies running Kubernetes in production are running thousands of containers over tens of clusters, Chick-fil-A is running tens of containers over thousands of physically distributed clusters. This scale “breaks” some of the native tooling and requires significant control plane development.

In this session, Brian Chambers (Architecture) and Caleb Hurd (SRE) will share how Chick-fil-A manages connections and deployments to our restaurant Edge Kubernetes clusters using two to-be-announced open source projects. You will learn how we obtain operational visibility into our services, including logging, monitoring, and tracing. We will also share early lessons and battle stories from running Kubernetes at the Edge.

Question: 

How does the Kubernetes implementation at Chick-fil-A differ from more common approaches to Kubernetes deployments?

Answer: 

Brian: Most of the people we see in the industry running Kubernetes clusters at any kind of scale (or really running any big container platform) generally have a cloud-based infrastructure that they run in AWS (or Google). Most companies have one or a few sizeable Kubernetes clusters with a large number of nodes (on the order of hundreds or even thousands). Each of these clusters then runs tens to hundreds of thousands of containers. We have a similar infrastructure in our cloud environment from a control plane perspective.

We are a little different in that we are running a Kubernetes cluster of three nodes in each of our restaurants. This amounts to roughly 2000 clusters at scale. The number of container instances is more in the tens-of-containers range per restaurant. Our scale is massive, but in a unique way. We run containerized, highly available, business-critical applications in Kubernetes, but in a very small footprint.

Question: 

What problems does this use case present?

Answer: 

Brian: The Kubernetes ecosystem is awesome, but there are some challenges we have that it does not address. One example is deployments. We want to provide a technology platform that lets us move at the speed of our business in our restaurants. This means being able to roll out highly distributed changes to production very frequently. During the talk, we will discuss a tool we built called “Fleet” that we use to manage deployments to our restaurants. There are some Kubernetes-native tools to help with deployments, but we found they came up short for the type of environment we have.
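
To give a feel for the problem (this is a generic illustration, not the Fleet tool Brian mentions), the naive approach of pushing a single manifest out to every restaurant cluster with kubectl might look like the sketch below; the kubeconfigs/ directory layout and the app-deployment.yaml manifest are hypothetical:

```python
#!/usr/bin/env python3
"""Illustrative sketch only: push one manifest to many clusters.

This is NOT Chick-fil-A's Fleet tool. It assumes one kubeconfig file
per restaurant under ./kubeconfigs/ and a hypothetical manifest file.
"""
import glob
import subprocess

MANIFEST = "app-deployment.yaml"                       # hypothetical manifest to roll out
KUBECONFIGS = sorted(glob.glob("kubeconfigs/*.yaml"))  # one kubeconfig per restaurant

failures = []
for kubeconfig in KUBECONFIGS:
    try:
        # Each restaurant cluster is reached with its own credentials.
        subprocess.run(
            ["kubectl", "--kubeconfig", kubeconfig, "apply", "-f", MANIFEST],
            check=True, capture_output=True, text=True, timeout=60,
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as err:
        # Restaurants drop offline routinely, so a one-shot push like this
        # is not enough; real tooling needs queuing and retries.
        failures.append((kubeconfig, err))

print(f"applied to {len(KUBECONFIGS) - len(failures)} clusters; {len(failures)} failed")
```

Even this toy version hints at why purpose-built tooling is needed: at roughly 2000 clusters, intermittent restaurant connectivity turns every rollout into an exercise in retries and partial-failure tracking.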

Caleb: The practice of clustering bare-metal Kubernetes is still not mature, so there isn’t a lot of support available. The number of restaurants we have to roll out to is large, and we usually have a non-technical person doing the installs, so the devices have to come online and cluster themselves with little to no intervention. The devices we ship to the restaurant have to be smart enough to find each other and also be able to self-heal: if one of the nodes drops off, the other nodes should re-cluster themselves without dropping workloads. Achieving this has been a challenge.
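
As a rough sketch of the bootstrapping decision Caleb describes (this is not Chick-fil-A’s actual implementation; it assumes kubeadm, a fixed in-restaurant control-plane address, and pre-provisioned join credentials), each node has to decide unattended whether to start a new cluster or join one that already exists:

```python
"""Simplified self-clustering sketch, not a production implementation."""
import socket
import subprocess

CONTROL_PLANE = "10.0.0.2"   # assumed well-known in-restaurant address
API_PORT = 6443              # default Kubernetes API server port


def api_server_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if something is already listening on the API port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def bootstrap() -> None:
    if api_server_reachable(CONTROL_PLANE, API_PORT):
        # A cluster already exists in this restaurant: join it.
        # JOIN_TOKEN and CA_HASH are placeholders that would come from
        # pre-provisioned secrets on the device.
        subprocess.run(
            ["kubeadm", "join", f"{CONTROL_PLANE}:{API_PORT}",
             "--token", "JOIN_TOKEN",
             "--discovery-token-ca-cert-hash", "CA_HASH"],
            check=True,
        )
    else:
        # No cluster yet: this node initializes the control plane.
        # In reality, two nodes booting at once would need some form of
        # leader election here to avoid creating two clusters.
        subprocess.run(["kubeadm", "init"], check=True)


if __name__ == "__main__":
    bootstrap()
```

The hard cases this glosses over, such as two nodes booting at the same time or a failed node rejoining later without dropping workloads, are exactly where the self-healing challenges Caleb mentions show up.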

Question: 

At last year’s conference you spoke about Swarm; now we’re talking about Kubernetes. Will you be talking about why you made the shift to Kubernetes?

Answer: 

Brian: Yes, I will give a clear explanation of why we moved to Kubernetes and the other alternatives we considered.

Question: 

Who is the intended audience?

Answer: 

Caleb: Even though I am an SRE myself, I am going to focus on senior software developers with some SRE interest. The focus of the talk will be on ‘This is how it helps us deliver software’ and not just on ‘This is how we have the infrastructure working’.

Brian: I would not pick a role necessarily, but I would like to address people who are working on Kubernetes or container orchestration at a significant scale. I think what we are doing will be interesting to them even if their problem space is a bit different. The talk should give them a different perspective.

Question: 

What do you feel is the most important trend in software today?

Answer: 

Caleb: Everything we do as developers has no value until it goes into production. I think the industry is now trying to peel away the layers between developers and production code, and container orchestration is one step toward that. So I think the important trend is a philosophical shift toward deploying an idea into production quickly. I hope that a year or two from now, SREs and DevOps will have been absorbed into the software development world, and we will all be developing features and launching projects directly into production because the operational layers will have been abstracted away.

Brian: I completely agree with Caleb. The purpose of building software is to create value for businesses. And we should work towards maximizing the time spent on that versus the time spent on dependency management, orchestration, and availability. While this was very challenging in the past, there are a lot of great technologies that are making it possible today.

Speaker: Brian Chambers

Enterprise Architect @ChickfilA

Brian Chambers is an Enterprise Architect at Chick-fil-A in Atlanta, GA. He focuses on delivering new platforms and capabilities such as Self-Service Analytics, Cloud, and the Internet of Things to the business. Most recently, he has been focused on building out an Internet of Things platform to enable Chick-fil-A’s next-generation restaurant. He also enjoys spending time researching and understanding emerging technologies and finding ways to integrate them into Chick-fil-A’s technology strategy.


Speaker: Caleb Hurd

Site Reliability Engineer @ChickfilA

Caleb Hurd is a Site Reliability Engineer at Chick-fil-A with a broad background that includes Fortune 500s, startups, and everything in between. His passion, however, is enabling small development teams to deliver their software reliably and rapidly to customers by empowering them with the tools and pipelines to do so.

With the recent explosion in incredible, but also confusing technologies that enable rapid delivery, Caleb enjoys fitting together just the right tools for the right job. 

He enjoys photography, coffee and his kids... He also happens to work on SRE stuff in his spare time.


Similar Talks

CockroachDB: Architecture of a Geo-Distributed SQL Database
Peter Mattis, CockroachDB maintainer, Co-founder & CTO @CockroachDB

Alibaba Container Platform Infrastructure - a Kubernetes Approach
Fei Guo, Senior Staff Engineer in Alibaba Container Platform Group

How to Evolve Kubernetes Resource Management Model
Jiaying Zhang, Software Engineer @Google Kubernetes team

Securing a Multi-Tenant Kubernetes Cluster
Kirsten Newcomer, OpenShift Senior Principal Product Manager @RedHat

Managing Kubernetes with Istio
Mofizur Rahman, Software Developer/Developer Advocate @IBM