Presentation: Chick-Fil-A: Milking the Most Out of 1000's of K8s Clusters
This presentation is now available to view on InfoQ.com
What You’ll Learn
- Hear an unconventional use case of Kubernetes that involves a high number of clusters, rather than a high number of containers.
- Learn what Chick-fil-A saw as the compelling reason to shift from Swarm to Kubernetes.
- Learn about new tools Chick-fil-A developed to address needs in their use case.
Abstract
Last year, I shared how the Internet of Things and Edge Compute are providing a new platform for Chick-fil-A to transform its in-restaurant operations, from the customer experience to kitchen automation. At that time, we were running Docker Swarm to manage our container-based microservices at the Edge. We have since shifted to running Kubernetes.
The nature of our business requires an interesting scale. While most companies running Kubernetes in production are running thousands of containers over tens of clusters, Chick-fil-A is running tens of containers over thousands of physically distributed clusters. This scale “breaks” some of the native tooling and requires significant control plane development.
In this session, Brian Chambers (Architecture) and Caleb Hurd (SRE) will share how Chick-fil-A manages connections and deployments to our restaurant Edge Kubernetes clusters using two to-be-announced open source projects. You will learn how we obtain operational visibility into our services, including logging, monitoring, and tracing. We will also share early lessons and battle stories learned from running Kubernetes at the Edge.
How does the Kubernetes implementation at Chick-fil-A differ from more common approaches to Kubernetes deployments?
Brian: Most of the people we see in the industry running Kubernetes clusters at any kind of scale (or really running any big container platform) generally have a cloud-based infrastructure that they run in AWS (or Google). Most companies have a single or a few sizeable Kubernetes clusters with a large number of nodes (on the order of hundreds or even thousands). These clusters then run tens to hundreds of thousands of containers across them. We have a similar infrastructure in our cloud environment from a control plane perspective.
We are a little different in that we are running a Kubernetes cluster of 3 nodes in each of our restaurants. This amounts to roughly 2000 clusters at scale. The number of container instances is more in the tens-of-containers range per restaurant. Our scale is massive but in a unique way. We run containerized, highly available, business-critical applications in Kubernetes, but in a very small footprint.
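To make the shape of this scale concrete, here is a back-of-the-envelope sketch in Python. The cluster count and nodes per cluster come from the answer above; the figure of 20 containers per restaurant is only an assumed stand-in for "tens of containers", and the class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeFootprint:
    """Rough size of a deployment made of many small clusters."""

    clusters: int
    nodes_per_cluster: int
    containers_per_cluster: int

    @property
    def total_nodes(self) -> int:
        return self.clusters * self.nodes_per_cluster

    @property
    def total_containers(self) -> int:
        return self.clusters * self.containers_per_cluster


# From the interview: roughly 2000 restaurants, 3 nodes in each.
# ASSUMPTION: 20 containers per restaurant stands in for "tens of containers".
edge = EdgeFootprint(clusters=2000, nodes_per_cluster=3, containers_per_cluster=20)

print(edge.total_nodes)       # 6000
print(edge.total_containers)  # 40000
```

Under these assumed numbers, the total node and container counts are modest by cloud standards; what is unusual is that they are spread over about 2000 separate control planes.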
What problems does this use case present?
Brian: The Kubernetes ecosystem is awesome, but it does not address some of the challenges we have. One example is deployments. We want to provide a technology platform that lets us move at the speed of our business in our restaurants. This means being able to roll out highly distributed changes to production very frequently. During the talk, we will discuss a tool we built called “Fleet” that we use to manage deployments to our restaurants. There are some Kubernetes-native tools to help with deployments, but we found they came up short for the type of environment we have.
Caleb: The practice of clustering Kubernetes on bare metal is still not mature, so there isn’t a lot of support available. The number of restaurants we have to roll out to is large, and we usually have a non-technical person doing the installs. So the devices have to come online and cluster themselves with little to no intervention. The devices we ship to the restaurant have to be smart enough to find each other and also be able to self-heal. In the event that one of the nodes drops off, the other nodes should re-cluster themselves without dropping workloads. Achieving this has been a challenge.
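One way to see why a three-node cluster can, in principle, keep going when a single node drops off is the majority-quorum arithmetic used by consensus-based stores such as etcd. The sketch below assumes such a majority-quorum model; the interview does not say how Chick-fil-A's clustering is actually implemented, and the function names are hypothetical.

```python
def quorum_size(members: int) -> int:
    """Smallest strict majority of a cluster with `members` voting nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum_size(members)


# Print members, quorum size, and tolerated failures for a few cluster sizes.
for n in (1, 2, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Under this assumption, a three-node restaurant cluster keeps a majority with one node down but not with two, and a two-node cluster tolerates no failures at all.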
At last year’s conference, you spoke about Swarm; now we're talking about Kubernetes. Will you be talking about why you made this shift to Kubernetes?
Brian: Yes, I will be giving a clear explanation of why we moved to Kubernetes and of the alternatives we considered.
Who is the intended audience?
Caleb: Even though I am an SRE myself, I am going to focus on senior software developers with some SRE interest. The focus of the talk will be on ‘This is how it helps us deliver software’ and not just on ‘This is how we have the structure working’.
Brian: I would not pick a role necessarily, but I would like to address people who are working on Kubernetes or container orchestration at a significant scale. I think what we are doing will be interesting to them even if their problem space is a bit different. The talk should give them a different perspective.
What do you feel is the most important trend in software today?
Caleb: Everything we do as developers has no value until it goes into production. I think the industry is now trying to peel away all the layers between developers and production code. Container orchestration is one step towards that. So I think the important trend is a philosophical shift towards deploying an idea into production quickly. I hope that in a year or two from now, SREs and DevOps will be absorbed into the software development world and we will all be developing features and launching projects directly into production because the operational layers will have been abstracted away.
Brian: I completely agree with Caleb. The purpose of building software is to create value for businesses. And we should work towards maximizing the time spent on that versus the time spent on dependency management, orchestration, and availability. While this was very challenging in the past, there are a lot of great technologies that are making it possible today.