Presentation: Control Planes: Designing Infrastructure for Rapid Iteration

Track: Container and Orchestration Platforms in Action

Location: Broadway Ballroom South, 6th fl.

Duration: 11:50am - 12:40pm

Level: Intermediate

Persona: Architect, Developer

This presentation is now available to view on InfoQ.com

What You’ll Learn

  • Learn some of the tradeoffs of different orchestration systems.

  • Hear Clever’s story of deploying to ECS and some of the lessons learned.

  • Understand that you can drive decision making based on what engineers want to use.

Abstract

As a small engineering team of 40 at Clever, we aim to focus all of our efforts on building feature depth and improving resiliency. As a company focussed on K-12 education, we want to maximize time working with our customers and not on building orchestration infrastructure. However, we also know that well-designed infrastructure and developer tooling allow us to move faster safely.

Our infrastructure team mirrors our product teams’ extreme focus on user experience, and we constantly evaluate our options. Over time we have moved our container orchestration system from an internally built prototype in 2014 to Mesos Marathon and finally Amazon Elastic Container Service. We build infrastructure when required, but move to an off-the-shelf solution when it satisfies our requirements to minimize ongoing maintenance. This has allowed our small team to build reliable products that support education in over 60% of K-12 schools in the US.

In this session I want to share our learnings on how to build developer control planes to allow your infrastructure team to make changes without disrupting engineers. Specifically, I will talk about:

  • Lessons learnt about building control planes using snapshots of our own service deployment orchestration tooling over the last four years. A lot of our building blocks are available as public repositories on GitHub.
  • Designing infrastructure tooling for rapid evolution and change using examples from the rollout of our batch processing system over the last year.
  • Evaluation and decision making frameworks for choosing between using cloud-managed, open source and build-your-own options through our own move from self-hosting containers to using a containers-as-a-service platform.

Question:

What is the focus of your work today?

Answer: 

I've been at Clever since the start (five and a half years, while the company itself started around six years ago). I joined as a software engineer, started out focussed on infrastructure and security, and also started the infrastructure team. Due to my tenure, I do end up dealing with things such as old database instances and the alerts and metrics for them.

However, as a technical product manager for infrastructure and security, most of my time is spent planning for the coming year. What that means is that, for now, our biggest focus is on resiliency. We are expecting some of our core usage to grow by 4 to 5x within a month.

Clever is an education company, so we get all of our users in one month between August and September. We drop down to very low usage during the summer as everybody's on vacation, and then when people come back, all of the work that we've done over the last year sees use. And even that happens in chunks, for example when the East Coast wakes up and kids go to school.

Question: 

What do your systems do? What is it actually managing?

Answer: 

We are an enterprise company, but for the education space. We connect school districts, primarily public school districts, to the education apps that they use in the classroom. We provide everything from account management and security (and thinking of data) to single sign-on and a portal.

If you use Chromebooks, which most schools in the country are using right now, you log in to the Chromebook using Clever. Kids can log into the Chromebook without entering a password by using Clever Badges (which use QR codes).

Basically, when a class starts, we get thousands of authorization requests of different kinds from different schools.

Question: 

So, you use Amazon Elastic Container Service to spin up different instances for the applications that they need to serve?

Answer: 

Yes. Using microservices was not an explicit decision that we ever made. We just found ourselves there. So, you know, we have about 400 different applications in our cluster, and we have 40 engineers.

Question: 

So, is it all Go (Golang)? Is that what you said before?

Answer: 

Yes, but we started on Node.js and MongoDB. MongoDB is still our primary data store, but we have polyglot data storage right now, and most of our backend services are in Go.

Question: 

As we were talking before, you mentioned that you've tried all these orchestration tools, you've done your own scripts, you've moved to Marathon, and, ultimately, made it to ECS. How are you going to tell the story?

Answer: 

We are an education-focused company that works with public schools. Most employees at Clever joined the company to make an impact in the classroom. Engineers at Clever really care about product delivery. They do care about solving complex problems, but they mostly care about the customer.

This is the primary driving factor for our technical infrastructure. How can we as an infrastructure team drive ourselves out of business every six months? We are a small team and only want to be solving problems that directly affect our customers. When we saw Docker, we realized that would allow engineers to focus very clearly on their application and completely isolate themselves from ‘infrastructure needs’.

Early on, the use of Docker suddenly took us over, just because everybody wanted to use it, and we were waiting for it. We rewrote scripts so the Docker containers would go on standard EC2 instances. It was just running one Docker container on an instance or two Docker containers on an instance, with no orchestration. But it pretended to be an orchestration system.

Coming to your question, the story that I think is exciting is how, like most things, we drove our decision making based on what engineers wanted to use: building the user interface and the tooling, and then using smoke and mirrors in the background to make that happen.

That allowed us to look at what our needs were at a specific time. For example, the first issue that we faced was that we were developing a lot of new asynchronous jobs and we had to deploy them quickly. EC2 instances were becoming slow and too expensive, so we had to make a change. Mesos was the system we used then, because we had to do asynchronous work and couldn't yet figure out a good way of getting load balancing or services to work right.

We kind of moved to Mesos and then we had a couple of senior engineers look into how you get load balancing working. While we were doing that, Kubernetes became big, so we started looking into it. We used Kubernetes for services for a little bit, and then ECS came out, which allowed us to kind of use our existing infrastructure and move much faster than we could with Kubernetes at that time.
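
The pattern described in this answer, a stable engineer-facing interface with the orchestration backend swapped out behind it, could be sketched roughly as below. This is only an illustration under assumptions: the class names (`Orchestrator`, `Ec2Scripts`, `Ecs`, `ControlPlane`), the `deploy` and `migrate` methods, and the returned strings are all hypothetical and are not Clever's actual tooling. No real AWS or Mesos calls are made.

```python
from abc import ABC, abstractmethod


class Orchestrator(ABC):
    """Hypothetical backend contract that the infrastructure team implements."""

    @abstractmethod
    def run(self, app: str, image: str) -> str:
        """Start the app and return a description of where it now runs."""


class Ec2Scripts(Orchestrator):
    """Stand-in for the early 'scripts on EC2' stage (no real API calls)."""

    def run(self, app: str, image: str) -> str:
        return f"{app}: started {image} on a dedicated EC2 instance"


class Ecs(Orchestrator):
    """Stand-in for a later managed container service stage (no real API calls)."""

    def run(self, app: str, image: str) -> str:
        return f"{app}: registered {image} as an ECS service"


class ControlPlane:
    """The only surface engineers touch; it stays the same across migrations."""

    def __init__(self, backend: Orchestrator) -> None:
        self._backend = backend

    def deploy(self, app: str, image: str) -> str:
        # Engineers call deploy() the same way regardless of the backend.
        return self._backend.run(app, image)

    def migrate(self, backend: Orchestrator) -> None:
        # The infrastructure team swaps the backend without changing deploy().
        self._backend = backend


if __name__ == "__main__":
    plane = ControlPlane(Ec2Scripts())
    print(plane.deploy("auth", "auth:1"))
    plane.migrate(Ecs())
    print(plane.deploy("auth", "auth:1"))
```

In this sketch the engineer's command never changes between the two deploys; only the object passed to `migrate` does, which is the sense in which the backend can be "smoke and mirrors" while the interface stays fixed.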

Question: 

Who is your main audience for this talk?

Answer: 

The main audience is architects, engineers, and infrastructure engineers: anybody who cares about the velocity or resiliency of an engineering team. I think the real focus is organizational and team-level engineering productivity.

I’d like engineers, through our stories, to be able to create space for technical experimentation while building complex systems, and to evolve rapidly without thrashing the engineering team’s velocity. We focus a lot on allowing ourselves time to think about decisions carefully while still delivering tools to engineers.

Even with infrastructure, you can start from the user interface. You don't have to solve all the technical problems first. You can look further into the future to your ideal architecture, knowing that others will fix many of those problems because many others are facing the same issues. Some features are more important for your team than for everybody else. Those are the solutions you need to be focussed on.

Speaker: Mohit Gupta

Product Manager, Infrastructure @clever

Mohit works at Clever to ensure that engineers at Clever have the tools and services that allow them to develop and release products to our users reliably, continuously, with flair and fun. Mohit has a background in technology policy, ethnography and science studies and has worked on building Clever for over five years. In the past he's been part of the Electronic Frontier Foundation, Microsoft Research and UC Berkeley's School of Information.
