Track: Architecting for Success when Failure is Guaranteed

Location: Broadway Ballroom North, 6th fl.

Successfully Architecting for Failure has to include systems and people that work toward preventing failure. However, the complex and distributed systems of today often fail in ways that require a specific combination of variables in order to sneak past all the preventative barriers we’ve built and create new and exciting failure modes. This track recognizes that reality and includes talks that follow a structured idea of how to build systems and organizations that are best architected for success, with the ability to respond quickly to failure.

Track Host: Dave Hahn

SRE in the Cloud Operations & Reliability Engineering organization @Netflix

Dave Hahn is a member of the SRE team in the Cloud Operations and Reliability Engineering organization at Netflix. He has many years of experience in distributed systems, failures, and mis-attribution of complex problems to human error. Will talk for applause. Bad jokes likely.

10:35am - 11:25am

Making a Lion Bulletproof: SRE in Banking

Within ING, the largest bank of the Netherlands, we aim to be a tech company with a banking license. We have adopted DevOps as our way of working, use open source tools and technologies and adopt best practices from industry and the engineering community. However, we always have to take into account that we are a financial organization dealing with regulation and public opinion.

To improve the reliability of our services and keep up with regulator demands, we introduced SRE to the bank three years ago. This talk will cover the history, present, and future of our SRE team and practices. In doing so, we will touch upon people (hiring, coaching, organizational aspects, culture), process (way of working, education), and technology (observability, infrastructure), hoping to share lessons learned that can be applied to any organization starting or growing SRE, financial or not.

Janna Brummel, IT Chapter Lead Site Reliability Engineering @ingnl
Robin van Zijll, Site Reliability Engineer & Product Owner on the SRE Team @ingnl

11:50am - 12:40pm

How Did Things Go Right? Learning More From Incidents

Learning solely from failure isn't a fundamental; it's a limitation.

A look into the New View of Safety, Human & Organizational Performance, and Resilience Engineering shows us that safety, great performance, and sources of resilience do not come from the absence of failure, but rather the presence of adaptive capacity.

Navigating a perfect storm in a world where availability is made up and the 9's don't matter requires expertise. This talk will describe more rewarding ways to approach incident investigation without overly focusing on failure prevention.

  • What's going on when it seems like nothing is happening?
  • When failure does occur, what's going to keep it from being worse?
  • How do teams adapt successfully when preventative techniques fail?
  • How should we prioritize the effort to develop systems that help us safely manage the consequences of failure? 

These questions cannot be answered by trying to explain causes of failure and fixing remediation items.

We will move the needle forward and increase our opportunity for learning from success with some fundamental and practical ways that get us from "Why did things go wrong?" to "How did things go right?"

Ryan Kitchens, Site Reliability Engineering @Netflix

1:40pm - 2:30pm

The Trouble With Learning in Complex Systems

The complexity of the technology we actively design, build, and operate has eclipsed our ability to fully comprehend it. When continuous change is at the heart of our most precious systems, how do we balance protecting them while simultaneously improving the people and processes tied to making our tech more useful and valuable to end users as well as the business? A strong focus on learning as much about the system as possible is our best course of action, but learning requires both success and failure.

In this talk, we’ll explore the challenges of learning in complex systems, and the relationship between high- and low-stakes learning opportunities and their associated costs. Audience members will gain exposure to ideas and techniques that help improve operational knowledge as well as the mental models associated with our increasingly complex systems.

By adapting to new methods of learning and creating space for more of our systems to be knowable, teams can remove the mask of process from our past to unveil a clearer view of the future.

Jason Hand, Senior Cloud Advocate @Microsoft

2:55pm - 3:45pm

Graceful Degradation as a Feature

The move from monolith to microservice has allowed pieces of functionality to be deployed individually and on demand. Having functionality isolated allows the opportunity for one microservice to fail without bringing down the whole system.

However, it also increases complexity with the number of API calls being made across all of these services. Each service has unique failure models, whether it’s a database, cache, queue, etc. How can you be sure that one single failure doesn’t cause an outage for your end users?

Landing the launches of new products and features and providing your users with a positive experience is crucial to your success. If something is to fail, you’d prefer they didn’t know. Or if they did, it shouldn’t interrupt their experience.

In this talk, we’ll cover graceful degradation as an engineering goal which can be confidently tested with Chaos Engineering. By purposely causing failure of one service at a time in a controlled environment, you can safely observe the effect on the end user, whether that’s on a laptop browser, a mobile app, or the result of an API call.

Lorne Kligerman, Director of Product @GremlinInc

4:10pm - 5:00pm

What Breaks Our Systems: A Taxonomy of Black Swans

Black swan events: unforeseen, unanticipated, and catastrophic issues. These are the incidents that take our systems down, hard, and keep them down for a long time.

By definition, you cannot predict true black swans. But black swans often fall into certain categories that we've seen before. This talk examines those categories and how we can harden our systems against these categories of events, which include unforeseen hard capacity limits, cascading failures, hidden system dependencies, and more.

Laura Nolan, Site Reliability Engineer @Slack, Contributor to Seeking SRE, & SRECon Steering Committee

