Presentation: Properties of Chaos

Track: Chaos, Complexity, and Resilience

Location: Soho Complex, 7th fl.

Duration: 10:35am - 11:25am

Level: Intermediate

Persona: General Software

What You’ll Learn

  • Learn how and why chaos engineering is being applied to autonomous vehicle safety.

  • Hear how property-based testing principles can positively influence and enrich chaos engineering goals.

  • Understand how to advance chaos engineering practices beyond basic properties like system availability and into verifying system correctness.

Abstract

Chaos Engineering is an essential component of the validation methods we use to develop resilient, safety-critical autonomous vehicle software systems.

There's an adage in some functional-safety circles that goes something like, "The risk and danger live in the interfaces." Among other things, this is a succinct way of stating that it's at the integration points where things most commonly break down. In a traditional safety-critical development process, this focus on danger at the interface level is partially born out of the assumption that the rigors of formalized safety-critical development processes (e.g. ISO 26262, IEC 61508) will have squeezed out serious issues in the design and in the various components that make up a system.

As it turns out, it is true that there's enormous opportunity for failure at the integration points between different systems and components, but it's also true that even some of the most rigorous SDLC processes available today leave room for unintended, undefined, or undesirable emergent behaviors elsewhere in the implementation. This problem is exacerbated significantly by the scale and complexity of the systems that are required to facilitate and operate an autonomous vehicle.

By automatically exploring the input space of chaos in a given system, we try to build stronger inductive proofs of our system's resilience semantics, so that we can augment the deductive proofs of correctness we derive from the use of formal methods in other facets of our solutions. The ultimate goal is to make assurances about safety properties and resiliency behaviors that would otherwise be impossible without the use of Chaos Engineering.
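
As a purely illustrative sketch of what "automatically exploring the input space of chaos" can look like, the Python below generates random fault schedules for a toy two-replica system and searches for one that violates a stated property. Everything here is an assumption made for illustration: the fault names, `run_system`, and `find_counterexample` are hypothetical and are not taken from PolySync's framework or any real library. Note that a search that finds no counterexample is accumulated evidence, not a deductive proof.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical fault vocabulary for a toy system (illustration only).
FAULTS = ["crash_a", "crash_b", "restart_a", "restart_b", "noop"]

Schedule = List[str]


def run_system(schedule: Schedule) -> Tuple[bool, bool]:
    """Apply a fault schedule to a toy two-replica system.

    Returns (a_up, b_up) after every fault has been applied in order.
    """
    up = {"a": True, "b": True}
    for fault in schedule:
        action, _, node = fault.partition("_")
        if action == "crash":
            up[node] = False
        elif action == "restart":
            up[node] = True
        # "noop" leaves the system unchanged.
    return up["a"], up["b"]


def at_least_one_replica_up(schedule: Schedule) -> bool:
    """A deliberately over-strong claim: some replica survives ANY schedule."""
    a_up, b_up = run_system(schedule)
    return a_up or b_up


def find_counterexample(
    prop: Callable[[Schedule], bool],
    trials: int = 1000,
    max_len: int = 6,
    seed: int = 0,
) -> Optional[Schedule]:
    """Sample random fault schedules; return the first one that falsifies prop.

    Returns None if no sampled schedule violated the property.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(0, max_len)
        schedule = [rng.choice(FAULTS) for _ in range(length)]
        if not prop(schedule):
            return schedule
    return None
```

Calling `find_counterexample(at_least_one_replica_up)` should surface a schedule in which both replicas crash without a restart, which shows the claim was too strong and needs a precondition (for example, "at most one replica is down at a time").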

Question: 

Why a talk called Properties of Chaos?

Answer: 

Something I noticed consistently at QCon San Francisco and at both the Chaos Community Days in Minneapolis and San Francisco was that many people interested in, or even practicing, chaos engineering are still rooted in a world where they're mostly worried about whether or not the system is simply “up”. They are largely answering an important but still fairly simple question: "Is my service still available?"

Availability, as a subset of resiliency, tends to be addressed via redundancy at one or more levels of a system: redundant processes, redundant data, redundant computers, etc. Getting to a system that has simple availability properties is straightforward once you start going down that path. For a while you just add more redundancy. The bigger challenge becomes properties of correctness. Being alive and doing exactly and only what the system is supposed to be doing is a whole different thing. So what I want to do in this talk is suggest that we level up our expectations of what to get out of chaos engineering, and in doing so consider how we as a community can utilize chaos engineering to make increasingly interesting and nuanced claims about our systems’ resiliency characteristics.
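
To make the availability-versus-correctness distinction concrete, here is a minimal Python sketch (an editorial illustration, not the speaker's code). It assumes a hypothetical primary/backup register, `ToyRegister`, that acknowledges writes before replicating them. After the primary crashes, the system still answers reads, so an availability check passes, while a correctness check on the same state fails.

```python
from dataclasses import dataclass


@dataclass
class ToyRegister:
    """Hypothetical primary/backup register with lagging replication."""

    primary: int = 0
    backup: int = 0
    primary_up: bool = True
    acked: int = 0  # last value acknowledged to a client

    def write(self, value: int) -> bool:
        """Write to the primary; refuse the write if the primary is down."""
        if not self.primary_up:
            return False
        self.primary = value
        # Planted flaw: the write is acknowledged before it is replicated.
        self.acked = value
        return True

    def replicate(self) -> None:
        """Copy the primary's value to the backup (only while the primary is up)."""
        if self.primary_up:
            self.backup = self.primary

    def crash_primary(self) -> None:
        """Injected fault: the primary stops serving."""
        self.primary_up = False

    def read(self) -> int:
        """Serve from the primary, failing over to the backup if it is down."""
        return self.primary if self.primary_up else self.backup


def is_available(reg: ToyRegister) -> bool:
    """Availability property: a read returns some answer."""
    return reg.read() is not None


def is_correct(reg: ToyRegister) -> bool:
    """Correctness property: a read returns the last acknowledged write."""
    return reg.read() == reg.acked
```

For example, writing 1, replicating, writing 2, and then calling `crash_primary()` leaves the register "up" (the backup still answers), yet `is_correct` is false because the read returns the stale value rather than the acknowledged one. Adding more redundancy would not fix this; the property being violated is about what the system returns, not whether it responds.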

In this talk, I'll dive into PolySync’s understanding of chaos engineering and hopefully give people an idea of what new insights are possible when you adopt this methodology. I will also discuss a bit what our chaos testing framework looks like, so that people can have a sense of how they might use it when we open source it in the coming months.

Question: 

Where do you use Chaos Engineering in the development of self-driving cars?

Answer: 

PolySync’s products all exist in the autonomous vehicle stack below the autonomy applications that do things like object detection, path planning, etc. The safety-critical runtime we’re developing, Helios, for production Level 4 and Level 5 vehicles is essentially infrastructural as well.

What that means in this context is that somebody else’s system makes a decision, and then it is Helios’s responsibility to ensure that decision is actually executed reliably. We do this in part through resilient orchestration of the underlying hardware in the vehicle architecture. Which is just a fancy way of saying “managing redundancy”.

These software systems are extremely complex in themselves, and they must also integrate with extremely complex electro-mechanical systems in the vehicle. Given how critical the nature of these systems is, it becomes essential that we be able to find faults in our own software, and also that we’re able to validate or invalidate the behavior of systems we must integrate with. This is where Chaos Engineering comes in.

Question: 

What am I going to walk away with after your talk?

Answer: 

You should have a sense of where chaos engineering fits into PolySync’s larger ethos of test and validation methods to try to ensure that our software is resilient and safe. You should get some insights into how we’re trying to make chaos engineering more approachable through the framework we’re building, Logos (named in the heritage of Heraclitus of Ephesus). You will learn a bit about the connection we’ve drawn between property-based testing and chaos engineering. Finally, you’ll hopefully feel a bit better about the sometimes scary and uncertain future of autonomous vehicle safety after understanding how we’re trying to influence their development and deployment.

Speaker: Nathan Aschbacher

Having spent the last several years designing fault-tolerant, high-availability, and high-assurance systems for large-scale data platforms, machine-learning pipelines, and global financial transaction processing, Nathan Aschbacher turned toward the concerns of understanding and advancing software functional safety for autonomous vehicles.
