Lead times and psychological safety within the Five Ideals

Gene Kim, IT Revolution

The biggest challenges engineering organizations face are not technical. They're fundamental problems with how we think about and go about our work, and with the environments we work in.

In this talk, Gene Kim will share the Five Ideals and how they relate to Chaos Engineering. He'll also show how the Five Ideals help build stronger, better-performing, and ultimately more reliable companies.

Chaos Engineering: the path to reliability

Kolton Andrus, Gremlin

We're all here for the same purpose: to ensure the systems we build operate reliably. This is a difficult task, one that requires balancing people, process, and technology under trying conditions.

We operate with incomplete information, assessing risks and dealing with emerging issues. We've found Chaos Engineering to be a valuable tool for addressing these concerns. Learn from real-world examples what works, what doesn't, and what the future holds.

Top 5 things you can do to reduce operational load

Rachel Obstler, PagerDuty

With the world shifting to everything online, digital dependency and pressure are higher than ever. In March, PagerDuty saw incidents double across the board for its customers, with significant spikes in industries like online learning and ecommerce. The pressure isn't letting up, nor are customer expectations.

Based on PagerDuty's data and conversations with thousands of customers, Rachel will talk about the easiest things you can do to make a big difference in reducing the operational work that comes from incidents. She'll also discuss ways to reduce duplicative effort, surface issues, and improve response times to build more reliable teams.

Failing over without falling over

Adrian Cockcroft, Amazon Web Services

Many organizations have disaster recovery (DR) failover plans that are poorly tested and implemented, and they are scared to test or use them in a realistic manner. This talk will show how we can use System Theoretic Process Analysis (STPA), as advocated by Professor Nancy Leveson’s team at MIT, to analyze failover hazards.

Observability and human understanding of safety margins and the state of a failover are critical to having a real DR capability. Chaos Engineering, game days, and a high level of automation provide continuously tested resilience, and confidence that systems will fail over without falling over.

Scaling reliability

Nate Vogel, Charter Communications

How do you build a culture of reliability in a massive organization with well-established expectations of how to operate? A common assumption about enterprises is that everything moves at a glacial pace.

After growing Charter's product data engineering team from a handful of engineers to 30, the company implemented a large reorg. The new data platforms group quadrupled in size to over 120 engineers and took on responsibility for a mission-critical services platform that backs customer self-service digital applications and portals. These services needed to grow their reliability and Chaos Engineering practice. Nate Vogel, VP of Data Platforms, will share how he grew the data engineering team with an emphasis on building a culture of reliability. He'll discuss the processes and tools his team used to ensure Charter and its customers have the data and analytics necessary to drive the business. Nate will also provide insight into how to scale a culture of reliability in the face of sudden team expansion.

Self-service Chaos Engineering: fitting Gremlin into a DevOps culture

Doug Campbell, Grubhub

In the era of DevOps and self-service culture, human processes are often harder than technical ones. Rolling out Gremlin to our infrastructure was easy, but enabling engineering teams to efficiently and safely practice Chaos Engineering was trickier.

In this session, I'll share how we rolled out Gremlin at Grubhub and how we educated and enabled all engineering teams to use it.

Stabilizing and reinforcing H-E-B's existing curbside fulfillment systems while reinventing them

Justin Turner, H-E-B

While going through the process of reinventing H-E-B's curbside and home delivery fulfillment systems, we had to spend significant effort to stabilize and reinforce the existing mission-critical systems to give us the cover needed to get to the finish line.

It took a blend of using new services as anti-corruption layers and addressing complex technical debt and performance issues to improve our uptime and reduce business impact. It also took our newly developed Chaos Engineering mindset to get creative in introducing failure to validate our fixes.
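
As a rough illustration of the anti-corruption-layer pattern (all names here are hypothetical, not H-E-B's actual code), a thin adapter can translate a legacy system's records into a clean new domain model, so the rest of the platform never depends on the legacy shapes directly:

```python
from dataclasses import dataclass

@dataclass
class Order:
    """The new domain model the rest of the platform depends on."""
    order_id: str
    status: str

class LegacyFulfillmentClient:
    """Stand-in for the legacy system's awkward interface."""
    def fetch_record(self, ref: str) -> dict:
        return {"REF_NO": ref, "STAT_CD": "P"}  # legacy field names and codes

class FulfillmentAdapter:
    """Anti-corruption layer: translates legacy records into the new model."""
    STATUS_CODES = {"P": "picking", "S": "staged", "D": "delivered"}

    def __init__(self, legacy: LegacyFulfillmentClient):
        self._legacy = legacy

    def get_order(self, order_id: str) -> Order:
        record = self._legacy.fetch_record(order_id)
        return Order(
            order_id=record["REF_NO"],
            status=self.STATUS_CODES.get(record["STAT_CD"], "unknown"),
        )
```

Because new code only ever sees `Order`, the legacy system can be stabilized, reinforced, or eventually replaced behind the adapter without rippling changes outward.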

The more you know: a guide to understanding your systems

Tyler Wells, Twilio

As a platform provider, we know that incidents and outages cost our customers money. It doesn't matter what your role is — developer, quality engineer, SRE, or even technical management — you must deliver trust.

Delivering trust is accomplished by shipping secure and reliable systems, and you have to know your systems in order to do that. I'll share how we developed a template that enables anyone at Twilio to understand their systems better, identify the critical metrics to watch, and use Chaos Engineering to verify it all.

Let devs be devs: abstracting away compliance and reliability to accelerate modern cloud deployments

Rahul Arya, JPMC

Reliability gets harder as complexity grows, and that makes shipping software difficult. The rigorous compliance requirements of the financial industry add further challenges to developer velocity on modern cloud platforms. Scale that up to an organization of JP Morgan Chase's size, with over 6,500 apps and 50,000 engineers spread across the globe, and it can bring everything to a grinding halt.

In this session, Rahul Arya, Managing Director & Head of Global Technology Solutions Architecture at JPMC, will share how they built a platform to abstract away compliance, make reliability with Chaos Engineering completely self-serve, and enable developers across the organization to ship code faster than ever.

Can chaos coerce clarity from compounding complexity? Certainly.

Matt Simons, Workiva

Let's go Black Swan hunting together. No, no -- you can leave the guns at home. The camo too. No bait, traps, dogs, or calls needed. This is a very different kind of hunting, and the tool we need is chaos. You see, the swans we're hunting aren't sitting in a tranquil pond or gliding majestically over a clear lake on a beautiful, sunny day. These swans are hiding in your products.

They are hiding in your architecture, your infrastructure, and every dark-corner-turned-refuge created by layer upon layer of increasing system complexity. And these swans, these Black Swans, are not friendly or majestic creatures. They are wild, coked-out maniacs, whose singular purpose is to watch your products burn. So suit up! Grab some coffee, put on something comfortable, and follow me, chaos tools in hand. Let's get some birds.

Automating chaos attacks

Nikos Katirtzis / Daniel Albuquerque, Expedia Group

In an effort to build resilience into our services, we at Hotels.com and Expedia Group explored processes and tools to stress and 'break' our systems on purpose.

In this session we will show you how to run attacks in both manual and automated ways, including attacks that run as part of the CI pipeline, attacks that run randomly in production using automation, and even experiments with chaos-as-a-service platforms that can be used in GameDays.
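
As a rough sketch of what a CI-pipeline attack might look like (the chaos API endpoint, health URL, and service name below are hypothetical placeholders, not Expedia Group's actual setup), a pipeline step can trigger a latency attack against a staging service and fail the build if availability drops below an agreed threshold:

```python
import sys
import time

import requests  # third-party HTTP client: pip install requests

# Hypothetical placeholders for a chaos-as-a-service API and a staging health check.
CHAOS_API = "https://chaos.example.com/api/attacks"
HEALTH_URL = "https://checkout-staging.example.com/health"

def run_latency_attack_and_verify(duration_s=60, threshold=0.99):
    """Trigger a latency attack, then fail the CI step if availability drops."""
    resp = requests.post(CHAOS_API, json={
        "type": "latency",
        "target": "checkout-service",  # hypothetical service name
        "delay_ms": 300,
        "duration_s": duration_s,
    })
    resp.raise_for_status()

    ok = total = 0
    deadline = time.time() + duration_s
    while time.time() < deadline:
        total += 1
        try:
            if requests.get(HEALTH_URL, timeout=5).status_code == 200:
                ok += 1
        except requests.RequestException:
            pass  # timeouts and connection errors count as failed checks
        time.sleep(1)

    availability = ok / total
    print(f"availability under attack: {availability:.3f}")
    if availability < threshold:
        sys.exit(1)  # non-zero exit fails the pipeline stage

if __name__ == "__main__":
    run_latency_attack_and_verify()
```

Wiring a script like this into a pipeline stage turns resilience from a one-off game day exercise into a regression test that runs on every release.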

Certainty among the chaos

Marco Coulter, AppDynamics

Chaos Engineering tests your application's resiliency by thoughtfully injecting failure and starving resources. Complete failure is obvious, but how do you detect the warning signs of pre-failure stress?

This session takes the capabilities of Chaos Engineering beyond resiliency to support capacity optimization. You already need to monitor performance to see when your code is bending before it breaks. Why not glean more insight from that data so you can prioritize efforts and respond rapidly?

IBM’s principles of Chaos Engineering

Haytham Elkhoja, IBM

IBM has a long history of improving the reliability and availability of systems ranging from the largest of mainframes to the smallest of microservices. As part of our cultural and organisational improvements, we've sat down and codified a list of Chaos Engineering principles that define our view of Chaos Engineering.

These principles do not replace existing principles, but adapt and match them to the requirements we have from our clients and from our own internal services. In this session we will describe a little of the process of getting engineers from across IBM to agree on these principles, and we'll present the principles and lessons we agreed upon.

Culturing resiliency with data: A taxonomy of outages

Ranjib Dey, Uber

This talk provides an overview of how the outages Uber experienced over the past few years can be categorized by root-cause type. We'll start with some background, including definitions, the incident management framework, and existing preventive techniques, aka best practices.

We'll follow with details and rationale for the individual categories and sub-categories and their relative distribution. Then we'll deep dive into two of the biggest categories, deployment and capacity, with a focus on time-series-based data mining techniques that assist in detecting and simulating some of the common root causes. Finally, we'll discuss how the lessons learned propagate into policy and process changes based on these insights.
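
As a toy illustration of that kind of time-series technique (the numbers and window sizes here are hypothetical, not Uber's actual tooling), a rolling z-score can flag the moment a deployment pushes an error-rate series outside its recent baseline:

```python
import statistics

def detect_anomalies(series, window=10, threshold=3.0):
    """Flag points more than `threshold` standard deviations above a
    rolling baseline computed from the preceding `window` points."""
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mean = statistics.mean(baseline)
        stdev = statistics.stdev(baseline) or 1e-9  # avoid division by zero
        if (series[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies

# Example: error rate per minute, with a deployment landing at index 12
error_rate = [0.01] * 12 + [0.09, 0.11, 0.02]
print(detect_anomalies(error_rate, window=10))  # -> [12, 13]
```

The same detector that flags a bad deployment in production can score a simulated fault during an experiment, which is what makes the detection and simulation sides of the taxonomy reinforce each other.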

Convergence of Chaos Engineering and revolutionized technology techniques

Yury Niño Roa, ADL Digital Labs

Novel research areas such as the Internet of Things (IoT), Artificial Intelligence (AI), Cybersecurity, and Human Augmentation (HA) have demonstrated great potential for solving specific problems. Medicine, transportation, software, education, and finance have all benefited from their progress. However, achieving this success requires taking risks and failing many times in order to build resilience.

This journey involves terms and techniques that we study in Chaos Engineering, so in this talk we are going to explore how these emerging paradigms can use Chaos Engineering to manage the pain points on the path to a solution. Conversely, we will show how Chaos Engineering can benefit from Artificial Intelligence, for example. Further, we will propose a conceptual model to explore the influence of these emerging paradigms on Chaos Engineering and how to use the Chaos Principles to identify risks and vulnerabilities and generate resilience solutions.

Lessons from incident management and postmortems at Atlassian

Jim Severino, Atlassian

How do you run incidents and postmortems at a company with thousands of engineers spread across the globe? Jim Severino shares what worked (and what didn't) for Atlassian.

Breaking serverless things on purpose: Chaos Engineering in stateless environments

Emrah Şamdan, Thundra

Serverless enabled us to build highly distributed applications with more granular functions and near-limitless scalability. However, it also spread the risk of failure from a single microservice across many serverless functions and resources. You might be able to predict and design for certain troublesome issues, but there are many, many more that you probably won't be able to plan for easily. How do you build a resilient system under these highly distributed circumstances? The answer is Chaos Engineering: breaking things on purpose to experience how the whole system will react.

Join us as we walk through:

  • The unique challenges of building a highly resilient serverless app
  • Why you need to design for problems you cannot predict and cannot easily test for
  • How you can plan your game days for chaos experiments with serverless components
  • How you can take advantage of out-of-the-box and third-party observability solutions to measure the impact of chaos experiments
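
As a minimal sketch of the fault-injection side of such experiments (the environment variable names and decorator below are illustrative, not any specific tool's API), a Lambda-style handler can be wrapped so that latency and errors are injected only when chaos settings are present:

```python
import functools
import os
import random
import time

# Illustrative only: CHAOS_LATENCY_MS and CHAOS_FAILURE_RATE are hypothetical
# environment variables, not part of any specific chaos tool's API.
def chaos(handler):
    """Wrap a Lambda-style handler with opt-in latency and fault injection."""
    @functools.wraps(handler)
    def wrapper(event, context):
        latency_ms = int(os.environ.get("CHAOS_LATENCY_MS", "0"))
        failure_rate = float(os.environ.get("CHAOS_FAILURE_RATE", "0"))
        if latency_ms > 0:
            time.sleep(latency_ms / 1000.0)  # simulate a slow downstream dependency
        if random.random() < failure_rate:
            raise RuntimeError("chaos: injected fault")  # simulate a downstream error
        return handler(event, context)
    return wrapper

@chaos
def handler(event, context):
    # Normal business logic; chaos is controlled entirely via environment variables.
    return {"statusCode": 200, "body": "ok"}
```

Flipping the environment variables on for a single function during a game day keeps the blast radius small and makes rollback instant.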

Identifying hidden dependencies

Liz Fong-Jones, Honeycomb.io

You don't need to write automation or deploy on Kubernetes to gain benefits from resilience engineering! Learn how Honeycomb improved the reliability of our Zookeeper, Kafka, and stateful storage systems through terminating nodes on purpose.

We'll discuss the initial manual experiments we ran, the bugs we uncovered in our automatic replacement tools, and the steps we needed to take to progress toward continuously running the experiments. Today, no node at Honeycomb lives longer than 12 months, and we automatically recycle nodes every week.
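
As a rough sketch of the manual version of such an experiment (assuming EC2 instances tagged by role; this is illustrative, not Honeycomb's actual tooling), terminating a randomly chosen node of a stateful cluster can be as simple as:

```python
import random

import boto3  # AWS SDK for Python; assumes credentials and region are configured

ec2 = boto3.client("ec2")

def terminate_random_node(tag_key="role", tag_value="kafka"):
    """Pick one running instance with the given tag and terminate it,
    trusting the cluster (and replacement automation) to recover."""
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        return None
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim
```

Running this by hand first, with engineers watching dashboards, is what surfaces the replacement-tooling bugs; only once recovery is boringly reliable does it make sense to put the termination on a schedule.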
