Thank you for joining us online!

We’ve all had to evolve over the last year, and we think virtual conferences should evolve as well. With Failover Conf 2: Fail Smarter, we wanted to create a more engaging, collaborative conference.

With panel discussions, lightning talks, fireside chats, dance parties, pet slideshows, and more, this wasn’t like any other virtual conference. This year’s talks discussed how remote teams have evolved their cultures of reliability, how companies have evolved their incident response plans, and how Chaos Engineering has helped teams evolve from traditional testing. We learned how teams have adapted over the past year and engaged with others in the reliability community.

What's Next for DevOps

Emily Freeman, Author of "DevOps for Dummies"
For over a decade, the DevOps movement has been using cultural change to power technological transformation and help companies deliver better products faster and more reliably. While many organizations have embraced this change and reaped the benefits, it hasn't come without challenges and many more remain. In this session, Emily Freeman (author of DevOps for Dummies) shares what's next for DevOps and how it will impact your organization.

Panel Discussion: The Evolution of Teams & Culture

Divya Balasubramanian, Senior Product Manager @ PagerDuty, Karishma Irani, Product Management Lead @ LaunchDarkly, Lena Reinhard, VP Product Engineering @ CircleCI, & Loretta Stokes, Director of Software Engineering Manager @ Eventbrite
The most successful organizations are the ones that embrace change and use it to become stronger and more resilient. In this panel discussion, we talked with engineering leaders about how they adapted to the challenges of 2020, what successes (and failures) they've seen, and where the future of reliable engineering is headed.

Fireside Chat: Jeff Smith and Matt Stratton

Jeff Smith, Director, Production Operations @ Centro & Matt Stratton, Host, Arrested DevOps podcast
Matt Stratton, host of the Arrested DevOps podcast, hosted Jeff Smith, Director of Production Operations at Centro and author of the book "Operations Anti-patterns, DevOps Solutions," for an engaging conversation about building reliable teams using DevOps principles.

Panel Discussion: The Evolution of Observability & Monitoring

Ashley Miller, Senior Director, Engineering @ Datadog, Daniel Khan, Director of Technology Strategy & Head of Open Source @ Dynatrace, Emily Nakashima, VP of Engineering @ Honeycomb, & Stijn Polfliet, Director Developer Enablement @ New Relic
Observability and monitoring are critical to detecting and troubleshooting problems to build more reliable applications. As our systems become increasingly complex, our tools for getting this crucial visibility and the way we respond need to evolve too. We sat down with SRE leaders to discuss the processes they use to get the most insight into their applications, how they've increased the speed of detection and response, and what organizations need to do to stay on top of growing complexity.

Fireside Chat: Jesse Robbins and Kolton Andrus

Jesse Robbins, Master of Disaster @ Heavybit & Kolton Andrus, CEO, Co-founder @ Gremlin
Long before Chaos Engineering was even a phrase, Jesse Robbins was Amazon.com's "Master of Disaster," using intentional failure to help the company become more reliable. Kolton Andrus (CEO at Gremlin) sat down with Jesse to learn more about his early work with GameDays, the evolution of reliability, and where the future of SRE lies.

Pragmatic Incident Response: Lessons learned from failures

Robert "Bobby" Ross, CEO, Co-founder @ FireHydrant
Incident response is overwhelming. So where do you start? There's a lot of advice out there, but it's mostly theory that doesn't take reality into account. So how do you get a process in place that actually works and scales? In this session, FireHydrant CEO and Co-Founder Robert Ross shares quick stories from his experience as an SRE and the tips he’s learned along the way.

Leaving the Nest: Guidelines, guardrails, and human error

Laura Santamaria, Developer Advocate @ LogDNA
When we talk about reliable systems, we talk a lot about human error. Human error in an incident or a bug report is often treated with a bit of a facepalm reaction. The term masks a lot of scenarios from accidents to exhaustion to everything in between. However, human error helps us understand where our processes failed and how we can prevent the same error from happening again. In short, we need to think in terms of a framework of guidelines and guardrails. In this short talk, Laura discusses how guidelines like runbooks and guardrails like automation can help us address the fact that everyone will, at some point, make mistakes.

Fireside Chat: Ines Sombra and Ana Medina

Ines Sombra, Sr. Director of Engineering @ Fastly & Ana Medina, Senior Chaos Engineer @ Gremlin
Reliability is a requirement for the modern internet. Ana Medina joined Inés Sombra, Sr. Director of Engineering at Fastly, to discuss their approach to resilience, how the past year has influenced the way they work, and what practices your engineering organization can adopt to become more reliable.

Implementing DevSecOps in the DoD

Nicolas Chaillan, Chief Software Officer @ United States Air Force
Delivering software quickly and securely is important for every organization, but it's even more important at the US Department of Defense (DoD), where reliability directly impacts national security. Nicolas Chaillan (Chief Software Officer, US Air Force) discusses the DoD Enterprise DevSecOps Initiative - an initiative he leads along with the DoD’s Chief Information Officer that brings automated software tools, services, and standards to DoD programs. He also discusses Platform One, the Air Force's DoD-wide DevSecOps Enterprise Level Service that provides managed IT services capabilities, on-boarding, support, and baked-in zero trust security. This insight from operating at the most rigorous level will help you level up your own organization.

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.