Advanced Postmortem Fu and Human Error 101

John Allspaw (Etsy)
Velocity Culture Ballroom EFGH
Please note: to attend, your registration must include Workshops.
Average rating: 4.71 (45 ratings)

Successful companies embrace their failures as opportunities to make their applications and organizations resilient, and part of that means paying close attention to outage investigations and digging in deep to understand how your (likely complex) system failed.

I’ll use actual Etsy.com examples to dig into the anatomy of an outage and to show what a blameless and satisfying postmortem meeting looks like.

Humans and their behavior under stressful conditions are also components of our architectures, and need just as much attention as load-balancers and schema changes do. I’ll talk about how the fields of Human Factors and Resilience Engineering converge on web operations, and what we can learn from those fields.

Some of the topics covered, all of which will have real-world illustrations:

  • Outage classifications and severity
  • Root Cause Analysis (RCA) methods: the Good, the Bad, and The Wrong™
  • How “hindsight bias” influences our culture and what to do about it
  • Human error types, forms, and surprising solutions to them (mistakes, lapses, slips, violations, etc.)
  • What adaptation and improvisation mean in the face of failure
  • What a “Just Culture” means in our field of web operations

John Allspaw

Etsy

John has worked in systems operations for over fourteen years in biotech, government and online media. He started out tuning parallel clusters running vehicle crash simulations for the U.S. government, and then moved on to the Internet in 1997. He built the backing infrastructures at Salon, InfoWorld, Friendster, and Flickr.

He is now VP of Tech Operations at Etsy, and is the author of The Art of Capacity Planning published by O’Reilly.


Comments

Suzanne Axtell
06/23/2011 11:13am PDT

Hey Justin, the video of John’s preso will be available to All Access Pass holders within a week. Glad you enjoyed it, thanks for the feedback!

Justin Brodley
06/23/2011 10:59am PDT

Hoping we get the video of this presentation soon!! Was definitely a highlight and would love to rewatch.

H. "Waldo" Grunenwald
06/20/2011 5:00am PDT

Fantastic presentation on doing Lessons Learned / Remediation instead of a Witch-Hunt. Highly Entertaining.

Daniel Cassiero
06/18/2011 12:53pm PDT

One of the best presentations of the Conference IMO.

Roy Rapoport
06/16/2011 11:37am PDT

Mind-bogglingly awesome of a presentation, John. Great balance of specific anecdotes, theory, and - of course - humor.

Ernest, to your point about company cultures, I think you’re spot-on. It’s not an Etsy thing specifically, of course—here at Netflix, we’ve got some work to do on post-mortems, but the Just Culture concept is strong.

But I’ll tell you this - I’ve worked in significantly older, more conservative, more corporate companies, and while the culture started out being very blame-based, I think you’d have been surprised by the ability of relatively few individuals to change the nature of post-mortems and turn them to be more accountability (rather than blame-) focused.

But hell, the role of company culture in EVERYTHING we talk about at Netflix cannot be overemphasized, and I’d really love to see more explicit conversations about it.

Eric McCraw
06/16/2011 11:03am PDT

Thank you Sir!

John Allspaw
06/16/2011 10:24am PDT

Eric: done.

Eric McCraw
06/16/2011 9:36am PDT

The Postmortem FU slides are not downloadable! Would you please, please, save me a bunch of work and change that John?

John Allspaw
06/15/2011 11:08am PDT

Thanks Ernest, I appreciate it.

I would love to say that we at Etsy have such hiring abilities. :)

I do indeed think it applies, because being “blame free” isn’t about being “accountability free”, and human error isn’t about the person, it’s about the organization. Focusing solely on the person (or the person’s characteristics) as a “cause” of error misses out on the opportunity to improve.

When there’s an action that contributes to an unexpected outcome, finding out why that action made sense to the person (at the time, given their environment) is key. If you’re only believing in the “if we can only get rid of these bad apples” theory to increase safety, and not working out the deeper issues, improvement is going to be unlikely. Like shoveling sand against a tide.

If you’re at Velocity, I’d love to talk more about this. :)

Ernest Mueller
06/15/2011 9:47am PDT

Great talk. Loved all the points, but have a question on the “blame free” lessons – it would seem to be driven by a culture like Etsy’s, where you are able to hire only the “cream of the crop” – does it really apply as totally in “corporate”/government environments, where you do have people who are disproportionate causes of error and slackitude?
