Why the Yahoo FrontPage Went Down and Why It Didn't Go Down For up to a Decade before That

Jake Loomis (Yahoo!)
Operations Mission City
Average rating: ****.
(4.23, 13 ratings)

Yahoo!’s front page had a remarkable track record of stability. Over the years, more and more techniques were implemented to keep the critical and highly profitable www.yahoo.com stable. A thorough incident management system ensures that the lessons learned from each incident are followed up on and continually add to the robustness of the application.

This session will cover in depth the top 5 techniques that contributed to its stability.

The techniques described will include:
- Error-proofing change: make the change (with forked production traffic) before really making the change
- Global load balancing and performance optimization
- Redundancy for everything: hardware, software, network, DNS…heck, the entire internet
- Failure modes: everything can and will break, so have a band-aid ready
- Monitoring/alerting: monitor every part of your application as well as everyone else's application

Lastly, we will go into the causes of last year’s outage and how each of these techniques failed to prevent it.

Technique overviews

Error proofing change:
Description of the software release process, covering the phases of a release:
1. Continuous Integration environment with automated build, unit test, deploy, and test for each check-in.
2. QA environment with automated tests and debug statements where logs and monitors are closely watched during testing
3. Staging environment where the rollout process is tested with forked copies of production traffic
4. Production deployment – all code is dark launched and reviewed before activating in a phased rollout
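
As an illustration of step 4, a dark launch with a phased rollout can be sketched as a percentage gate: the code ships to production switched off, then is activated for a growing share of users. This is a minimal sketch, not Yahoo!'s actual mechanism; the function names `bucket`, `is_active`, and `active_count`, and the use of a hash of a user id, are assumptions made for the example.

```python
import hashlib


def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user id to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def is_active(user_id: str, rollout_percent: int) -> bool:
    """Return True if the dark-launched code path is on for this user.

    rollout_percent = 0 means the code is deployed but dark;
    rollout_percent = 100 means it is active for everyone.
    """
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    return bucket(user_id) < rollout_percent


def active_count(user_ids, rollout_percent: int) -> int:
    """Count how many of the given users see the new code path."""
    return sum(is_active(u, rollout_percent) for u in user_ids)
```

Because each user always lands in the same bucket, raising the percentage only adds users to the active set; nobody who already has the new code path loses it as the rollout widens.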

Global load balancing and performance optimization:
- Route traffic to nearest of over a dozen colos worldwide
- Ability to serve any country from any location
- Use in failure scenarios, maintenance, code changes, testing, etc.
- Able to sustain a complete outage in any country or region, whether caused by network, power, or act of God
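
The routing idea in the list above can be sketched as "pick the nearest colo that is currently healthy". This is only an illustration: the colo names and coordinates in `COLOS` are made up, and great-circle distance stands in for whatever proximity measure a real global load balancer uses (often network latency, typically applied at the DNS layer).

```python
import math

# Hypothetical colos with approximate (latitude, longitude) in degrees.
COLOS = {
    "us-west": (37.4, -122.0),
    "us-east": (39.0, -77.5),
    "eu": (53.3, -6.3),
    "asia": (1.35, 103.8),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def pick_colo(client, healthy):
    """Return the nearest colo to the client among those marked healthy."""
    candidates = [name for name in COLOS if name in healthy]
    if not candidates:
        raise RuntimeError("no healthy colo available")
    return min(candidates, key=lambda name: haversine_km(client, COLOS[name]))
```

Taking a colo out of the `healthy` set is the same operation whether the cause is an outage, planned maintenance, or a test, which is why one routing mechanism can serve all of those scenarios.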

Redundancy for everything
- Description of how to make DNS, network, servers, software, colo, dependencies, etc. redundant

Failsafe measures/Degrade gracefully
- Static page created every 15 minutes to serve traffic in a failure or traffic spike scenario
- Failed dependencies degrade gracefully
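
A rough sketch of both failsafe measures follows: a static snapshot that is served when live rendering fails or the site is overloaded, and a per-dependency fallback so one failed module does not take down the page. The `FrontPage` class, `with_fallback` helper, and `overloaded` flag are hypothetical names for illustration; in practice `refresh_snapshot` would be driven by a timer (every 15 minutes in the description above).

```python
class FrontPage:
    """Serve a live page, falling back to a periodically saved static copy."""

    def __init__(self, render):
        self._render = render      # callable that builds the live page
        self._snapshot = None      # last static copy, if any

    def refresh_snapshot(self):
        """Render and store a static copy; intended to run on a schedule."""
        self._snapshot = self._render()

    def serve(self, overloaded=False):
        """Return the live page, or the static copy on failure or overload."""
        if not overloaded:
            try:
                return self._render()
            except Exception:
                pass  # fall through to the static snapshot
        if self._snapshot is None:
            raise RuntimeError("no static snapshot available")
        return self._snapshot


def with_fallback(fetch, default):
    """Call a dependency; return a default value if it fails."""
    try:
        return fetch()
    except Exception:
        return default
```

The snapshot path deliberately does no work beyond returning stored content, so it stays cheap under exactly the conditions (spikes and backend failures) in which it is used.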

Monitoring/Alerting areas – senior engineers, equipped with data cards, are debugging within 5 minutes of any alert, 24/7
Extensive monitoring includes:
- System level monitoring
- End-to-end functionality checks per host
- All dependencies: success/failure rates
- Content “freshness”
- Performance – Server side duration
- Traffic levels – week over week
- Etc.
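
Two of the checks listed above, week-over-week traffic and content freshness, lend themselves to a short sketch. The function names and the thresholds (a 25% deviation and a 15-minute maximum age) are assumptions chosen for the example, not values from the talk.

```python
def traffic_alert(current, week_ago, tolerance=0.25):
    """Return True if traffic deviates from the same time last week
    by more than the given fraction (in either direction)."""
    if week_ago <= 0:
        # No baseline: treat any traffic at all as unexpected.
        return current > 0
    return abs(current - week_ago) / week_ago > tolerance


def is_stale(last_updated, now, max_age_seconds=900):
    """Return True if content is older than the allowed age.

    Both timestamps are seconds since the same epoch.
    """
    return now - last_updated > max_age_seconds
```

Comparing against the same time a week earlier, rather than against a fixed threshold, lets the check absorb normal daily and weekly traffic cycles while still catching both sudden drops and sudden spikes.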

Lastly, techniques will not cover everything: kick-ass operational engineers are essential, as is close interaction with the development team.

Jake Loomis

Yahoo!

Jake Loomis is currently a VP of Service Engineering at Yahoo!, where he pioneered Yahoo!’s efforts to be consistently reliable in a fast-growing, rapidly changing environment. He is a contributor to O’Reilly’s Web Operations book, drawing on his experience of owning operational responsibility for many widely varied Yahoo! applications, including Yahoo! Mail, Yahoo! Messenger, Flickr, Yahoo! Finance, www.yahoo.com, and numerous others.

Comments

Justin Brodley
06/15/2011 4:02pm PDT

Tough Q&A session but great insights! Thanks for opening the kimono about failures in your org.
