What do you do when something goes horribly wrong in production? Of course we hope that it never happens, but there are occasions when mistakes occur or something unexpected comes up, your servers start chewing memory and failing to complete connections, and everything is going to hell. At the Guardian, our CMS has a number of architecture decisions that allow us to recover from almost all forms of failure, and we’ll detail how some of these work and why we made them work the way we chose to. Once you’ve managed to patch the system into a state from which it can recover, the next vital task is to reason out why it happened and how to fix it. There is a method that we use when addressing serious site failures, and a number of tools and approaches that you can use after the fact to reconstruct what happened and trace back in time.
Michael Brunton-Spall is the Developer Advocate for the Guardian. He has worked at the Guardian for three years, helping to build and scale the website. He has spent a lot of time helping to set up and run the platform team that manages internal, behind-the-scenes performance and scalability issues. As a Developer Advocate, Michael speaks at conferences, organises conferences, supports users of the APIs and does training.
Lisa van Gelder is one of the Guardian’s senior web developers. Lisa has been developing software for 12 years and has been involved in building and scaling the Guardian’s main website as well as the comments system. Lisa has worked closely with Operations to diagnose and debug apps in production and is experienced in supporting the cleanup and diagnosis of major performance issues.
Comments
Very interesting, particularly the ideas of killing misbehaving apps and having them restart in 60 seconds.