Tizra Blog

Showing posts from April, 2011

Time to Fire the Sysadmin? What We're Doing About the AWS Outage

The downtime brought about by the massive failure at Amazon Web Services has now agonizingly stretched well into a second day, causing us to question practically everything we thought we knew about hosting web applications. Lurking behind it all is the nagging anxiety that maybe we should go back to a simpler time, when men were men, and ran their own machine rooms. Then come the flashbacks, and we remember what that was really like. All it took was a careless backhoe operator…

The truth is that for all the pain and, yes, embarrassment, the answer is not to turn back the clock. The answer is, as it almost always is in these situations, to use this experience to build something better. Something that leverages the best of the new tools, with a deeper understanding of their risks.

Another way to put it: Don't fire the sysadmin while he's trying to fix the servers. Keep cool and get the crisis resolved, then do a full post-mortem to squeeze every drop ...