Posts

Showing posts with the label AWS

More Eggs in More Baskets: How the AWS Outage Made Us Stronger

Like a lot of web companies, we learned some hard lessons from the Amazon Web Services outage of a few weeks ago. We didn't lose a single byte of data, but we resolved never again to put one service provider, no matter how large and diversified, in a position where its failure could cause a serious interruption in service for our customers. As promised, we've now finished setting up automated data backup and redundant server infrastructure in facilities maintained by a completely separate company: Softlayer Technologies. Like AWS, Softlayer maintains the high security and reliability standards we require, including SAS 70 Type II Certification and PCI DSS Compliance. And their Texas location adds geographic diversity to the Virginia and California regions Amazon gives us access to. This is in no way the end of our efforts to improve reliability and security. We'll keep refining backup, failover and recovery processes to ensure not only that our c...

Time to Fire the Sysadmin? What We're Doing About the AWS Outage

The downtime brought about by the massive failure at Amazon Web Services has now agonizingly stretched well into a second day, causing us to question practically everything we thought we knew about hosting web applications. Lurking behind it all is the nagging anxiety that maybe we should go back to a simpler time, when men were men, and ran their own machine rooms. Then come the flashbacks, and we remember what that was really like. All it took was a careless backhoe operator… The truth is that for all the pain and, yes, embarrassment, the answer is not to turn back the clock. The answer is, as it almost always is in these situations, to use this experience to build something better. Something that leverages the best of the new tools, with a deeper understanding of their risks. Another way to put it: Don't fire the sysadmin while he's trying to fix the servers. Keep cool and get the crisis resolved, then do a full post-mortem to squeeze every drop ...