Time to Fire the Sysadmin? What We're Doing About the AWS Outage


The downtime brought about by the massive failure at Amazon Web Services has now agonizingly stretched well into a second day, causing us to question practically everything we thought we knew about hosting web applications.

Lurking behind it all is the nagging anxiety that maybe we should go back to a simpler time, when men were men, and ran their own machine rooms.  Then come the flashbacks, and we remember what that was really like.  All it took was a careless backhoe operator…

The truth is that for all the pain and, yes, embarrassment, the answer is not to turn back the clock.  The answer is, as it almost always is in these situations, to use this experience to build something better.  Something that leverages the best of the new tools, with a deeper understanding of their risks.

Another way to put it: Don't fire the sysadmin while he's trying to fix the servers.  Keep cool and get the crisis resolved, then do a full post-mortem to squeeze every drop of learning you can from the experience.

In that spirit, we wanted to share our plans for preventing and recovering from future outages.  The truth is these have been in progress for a while, but you can bet they'll now be exposed to a whole new level of scrutiny and outrank all other priorities until they're complete.

Bad Case Scenario

For starters, we will of course continue to follow the recommended Amazon Web Services practice by maintaining backups that replicate our data and software across multiple Availability Zones.  Availability Zones are, according to AWS documentation, designed so they do not share common points of failure, such as generators and cooling equipment.  In addition, they are physically separate, so "even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone."

Moving data across Availability Zones is fast and is supported by powerful AWS snapshot capabilities, so it's possible to make very frequent backups and to recover quickly.

In the Bad Case Scenario where an Availability Zone fails, we'll be able to spin up new application and database servers in a separate Availability Zone, recover files from a recent backup snapshot stored in Amazon's highly stable S3 infrastructure, connect to an already-running live database backup, and be back online in less than half an hour.  We'll use the AWS Elastic IP feature to eliminate the DNS propagation issues that can sometimes delay restoration of access for some users.
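To make the order of operations concrete, here is a minimal sketch of that zone failover as a runbook.  It makes no AWS calls.  The Zone type, the function names, and the zone names in the usage note are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Zone:
    """Hypothetical record of an Availability Zone and its health."""
    name: str
    healthy: bool


def pick_recovery_zone(zones: List[Zone], failed: str) -> Optional[Zone]:
    """Return the first healthy zone other than the failed one, or None."""
    for zone in zones:
        if zone.name != failed and zone.healthy:
            return zone
    return None


def zone_failover_steps(target: Zone) -> List[str]:
    """Ordered steps for the Bad Case Scenario, following the text above."""
    return [
        f"launch application and database servers in {target.name}",
        "restore files from the most recent snapshot stored in S3",
        "connect to the already-running live database backup",
        "reassign the Elastic IP to the new application server",
    ]
```

For example, with zones "zone-a" (failed) and "zone-b" (healthy), pick_recovery_zone returns zone-b and zone_failover_steps lists the four steps in order.  If no healthy zone remains, the function returns None, which is the cue to move on to the next scenario.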

But as we've seen, redundancy across Zones is not always enough.  While AWS maintains this is extraordinarily unlikely, the recent outage took out multiple Zones, which brings us to the next scenario.

Worse Case Scenario (like the current one)

In addition to maintaining redundancy across Availability Zones, we will also do so across Regions.  AWS maintains five Regions (each containing multiple Availability Zones) around the world.  There's one on the East Coast and another on the West Coast of the US.  To protect against the failure of an entire Region, we will maintain a live database backup on the opposite coast, and a complete file backup updated at least nightly (unfortunately, AWS snapshots cannot currently be made across Regions).

If all the Availability Zones on one coast go down, we'll start up pre-configured application and database servers on the other one, connect to the live database backup, and restore from the nightly file backup.  It will take somewhat longer, since snapshots and the Elastic IP feature will not be available.  Also, content added since the last backup will not be available until the failed Region is restored.  (Even in the current case, it does not appear that data has been permanently lost.)

Even so, we should be able to get back online within an hour.
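How much content is temporarily out of reach depends on when the last nightly file backup ran.  As a rough sketch (the function name and the timestamps in the usage note are hypothetical), the exposure window is simply the gap between the last backup and the moment of failure:

```python
from datetime import datetime, timedelta


def unavailable_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Span of time whose file changes are missing from the nightly backup."""
    if failure < last_backup:
        raise ValueError("failure time precedes the last backup")
    return failure - last_backup
```

With a backup taken at 2:00 AM and a Region failure at 1:00 PM the same day, the window is eleven hours.  Files added in that span would be missing from the restored site until the failed Region comes back.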

Worst Case Scenario

So what if something happens to AWS as a whole, or they somehow lose both coasts?  For that case, we'll maintain what's called a company-diverse backup plan.  That is, we'll maintain server infrastructure with another provider completely separate from Amazon.  We'll keep nightly database dumps and file backups on servers there.  If the AWS data is truly gone from both coasts with no warning, up to a day's data could be lost.  There would also be some delays in restoring service, since we'd be dealing with real, rather than virtual, hardware and real IP address changes, but it should still be possible to be back online within a day.
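Taken together, the three scenarios amount to a lookup from the scope of the failure to a recovery location and a target time.  The sketch below encodes the targets stated in this post (half an hour, an hour, a day).  The FailureScope and RECOVERY_PLANS names are hypothetical, not part of any AWS tooling.

```python
from datetime import timedelta
from enum import Enum
from typing import Tuple


class FailureScope(Enum):
    """How much of the hosting infrastructure has been lost."""
    ZONE = "zone"          # Bad Case: one Availability Zone
    REGION = "region"      # Worse Case: every Zone in a Region
    PROVIDER = "provider"  # Worst Case: AWS as a whole, or both coasts


# Recovery location and target time to be back online, per the text.
RECOVERY_PLANS = {
    FailureScope.ZONE: ("another Availability Zone", timedelta(minutes=30)),
    FailureScope.REGION: ("the Region on the opposite coast", timedelta(hours=1)),
    FailureScope.PROVIDER: ("a separate hosting provider", timedelta(days=1)),
}


def recovery_plan(scope: FailureScope) -> Tuple[str, timedelta]:
    """Look up where to recover and how long it should take."""
    return RECOVERY_PLANS[scope]
```

Calling recovery_plan(FailureScope.REGION) returns the opposite-coast Region with a one-hour target.  Keeping the table in one place makes it easy to check that each wider failure has a fallback outside the thing that failed.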

Then we just have to worry about the guy with the backhoe.


