Tech Blog :: Yesterday's Cloud Collapse

Apr 22 '11 10:33am

Amazon's EC2 cloud hosting system went down for several hours yesterday. I first noticed the disruption because my dotCloud instances (which I've been playing with for Drupal feasibility) stopped responding. Then a server I'm running on a totally different hosting service went down at the same time. (It turns out that was just a coincidence; that service does not run on EC2, according to their support staff.) Anyway, the simultaneous outage made me suspect something bigger was going on, so I googled "cloud outage" and found a breaking CNN story.

Mashable explains the problem exposed by the outage: EC2 is supposed to be redundant across multiple "Availability Zones," but a cascading failure still managed to bring down the whole system. That article links to a more detailed explanation of what happened.

I expect there's going to be a knee-jerk reaction now among some [mostly old-school] sysadmins away from the cloud, back to co-locating physical servers in a data center. But I worked at a company that hosted dozens of sites that way, and when the data center had a fire and lost power, their sites all went down for days. The cloud is just an abstraction layered on top of physical machines. It's an abstraction that allows for tremendous efficiency, cost savings, and redundancy. But physical failures (of power or connectivity) can still bring any infrastructure down.

One notable EC2-based service that was not disrupted (according to Mashable at least) was Netflix, because they built sufficient redundancy to handle an entire data center's failure. That's the obvious lesson for customers of any hosting service: if 24/7/365 uptime of your service is absolutely critical, then build in massive redundancy. That applies whether you're hosting on physical servers or in the cloud. Redundancy is complicated and expensive, and, like an insurance policy, it only seems worthwhile in a crisis. So it's probably not worth the cost for most applications.
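To put rough numbers on that trade-off, here's a back-of-the-envelope sketch of my own. It assumes each copy of a service is up some fixed fraction of the time and that copies fail independently of one another. The 99.9% figure and the function names are made up for illustration, not taken from Amazon or Netflix. The independence assumption is also the weak spot: a cascading failure like yesterday's is exactly the case where it doesn't hold, so treat the result as a best case.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, replicas: int) -> float:
    """Chance that at least one of `replicas` copies is up.

    Assumes every copy has the same availability `single` (0..1) and that
    copies fail independently, which is an idealization.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # All copies are down at once with probability (1 - single) ** replicas.
    return 1.0 - (1.0 - single) ** replicas


def expected_downtime_hours(single: float, replicas: int) -> float:
    """Expected hours per year with no copy available, same assumptions."""
    return (1.0 - combined_availability(single, replicas)) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical 99.9% availability per copy.
    for n in (1, 2, 3):
        hours = expected_downtime_hours(0.999, n)
        print(f"{n} copies: about {hours:.5f} hours of downtime per year")
```

On paper, each extra copy shrinks the expected downtime dramatically, while the hosting bill and the operational complexity grow with every copy. That's the insurance-policy math in a nutshell.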

I'm also somewhat fatalistic about infrastructure in general: it's all very fragile. And the more complex and interdependent our systems become, the more points of failure we introduce. Redundancy itself is kind of two steps forward, one step back, simply because it adds complexity.