Saturday, April 23, 2011

Failure is the new Success

This week was a real eye-opener in the IT world. Amazon's EC2 cluster suffered a massive meltdown (just Google "Amazon EC2"; I won't bother with the links). The media was all over it: half the internet went down and major sites were DOA for 24-48 hours ... Massive Collapse. Like the Japanese nuclear reactors (except that the internet being down didn't actually hurt anyone except financially ... or did it ... but that's another story).

But like the Japanese disaster, there is a hidden success story. The world didn't come to an end. The Japanese reactors didn't actually explode in an apocalyptic meltdown and kill everyone on earth, turning it into the green glowing dawn of the living dead. The system was actually contained, and it was mainly the news media that focused on the drama while ignoring the real tragedy of 12,000+ dead due to a tsunami. I guess thousands dead due to a natural disaster isn't as exciting as no one actually dead in a nuclear reactor that didn't actually explode. "But it could have!" ... well, it didn't. The hidden success in the Japan disaster is that the reactors in fact stood up to a magnitude 9+ quake AND a tsunami.

It's similar with the Amazon story. The media wants us to think the cloud suffered a meltdown, taking down the internet, and that we can't trust it anymore. The reality is that Amazon worked exactly as advertised ... well, maybe off by a few 9's, but what's statistics? The reality is that a single geographical region suffered a 'network meltdown' that took down the whole region for a while; then, within 12 hours, only a single 'availability zone' was still down. As of now (2011-04-23 19:39:00 EST) Amazon is mostly but not entirely up. But it actually worked! Other zones were unaffected. A single geographic zone went dark for 48+ hours but the rest kept chugging on. And in fact it seems that all data is restored, nothing lost.
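To put a number on "off by a few 9's", here is a rough back-of-the-envelope sketch. It assumes the 48 hours of downtime mentioned above is counted against a full year, and the 99.95% figure is only an illustrative target, not a claim about what Amazon promises.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Percentage of the period during which the service was up."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


def allowed_downtime_hours(target_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime per period that still meet the target."""
    return period_hours * (1.0 - target_percent / 100.0)


if __name__ == "__main__":
    # 48 hours dark in one year
    print(round(availability_percent(48), 3))
    # Downtime budget for an illustrative 99.95% target
    print(round(allowed_downtime_hours(99.95), 2))
```

So a two-day outage in one zone lands at roughly "two nines" for anyone who lived only in that zone, which is the whole argument for not living in only one zone.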

So who suffered? That vast number of 'mainstream internet sites taken down'? Well, the ones that suffered were the ones that didn't plan for failure. Netflix is based on Amazon but kept on chugging because it was designed for the conditions Amazon advertises. That is, a single geographic region might actually die, so don't put all your chickens in one basket. The people that didn't understand that were hurt. Those that embraced failure rode the wave. Even those that didn't embrace the failure were just down for a while, and Amazon restored their data and servers from redundant storage ... it just took a while.
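As a minimal sketch of what "planning for failure" can look like in code: keep a copy of the service in more than one region and route to the first one that answers a health check. This is not how Netflix or anyone else actually does it. The endpoint URLs, the `pick_region` helper and the health check are all hypothetical, made up for illustration.

```python
from typing import Callable, Dict, Optional

# Hypothetical endpoints, one copy of the app per region (illustration only).
ENDPOINTS: Dict[str, str] = {
    "us-east-1": "https://east.example.com",
    "us-west-1": "https://west.example.com",
    "eu-west-1": "https://eu.example.com",
}


def pick_region(endpoints: Dict[str, str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first region whose endpoint passes the health check.

    Regions are tried in the order they appear in `endpoints`.
    Returns None if every region is down.
    """
    for region, url in endpoints.items():
        try:
            if is_healthy(url):
                return region
        except Exception:
            # A health check that blows up counts as a dead region.
            continue
    return None


if __name__ == "__main__":
    # Simulate the preferred region going dark.
    down = {"https://east.example.com"}
    print(pick_region(ENDPOINTS, lambda url: url not in down))
```

In real life `is_healthy` would be an HTTP probe with a timeout, and the hard part is keeping the data in the other regions current, but the shape of the idea is the same: assume any one region can vanish.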

So what's to learn? The fact is there is no totally perfect system. I don't think any company could build a data center better than Amazon ... Google and Microsoft might equal it, but no one's perfect. Failures are going to happen. Period.
Amazon was designed for geographic redundancy and exposes the necessary APIs to take advantage of it. If you don't, it's your fault.
Well, maybe ...

Maybe not. The whole Cloud Computing concept is based on redundancy and trust in large-scale distributed systems. It's the outsourcing of the IT department. Why should individual developers who subscribe to a cloud system be required to manage this? While I agree that Amazon performed exactly as it advertised, and that if you planned for it and took advantage of its capabilities you'd have ridden out the storm just fine ... it points to a weakness in the system. I argue that cloud computing should hide this from you. It's an artificial artifact that a "virtual machine" actually resides somewhere physically and that you have to care. The next generation of cloud computing should hide this from the users, much like Amazon S3 storage hides its physical location. The EC2 and EBS systems should be able to migrate to different geographic locations without the programmer having to architect dynamic load balancing, fault tolerance and hot-swap failover. Isn't that the whole point of "the cloud"? To let the 'big boys' figure that out and leave us to writing apps?

Maybe that's the real point of clouds like Google Cloud Computing? It's time Amazon woke up and accepted that it's awesome but still needs to go the extra mile. If the biggest web sites crashed because they failed to make use of the advanced features of cloud computing, maybe it's time to make those features less "advanced".

2 comments:

David Lee said...

Very good post-mortem

http://aws.amazon.com/message/65648/

Tim Finney said...

Ahem. The reactors did explode. Just like Chernobyl -- not a nuclear reaction but a chemical one. The end result was the same in both cases: ruptured containment vessels, a Very Bad Thing. (Having fuel rods stored above the containment vessel wasn't a great idea. Someone is going to have to go and pick those up. I wonder how many of them broke?)