Amazon S3 suffers an 8hr outage; Trust no one

This is rather timely considering my talk last week at London's CloudCamp. I was the last one to speak. I spoke about the need to take responsibility for your cloud infrastructure and not rely on any of the marketing hype from any of the companies. I painted real-world stories and explained how we utilise two providers (Amazon and Flexiscale) for our cloud requirements to insure against a complete blackout.

The previous speakers all toed the "cloud computing is great" company line, with "your eggs are safe in our basket" types of pitches. That included a quick talk from Martin Buhr from Amazon. He took the usual route of explaining how great Amazon was at scaling out very quickly, demonstrating how some projects scaled up to 1000 instances in some cases.


But the problem with these examples is that by and large they are artificial. Many do not need to do this one-off 'batch' processing. If one or two instances out of 1000 go down then who cares? Or even if the network goes down, who cares, the processing will still happen behind Amazon's firewall. Even if, in the worst case, the system dies completely, they can simply restart the batch.

But for users who demand as near to 100% uptime as possible, the cloud story gets a whole lot cloudier.

Only a couple of weeks ago, Amazon's SimpleDB suffered a short outage, resulting in all queries returning null. Back in February 2008, S3 suffered an outage. This weekend, Amazon has once again been down for nearly 8 hours.

As you can imagine, their forums are alight with lots of angry customers, although their focus is misguided. Amazon is not at fault; they are.

Just because you use S3 or EC2, why do you assume all the usual rules of server management go out the window? They still apply. If you were hosting your own file space, then you would have it backed up, and if it was crucial to your business, then you would have an alternative.

If, like the vast majority, you fly by the seat of your pants, relying on a single thing, then expect your pants to catch fire every so often. Blame no one but yourself.

As I illustrated in my CloudCamp talk, outages can and do happen. So it is up to you to figure out what to do when things go horribly wrong. In this particular instance, no amount of support money to Amazon would have fixed this problem any quicker. So save yourself the support cost, and instead, build out an alternative path should your primary cloud provider take a nose dive.
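To make the "alternative path" idea a little more concrete, here is a minimal sketch of one way it could look for file storage: every write goes to each provider, and reads fall through to the next provider when one is down. This is not our production setup. The names InMemoryStore, RedundantStore and ProviderDown are all made up for illustration, and the in-memory stores simply stand in for whatever real provider clients you would wrap.

```python
class ProviderDown(Exception):
    """Raised when a provider (or every provider) cannot serve a request."""


class InMemoryStore:
    """Hypothetical stand-in for a real storage provider's client."""

    def __init__(self, name):
        self.name = name
        self.available = True   # flip to False to simulate an outage
        self._objects = {}

    def put(self, key, data):
        if not self.available:
            raise ProviderDown(self.name)
        self._objects[key] = data

    def get(self, key):
        if not self.available:
            raise ProviderDown(self.name)
        return self._objects[key]   # KeyError if this provider lacks the key


class RedundantStore:
    """Writes to every provider; reads from the first one that can answer."""

    def __init__(self, providers):
        self.providers = list(providers)

    def put(self, key, data):
        stored = 0
        for provider in self.providers:
            try:
                provider.put(key, data)
                stored += 1
            except ProviderDown:
                continue   # a real system would queue this write for replay
        if stored == 0:
            raise ProviderDown("no provider accepted the write")
        return stored      # number of providers now holding the object

    def get(self, key):
        for provider in self.providers:
            try:
                return provider.get(key)
            except (ProviderDown, KeyError):
                continue   # provider is down or missed the write; try the next
        raise ProviderDown(f"no provider could serve {key!r}")


primary = InMemoryStore("primary")
secondary = InMemoryStore("secondary")
store = RedundantStore([primary, secondary])

store.put("report.txt", b"quarterly numbers")
primary.available = False            # simulate the primary provider's outage
print(store.get("report.txt"))       # still served, this time by the secondary
```

The sketch deliberately ignores the hard parts, such as replaying writes a provider missed while it was down and keeping the copies consistent. Those are exactly the usual server management problems that do not go away just because the disks belong to someone else.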

Trust no one. Use all your years of server hosting experience and apply it to the cloud.

Incidentally, I will be talking at the Cloud Computing Expo in San Jose in November, and in that session I will go over the logistics of building a truly redundant cloud architecture.

