On Clouds and SPOF’s (or the Great AWS Outage of April 2011)


Just a couple of days after posting about cloud native applications Amazon raised the bar by having some issues in one of their data center regions.  These issues primarily affected EBS and RDS from what I’ve read.  So, pretty much everything one way or another since using AWS EC2 without EBS in any form for most applications that exist today is a little wacky for most folks.  This is because your EC2 AMI won’t persist through a reboot in the absense of the use of EBS.  Most folks have not reached the operational nirvana yet of full automated configuration management and application fault tolerance that makes this acceptable for them.

What level of SPOF (Single Point of Failure) are you are willing to tolerate.  So, I wanted to “scale up” the idea of the SPOF then bring it back down again.  Here we go.

If the earth stops working, so will your web application (admittedly there might be some satellite networks that don’t have this problem... but who cares at that point?)

So, let’s keep going.  Each of these is a potential single point of failure.

Earth > Continent > Country > State/Region > City > Neighborhood > Building > Floor > Room > Rack > Server > Server Component

And, at each tier, there are numerous dependencies and contexts to keep your service running at any given time.  There are the obviously ones like the above example where if the earth explodes the neighborhood is pretty much shot to hell also.  But, that’s obvious.  It’s gets less obvious when you dig deeper into the data center and see that there are 5 servers so that’s okay right?  Maybe. Maybe not. If it is something like.

Dynamic Name Service > Load Balancer > Web Server > Application Server > Database Server

Then those 5 servers/services might be in that one rack per data center per room per building per neighborhood per city per state per country per continent per planet is looking pretty vulnerable.  In the grand scheme of things the loss of one power supply in one machine could impact the entire planet’s capacity to retrieve whatever is on that DB that is so globally important; like a picture of your kid making a funny face on his 2nd birthday.

Do you think it is Amazon AWS’s fault if you put that database on one server in one rack in one place with no reasonable SLA and it goes away forever?  Not so much.  You are accountable and responsible.  You made that choice.

Now, how can we change this for the better?  We can develop applications that are able to tolerate the loss of a single point of failure at a sufficient granuality (Earth is a bit extreme today) such that our applications keep running when bad things like the AWS outage occur.  I call these Cloud Native Applications.  They have certain traits that should look a little familiar to cloud folks.

You cannot create a cloud native application doing things the same way you always have before.  It simply will not work.  The necessary software architecture and systems architecture has changed if you want your application to run on the cloud w/ no SPOFs.

Just needed to get that off my chest.  Some related links for good reading:

http://blog.basho.com/2011/04/21/Amazons-outage-proves-riaks-vision/

http://www.thestoragearchitect.com/2011/04/22/so-your-aws-based-application-is-down-dont-blame-amazon/

http://highscalability.com/blog/2011/4/22/stuff-the-internet-says-on-scalability-for-april-22-2011.html

http://www.infoq.com/news/2011/04/amazon-ec2-outage

And if your REALLY keen to write some CNA’s (contact me) and read...

http://www.infoq.com/presentations/Actor-Thinking
http://www.infoq.com/presentations/1000-Year-old-Design-Patterns