Just a couple of days after I posted about cloud native applications, Amazon raised the bar by having some issues in one of their data center regions. From what I’ve read, these issues primarily affected EBS and RDS. So, pretty much everything one way or another, since running AWS EC2 without EBS in any form is a little wacky for most applications that exist today. This is because, in the absence of EBS, the local (instance store) data on your EC2 instance won’t survive the instance being terminated or its host failing. Most folks have not yet reached the operational nirvana of fully automated configuration management and application fault tolerance that makes this acceptable for them.
What level of SPOF (Single Point of Failure) are you willing to tolerate? To answer that, I wanted to “scale up” the idea of the SPOF, then bring it back down again. Here we go.
If the earth stops working, so will your web application (admittedly there might be some satellite networks that don’t have this problem... but who cares at that point?)
So, let’s keep going. Each of these is a potential single point of failure.
Earth > Continent > Country > State/Region > City > Neighborhood > Building > Floor > Room > Rack > Server > Server Component
And, at each tier, there are numerous dependencies and contexts that keep your service running at any given time. There are the obvious ones, like the example above: if the earth explodes, the neighborhood is pretty much shot to hell also. It gets less obvious when you dig deeper into the data center and see that there are 5 servers, so that’s okay, right? Maybe. Maybe not. If it is something like:
Dynamic Name Service > Load Balancer > Web Server > Application Server > Database Server
Then those 5 servers/services, all sitting in one rack in one room in one data center in one building in one neighborhood in one city in one state in one country on one continent on one planet, are looking pretty vulnerable. In the grand scheme of things, the loss of one power supply in one machine could impact the entire planet’s capacity to retrieve whatever is on that DB that is so globally important, like a picture of your kid making a funny face on his 2nd birthday.
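To make that concrete, here is a minimal Python sketch of the idea, not anything AWS provides. It assumes each server is described by a made-up path of labels through the tiers listed earlier (I left out Server Component to keep it short), and it walks down from Earth until the paths diverge. Every tier the servers still share at that point is a single point of failure for the whole group. The function name and all the labels are hypothetical.

```python
# Tiers from the chain above, largest first (Server Component omitted).
TIERS = ("earth", "continent", "country", "region", "city",
         "neighborhood", "building", "floor", "room", "rack", "server")


def shared_failure_domains(placements):
    """Return the (tier, label) pairs that every placement has in common.

    Each placement is a tuple of labels, one per entry in TIERS.
    The walk stops at the first tier where the placements differ, so the
    result is the common prefix of all the location paths.
    """
    shared = []
    for depth, tier in enumerate(TIERS):
        labels = {path[depth] for path in placements}
        if len(labels) != 1:
            break
        shared.append((tier, labels.pop()))
    return shared


# Hypothetical placement: all five services from the chain in one rack.
RACK = ("earth", "continent-1", "country-1", "region-1", "city-1",
        "nbhd-1", "bldg-1", "floor-1", "room-1", "rack-1")
SERVICES = ("dns", "load-balancer", "web", "app", "db")
one_rack = [RACK + (name,) for name in SERVICES]

if __name__ == "__main__":
    for tier, label in shared_failure_domains(one_rack):
        print(f"{tier}: {label}")
```

For the one-rack layout this lists every tier from earth down to rack. Move one copy of the database to a different region and the list stops at country, which is a much more comfortable place for it to stop.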
Do you think it is Amazon AWS’s fault if you put that database on one server in one rack in one place with no reasonable SLA and it goes away forever? Not so much. You are accountable and responsible. You made that choice.
Now, how can we change this for the better? We can develop applications that are able to tolerate the loss of a single point of failure at a sufficient granularity (Earth is a bit extreme today) such that our applications keep running when bad things like the AWS outage occur. I call these Cloud Native Applications. They have certain traits that should look a little familiar to cloud folks.
You cannot create a cloud native application doing things the same way you always have before. It simply will not work. The necessary software architecture and systems architecture have changed if you want your application to run on the cloud with no SPOFs.
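As one small illustration of what tolerating a lost failure domain can look like in application code, here is a hedged Python sketch. It assumes the same data is already replicated to stores in independent zones (the replication itself is the hard part and is not shown), and it simply tries each replica in turn until one answers. The zone names, the helper functions, and the in-memory "replicas" are all stand-ins I made up for the example, not a real AWS API.

```python
class AllReplicasFailed(Exception):
    """Raised when no replica in any zone could serve the request."""


def read_with_failover(replicas, key):
    """Try each (zone, fetch) pair in order; return the first success.

    Returns a (zone, value) tuple so the caller can see who answered.
    """
    errors = []
    for zone, fetch in replicas:
        try:
            return zone, fetch(key)
        except ConnectionError as exc:
            # This zone is down or unreachable; remember why and move on.
            errors.append(f"{zone}: {exc}")
    raise AllReplicasFailed("; ".join(errors))


def make_replica(data, up=True):
    """Build a fake replica: a function that serves `data` unless it is down."""
    def fetch(key):
        if not up:
            raise ConnectionError("replica unreachable")
        return data[key]
    return fetch


# Hypothetical setup: the same photo stored in two independent zones,
# with the first zone currently having a bad day.
PHOTOS = {"birthday.jpg": b"funny face"}
replicas = [
    ("zone-a", make_replica(PHOTOS, up=False)),
    ("zone-b", make_replica(PHOTOS)),
]

if __name__ == "__main__":
    zone, photo = read_with_failover(replicas, "birthday.jpg")
    print(f"served from {zone}: {photo!r}")
```

The point is not the dozen lines of retry logic. It is that the application expects a zone to disappear and has somewhere else to go when it does.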
Just needed to get that off my chest. Some related links for good reading:
http://blog.basho.com/2011/04/21/Amazons-outage-proves-riaks-vision/
http://www.thestoragearchitect.com/2011/04/22/so-your-aws-based-application-is-down-dont-blame-amazon/
http://highscalability.com/blog/2011/4/22/stuff-the-internet-says-on-scalability-for-april-22-2011.html
http://www.infoq.com/news/2011/04/amazon-ec2-outage
And if you’re REALLY keen to write some CNAs, contact me and read...
http://www.infoq.com/presentations/Actor-Thinking
http://www.infoq.com/presentations/1000-Year-old-Design-Patterns
Update on 2011-04-25 02:39 by Kent Langley
I follow George Reese on Twitter and just ran across a tweet about this post. I thought it was worth noting:
The AWS Outage: The Cloud’s Shining Moment
http://broadcast.oreilly.com/2011/04/the-aws-outage-the-clouds-shining-moment.html
While I do not necessarily agree with everything posted there, I do like George's way of thinking. I would say that he said it all in the last sentence.
"These kinds of failures don't expose the weaknesses of the cloud—they expose why the cloud is so important."
Update on 2011-04-26 21:03 by Kent Langley
And, another good write up imho.
http://stu.mp/2011/04/the-cloud-is-not-a-silver-bullet.html
Key point?
"A lot more effort needs to go into documenting proper cloud architecture. The cloud has changed the game. There are new tools and, as a result, new ways of building systems in the cloud. Case studies, diagrams, approved tools, etc. should all be highlighted, documented, and preached about accordingly."
He says that "The Cloud is not a silver bullet..." I certainly agree. But it is a gold mine of opportunity if you choose to avail yourself of its strengths and deploy cloud native applications, as Netflix, Amazon themselves, and SmugMug appear to have done with fine results.