
On Clouds and SPOFs (or the Great AWS Outage of April 2011)


Just a couple of days after I posted about cloud native applications, Amazon raised the bar by having some issues in one of their data center regions. From what I’ve read, these issues primarily affected EBS and RDS. So, pretty much everything, one way or another, since running AWS EC2 without EBS in any form is a little wacky for most applications that exist today. This is because, in the absence of EBS, whatever you have changed on your EC2 instance lives on ephemeral instance storage and won’t survive the instance being terminated or failing. Most folks have not yet reached the operational nirvana of fully automated configuration management and application fault tolerance that makes this acceptable for them.

What level of SPOF (Single Point of Failure) are you willing to tolerate? To answer that, I wanted to “scale up” the idea of the SPOF and then bring it back down again. Here we go.

If the earth stops working, so will your web application (admittedly there might be some satellite networks that don’t have this problem... but who cares at that point?)

So, let’s keep going. Each of these is a potential single point of failure.

Earth > Continent > Country > State/Region > City > Neighborhood > Building > Floor > Room > Rack > Server > Server Component

And, at each tier, there are numerous dependencies and contexts that keep your service running at any given time. There are the obvious ones, like the example above: if the earth explodes, the neighborhood is pretty much shot to hell also. It gets less obvious when you dig deeper into the data center and see that there are 5 servers, so that’s okay, right? Maybe. Maybe not. What if it is something like this:

Domain Name System (DNS) > Load Balancer > Web Server > Application Server > Database Server

Then those 5 servers are really 5 different services, one of each, and that one rack in one data center in one room in one building in one neighborhood in one city in one state in one country on one continent on one planet is looking pretty vulnerable. In the grand scheme of things, the loss of one power supply in one machine could impact the entire planet’s capacity to retrieve whatever is on that DB that is so globally important, like a picture of your kid making a funny face on his 2nd birthday.
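To make that chain a little more concrete, here is a minimal sketch in Python that flags which tiers of a serial service chain are single points of failure. The tier names follow the chain above; the replica counts and the `Tier` and `find_spofs` names are made up for illustration, not taken from any real deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    replicas: int  # hypothetical count of independent instances in this tier


def find_spofs(chain):
    """Return the names of tiers that a single failure would take down."""
    return [tier.name for tier in chain if tier.replicas < 2]


# Replica counts are assumptions for illustration, not real measurements.
chain = [
    Tier("DNS", 2),
    Tier("Load Balancer", 1),
    Tier("Web Server", 2),
    Tier("Application Server", 2),
    Tier("Database Server", 1),
]

spofs = find_spofs(chain)
print(spofs)
```

Because every request has to pass through every tier, one tier with a single instance is enough to take the whole service down. And the replicas only count if they do not share something further up the hierarchy, like that one rack.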

Do you think it is Amazon AWS’s fault if you put that database on one server in one rack in one place with no reasonable SLA and it goes away forever? Not so much. You are accountable and responsible. You made that choice.

Now, how can we change this for the better? We can develop applications that are able to tolerate the loss of a single point of failure at a sufficient granularity (Earth is a bit extreme today) such that our applications keep running when bad things like the AWS outage occur. I call these Cloud Native Applications. They have certain traits that should look a little familiar to cloud folks.
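One rough way to think about "sufficient granularity" is to ask what all of your replicas still have in common. The sketch below, in Python, walks a shortened version of the hierarchy from earlier and reports every level that all replicas share. The level names, the location names, and the `shared_failure_domains` helper are hypothetical and only there to illustrate the idea.

```python
# Coarsest level first. A shortened, hypothetical subset of the hierarchy.
LEVELS = ["continent", "region", "building", "rack", "server"]


def shared_failure_domains(placements):
    """Return (level, value) pairs that every replica has in common.

    Each placement is a tuple with one value per entry in LEVELS.
    Any shared pair is a single point of failure for the whole set.
    """
    shared = []
    for index, level in enumerate(LEVELS):
        values = {placement[index] for placement in placements}
        if len(values) == 1:
            shared.append((level, values.pop()))
        else:
            # Once replicas diverge they sit in different places,
            # even if names further down happen to collide.
            break
    return shared


# Two database replicas; all names are invented for this example.
placements = [
    ("north-america", "us-east", "dc-1", "rack-7", "db-1"),
    ("north-america", "us-east", "dc-2", "rack-3", "db-2"),
]

shared = shared_failure_domains(placements)
print(shared)
```

In this made-up example the two replicas survive the loss of a rack or a building, but they still share a region and a continent. Whether that is acceptable is exactly the choice you are accountable for.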

You cannot create a cloud native application by doing things the same way you always have. It simply will not work. The necessary software architecture and systems architecture have changed if you want your application to run on the cloud with no SPOFs.

Just needed to get that off my chest. Some related links for good reading:

http://blog.basho.com/2011/04/21/Amazons-outage-proves-riaks-vision/

http://www.thestoragearchitect.com/2011/04/22/so-your-aws-based-application-is-down-dont-blame-amazon/

http://highscalability.com/blog/2011/4/22/stuff-the-internet-says-on-scalability-for-april-22-2011.html

http://www.infoq.com/news/2011/04/amazon-ec2-outage

And if you’re REALLY keen to write some CNAs, contact me and read...

http://www.infoq.com/presentations/Actor-Thinking
http://www.infoq.com/presentations/1000-Year-old-Design-Patterns

 

Update on 2011-04-25 02:39 by Kent Langley

I follow George Reese on twitter and just ran across a tweet about this post. I thought it was worth noting:

The AWS Outage: The Cloud's Shining Moment
http://broadcast.oreilly.com/2011/04/the-aws-outage-the-clouds-shining-moment.html

While I do not necessarily agree with everything posted there, I do like George's way of thinking. I would say that he said it all in the last sentence:

"These kinds of failures don't expose the weaknesses of the cloud—they expose why the cloud is so important."


Update on 2011-04-26 21:03 by Kent Langley

And, another good write up imho.

http://stu.mp/2011/04/the-cloud-is-not-a-silver-bullet.html

Key point?

"A lot more effort needs to go into documenting proper cloud architecture. The cloud has changed the game. There are new tools and, as a result, new ways of building systems in the cloud. Case studies, diagrams, approved tools, etc. should all be highlighted, documented, and preached about accordingly."

He says that "The Cloud is not a silver bullet..." I certainly agree. But it is a gold mine of opportunity if you choose to avail yourself of its strengths and deploy cloud native applications, as Netflix, Amazon themselves, and SmugMug appear to have done a fine job of.

Cloud Native Applications

I’ve always believed that cloud computing is really two things. One, it is a technology architecture. Two, it is a business operating paradigm that we often call on-demand. Your application must satisfy both the on-demand business model requirements and the technical architecture requirements to be a cloud native application. There are not very many cloud native applications running in the world today. This is changing quickly. A Cloud Native Application is architected and designed to run on what is commonly referred to as a cloud IaaS or PaaS. The words I used there are very important. It is architected and designed to run in the cloud from the beginning. Therefore, it has a number of important traits as part of its DNA. The traits that a Cloud Native Application must have are: