What Is There to Show for Five Years of Cloud Computing?

Just a few short years ago, launching a virtual machine in the cloud was simple and basic. With a couple of API calls and maybe a button click or two, you were up and running in just a few minutes. The choices were limited but it was nice. You could even get a little storage to go along with the instance. Just before that, we were leasing, renting, and co-locating dedicated physical hardware in data centers, and it took weeks to order, provision, deploy, and set up the gear. Fast forward to today and we are full-on in a cloud computing revolution that is redefining how technology is deployed. There are so many choices, and so many of them good, that it can be completely overwhelming to those trying to make sense of it all. On top of it all, every day I meet people who have never deployed anything in “the cloud.” It’s just as easy as ever to launch a machine, but there is so much more available today to the would-be client of on-demand computing as a service. I was trying to think of what really is different or better today than it was five years ago. What’s really new to show for 5+ years of cloud computing innovation and effort?

First and foremost, people figured out that cloud computing is good for something really important (meaning people other than Google). They figured out that the cloud in its various forms is phenomenal for capturing and processing what has come to be known as big data. This is a really important point. It’s never been easy, and still isn’t, to aggregate and process voluminous, high-speed, or wildly unstructured data. In fact, prior to cloud computing coming of age, it was downright impossible, both fiscally and technically. Now, it’s all there at the click of a few buttons, as pretty as you like. You can now spin up a supercomputer for just a few dollars an hour to crunch even your most gnarly data sets.

A second fairly dramatic improvement is in the orchestration of resources. There are far more resources available to orchestrate for a nearly endless number of purposes, yet doing so has never been easier (note, I did not say easy). Thanks to a widespread understanding of how to create and consume APIs, you can now quite literally launch servers with a single set of tools at several different cloud providers, in several geographic locations, and even on several operating systems if you so wanted. If you’re clever with tools like Puppet, Chef, CloudFormation, Cloud Foundry, or others, you can do it all from the comfort of your very own laptop in just a few minutes. You can quickly and, historically speaking, relatively easily compose masses of servers into useful services for nearly anything you can dream!
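
As a rough illustration of the "single set of tools" idea, here is a minimal Python sketch. Everything in it is hypothetical: `CloudDriver`, `FakeDriver`, `LaunchSpec`, and `launch_everywhere` are names made up for this example, and the fake driver only records requests rather than calling any real provider API. Real tools do far more; this only shows the shape of one interface driving many providers, regions, and operating systems.

```python
from dataclasses import dataclass
from itertools import product
from typing import Protocol


@dataclass(frozen=True)
class LaunchSpec:
    """Provider-neutral description of one server to launch."""
    region: str
    os_name: str


class CloudDriver(Protocol):
    """Hypothetical interface that each provider adapter would implement."""
    name: str

    def launch(self, spec: LaunchSpec) -> str:
        ...


class FakeDriver:
    """Stand-in adapter: records requests instead of calling a real API."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.launched = []

    def launch(self, spec: LaunchSpec) -> str:
        self.launched.append(spec)
        # A made-up server id so the caller has something to track.
        return f"{self.name}:{spec.region}:{spec.os_name}:{len(self.launched)}"


def launch_everywhere(drivers, regions, os_names):
    """Launch one server per (driver, region, OS) combination."""
    return [
        driver.launch(LaunchSpec(region, os_name))
        for driver, region, os_name in product(drivers, regions, os_names)
    ]


if __name__ == "__main__":
    drivers = [FakeDriver("provider-a"), FakeDriver("provider-b")]
    for server_id in launch_everywhere(drivers, ["us-east", "eu-west"], ["linux"]):
        print(server_id)
```

Swapping `FakeDriver` for an adapter around a real provider SDK would leave `launch_everywhere` unchanged, which is the point of orchestrating through a common interface.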

A third thing that’s changed is the raw power available via a command line or cloud console and in the newer implementations of older software architectures. You can now, in just a few moments, provision a server with 244 GiB of memory and high-speed 10 Gigabit Ethernet. And that is just a building block of the real power. The real power comes from massive improvements in distributed computation, storage, and software-defined networking. These allow you to provision dozens to thousands of these machines more or less on a whim. Frankly, not many people can even figure out what to do with all this power, even if they do know how to provision it. This has forced software architects and engineers to push forward much faster, with zeal, and learn how to write distributed applications, and in many cases they are rising to the occasion. So, raw power in both virtualized hardware and the software that can be deployed on it has come a very long way.

In summary, cloud computing had been brewing for decades, with roots reaching far back in time. Grids, clusters, and more were all precursors. However, it is striking how far things have come in just about five years. There has been unprecedented improvement, at what feels like an ever-increasing pace. Good times indeed.

It's 2013! Things Break, Services Falter. Move Forward.

It's a new year. I have the cloud, but I still have many of the same old single points of failure.

It's well known that a single point of failure (SPOF) is a risk. It's an Achilles' heel, so to speak. That goes for people, companies, planets, AMIs, AZs, regions, countries, or beers in the fridge. Whatever processes you have for your general day-to-day work should be able to deal with known SPOFs and be flexible enough to assimilate and adjust to newly found failure modes. But, and this is important, there is a substantial cost associated with eliminating certain SPOFs. Let's say you decided that you are no longer willing to accept Earth as an SPOF for your awesome blog. Well, in that case, you need a space program and an interplanetary network, which puts this desire out of reach unless you are NASA, Elon Musk, or Richard Branson. Admittedly, that is an extreme example, but my point is that your tolerance for risk and downtime must be considered carefully for any technology for which you have implicit or assumed service level agreements with your users. Let's think about Netflix for a moment.

Netflix's service was severely impacted this last Christmas Eve by an outage affecting AWS ELBs in the US East region. Based on my arm's-length knowledge of Netflix operations, drawn from what I've read publicly, Netflix understands the cost/benefit of using AWS far better than most organizations, in my opinion. They say so themselves in a recent post:

"Our strategy so far has been to isolate regions, so that outages in the US or Europe do not impact each other."

"Netflix is designed to handle failure of all or part of a single availability zone in a region as we run across three zones and operate with no loss of functionality on two." Source: http://techblog.netflix.com/2012/12/a-closer-look-at-christmas-eve-outage.html

Netflix clearly understands the risk and has still chosen to take it. They were completely at the mercy of AWS in this last outage, since the failure was regional in nature and their systems do not yet allow for multi-regional failover within a country for a single user account or group of accounts. They are working on it.
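
The three-zone design in the quote above implies some simple capacity arithmetic, sketched here in Python. The function names and the peak-load figure are assumptions for illustration, not Netflix's numbers. To lose one of three zones with no loss of functionality, the two survivors must carry the full peak, so each zone needs roughly half of it, or about 1.5 times the bare minimum in total.

```python
import math


def instances_per_zone(peak_instances: int, zones: int, zones_lost: int = 1) -> int:
    """Instances each zone needs so the surviving zones still cover peak load."""
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    return math.ceil(peak_instances / surviving)


def overprovision_ratio(peak_instances: int, zones: int, zones_lost: int = 1) -> float:
    """Total provisioned capacity divided by what peak load strictly needs."""
    total = zones * instances_per_zone(peak_instances, zones, zones_lost)
    return total / peak_instances


if __name__ == "__main__":
    peak = 300  # hypothetical peak load, measured in instances
    print(instances_per_zone(peak, zones=3))   # 150 instances in each zone
    print(overprovision_ratio(peak, zones=3))  # 1.5 times the bare minimum
```

The same arithmetic shows why surviving a whole-region failure costs more: the spare capacity has to live in another region, not just another zone.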

As an AWS client, they do have a reasonable expectation that the underlying primitives they use from AWS to compose their services will work reliably. In this case, that primitive was the Elastic Load Balancer. Much as an instance launched from an AMI is a virtual server, an ELB is something of a virtual load balancer. In VPCs, ELBs can span AZs, but then the ELB itself is an SPOF unless your service is capable of re-initializing an ELB dynamically when it ceases to serve its purpose and then re-routing traffic accordingly. This is non-trivial, but it can likely be dealt with if you understand the various intricacies of geo-aware, anycast-backed DNS services.
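
To make the re-routing idea concrete, here is a minimal Python sketch of health-check-driven failover. It is an illustration under stated assumptions, not how AWS or Netflix do it: `Endpoint`, `pick_healthy`, and `route` are hypothetical names, the addresses are placeholders, and the health check is a plain callable standing in for a real probe.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Endpoint:
    """A hypothetical load balancer endpoint, for example one per region."""
    name: str
    address: str


def pick_healthy(
    endpoints: Iterable[Endpoint],
    is_healthy: Callable[[Endpoint], bool],
) -> Optional[Endpoint]:
    """Return the first endpoint that passes its health check, else None."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


def route(endpoints, is_healthy, current=None):
    """Keep the current endpoint while it is healthy; otherwise fail over."""
    if current is not None and is_healthy(current):
        return current
    replacement = pick_healthy(endpoints, is_healthy)
    if replacement is None:
        raise RuntimeError("no healthy endpoint; a new one must be created")
    return replacement


if __name__ == "__main__":
    primary = Endpoint("primary", "lb-east.example.com")
    standby = Endpoint("standby", "lb-west.example.com")
    down = {"primary"}  # simulate the primary load balancer failing
    chosen = route([primary, standby], lambda e: e.name not in down, current=primary)
    print(chosen.address)  # lb-west.example.com
```

In a real deployment the result of `route` would be published to clients, typically through DNS, and the `RuntimeError` branch is where a replacement load balancer would be provisioned.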

Someone asked me if the AWS outages of 2012 would make me re-think my plans for cloud computing in 2013. They do not change my cloud plans for 2013 in any way. But, to be clear, even though I really like AWS, AWS is not the cloud and the cloud is not AWS. AWS is a big and deeply important part of the cloud ecosystem. I'm quite thankful for all they've done to further the understanding of cloud around the world. They are likely to stay on top, from my point of view, for a long while. My teams and I deployed a large amount of infrastructure on AWS in 2012, supporting the services of numerous clients.

I don't think these outages will cause any meaningful pause in most cloud plans for 2013. Anyone who takes the time to understand these sorts of situations, doesn't just fall prey to FUD (Fear, Uncertainty, Doubt), and really is serious about moving to a cloud computing model will keep marching forward. It's not perfect, but the benefits to business and technical agility far outweigh the risks and the knowledge ramp-up investment necessary to make full use of cloud computing.

Things break and outages happen. There are very few systems where this is not true, and those systems have been designed specially to deal with an extreme need for continuous availability. Complex systems, and especially systems deployed at a large scale like AWS, can break in interesting ways. It's not so much that things break that is so bad. It is what is done next that matters, to keep the same things from breaking again and again. AWS does a pretty good job on this front, in my opinion. It performs, communicates, and adjusts far better than most hosting providers I have worked with over the last 15 years or so. They have raised the bar substantially.

Regarding AWS's IaaS services: it is AWS's job to provide a reasonable SLA and maintain it. It is up to users of those services to provide their own users with services that have a reasonable SLA, and to maintain it. Decoupling the service from the server is at the heart of the accelerating innovation in hosting of internet-connected services that began quite some time ago and now marches under the banner of cloud computing. Now, if you use their PaaS services, it's a bit of a different situation, but that's the subject of a whole different discussion, I suspect.

Supporting information, a blast from ProductionScale's blog past, can be found in the following older posts of mine (in no particular order):

The Traits of a Modern IT Organization, 8/2008
Thoughts on the Business Case for Cloud Computing, 4/2009
Get Your Head in the Clouds, 4/2008
Why Should Businesses Bother with Cloud Computing, 3/2009