IT

Factoring Complexity

For many years now, when building scalable and highly available computing infrastructures, I've been doing something I call Factoring Complexity. Much of this work is well served by agile project methods and ITSM concepts (such as ITIL). There may be a better name for it, but I'll stick with this one for now.

I do this to achieve the least common denominator systems architecture that efficiently provides the required business service. This includes processes, people, computers, and code. To understand what I mean by this and how I approach it, two cross-discipline definitions are required. I borrowed them from math classes long ago.

From www.algebrahelp.com, a definition of factoring reads as follows:

Factoring is an important process in algebra which is used to simplify expressions, simplify fractions, and solve equations.

I typically aim to factor complexity out of infrastructure, systems, and software in general, reducing to the lowest number of interconnections between components that can reliably perform the task at hand (serving a website, say) while still providing appropriate scalability and performance.

In the context of building web systems, Factoring Complexity serves several purposes. Two that stand out are resource application (time and money) and maintainability. From a business perspective it is important to use the appropriate level of resources to solve any particular business problem. From a practical point of view, things should be maintainable. The successful output of a round of Factoring Complexity is usually fewer connections between components. It is the least complex system that can adequately provide for any particular business need.

From Wikipedia, the figurative sense of least common denominator is defined as follows:

The term is used figuratively to refer to the "lowest"—least useful, least advanced, or similar—member of a class or set which is common to things that relate to members of that class.

When attempting to factor complexity I strive to focus on a few key things: documentation, relationships, and advance planning. I'll take each of these in turn. But first, let me show a simple visual example of what I am talking about.

Figure 1 is a rather complex system with a lot of interconnections. Figure 2 is a less complex system with far fewer interconnections. Figure 3 is a very simple system with only one interconnection. I illustrate it this way to explain something that is often missed. The complexity is not so much in the nodes themselves but in how they interact and interconnect. The complexity increases dramatically every single time you add a node. Nodes can be software programs, development frameworks, servers, people, network connections—anything, really, that might interact in some way with some other node.
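To put a rough number on why each added node hurts so much, consider the worst case where every node can talk to every other node. Here is a purely illustrative back-of-the-envelope sketch (not drawn from any particular system):

    # Illustration only: count potential interconnections among n nodes.
    # In a fully meshed system every node can interact with every other node,
    # so the number of possible links grows roughly with the square of the node count.

    def potential_links(node_count: int) -> int:
        """Number of pairwise interconnections among node_count fully meshed nodes."""
        return node_count * (node_count - 1) // 2

    for n in (3, 5, 10, 20):
        print(f"{n:>2} nodes -> up to {potential_links(n)} interconnections")

    # Output:
    #  3 nodes -> up to 3 interconnections
    #  5 nodes -> up to 10 interconnections
    # 10 nodes -> up to 45 interconnections
    # 20 nodes -> up to 190 interconnections

Going from 10 nodes to 20 does not double the potential interconnections; it more than quadruples them. That is the complexity I am trying to factor out.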

In the context of refactoring existing web infrastructure, one way to attack overly complex systems is to focus on the interconnections. If you can begin to eliminate the need for interconnections without negatively impacting performance, scalability, availability, and capacity, then you have done an excellent thing. You have reduced the overall complexity of the system while maintaining or even improving the system's manageability and capability as a whole. Sometimes people don't believe this is possible when they first see the concept. But I assure you it is; I have personally used this approach successfully many times, professionally and personally. One might even use the number of interconnections as a rough inverse gauge of the overall effectiveness of any given environment.

Documentation is a critical factor in getting any environment under control and reaching a position of relative stability and predictability. In the context of Factoring Complexity I am primarily talking about first documenting all of the known and discoverable components of a given system, as completely as possible and at an appropriate level of detail. Then, even more importantly, documenting the relationships of one component to another in as much detail as possible. These relationships are very important for the next steps.

If you want to get started factoring complexity in your compute environments, there are three key things to document: first, physical components; second, abstract services; third, the relationships between them. In ITIL these are called Configuration Items (CIs). The relationships between CIs are what we are looking for here. They are tracked in a CMDB. CMDB is a nasty four-letter word in some places because they can be very challenging to implement. The CMDB, or Configuration Management DataBase, is in essence a social graph. There are some great pieces of software finally emerging that can easily handle the kinds of complexity and number of interconnections that many IT environments present. One of the more interesting ones to me is Neo4j. It is, in the site's own words,

Neo4j is a graph database. It is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables. A graph (mathematical lingo for a network) is a flexible data structure that allows a more agile and rapid style of development.

That's all for now; as far as I know, making Neo4j into a killer CMDB is still just an idea. It makes a lot of sense though, doesn't it?
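To make the idea slightly more concrete, here is a minimal, hypothetical sketch of a CMDB as a graph. The CI names and relationship types below are invented for illustration; a real implementation would live in a graph database such as Neo4j, but the shape of the data is the same.

    # Hypothetical sketch: Configuration Items (CIs) are nodes, and the
    # relationships between them are edges. Plain Python dictionaries are
    # used here only to show the shape of the data, not as a real CMDB.

    cis = {
        "web-01":   {"type": "server"},
        "app-01":   {"type": "server"},
        "db-01":    {"type": "server"},
        "checkout": {"type": "service"},   # an abstract service, not hardware
    }

    # (source CI, relationship, target CI)
    relationships = [
        ("checkout", "RUNS_ON",    "web-01"),
        ("web-01",   "DEPENDS_ON", "app-01"),
        ("app-01",   "DEPENDS_ON", "db-01"),
    ]

    def dependents_of(ci: str) -> list[str]:
        """Which CIs point at this one? Useful for impact analysis before a change."""
        return [src for src, _, dst in relationships if dst == ci]

    print(dependents_of("db-01"))   # ['app-01'] -- touch db-01 and app-01 feels it

The whole point of the exercise is that last query: once the relationships are captured, questions like "what breaks if this goes away?" become cheap to ask.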

On Clouds and SPOF’s (or the Great AWS Outage of April 2011)


Just a couple of days after I posted about cloud native applications, Amazon raised the bar by having some issues in one of their data center regions. From what I’ve read, these issues primarily affected EBS and RDS. So, one way or another, pretty much everything was affected, since using AWS EC2 without EBS in any form is a little impractical for most applications that exist today. This is because, in the absence of EBS, an EC2 instance’s local storage does not persist when the instance goes away. Most folks have not yet reached the operational nirvana of fully automated configuration management and application fault tolerance that would make this acceptable.

What level of SPOF (Single Point of Failure) are you willing to tolerate? I wanted to “scale up” the idea of the SPOF and then bring it back down again. Here we go.

If the earth stops working, so will your web application (admittedly there might be some satellite networks that don’t have this problem... but who cares at that point?)

So, let’s keep going. Each of these is a potential single point of failure.

Earth > Continent > Country > State/Region > City > Neighborhood > Building > Floor > Room > Rack > Server > Server Component

And, at each tier, there are numerous dependencies and contexts to keep your service running at any given time. There are the obvious ones, like the example above: if the earth explodes, the neighborhood is pretty much shot to hell also. But that’s obvious. It gets less obvious when you dig deeper into the data center and see that there are five servers, so that’s okay, right? Maybe. Maybe not. Not if it is something like:

Dynamic Name Service > Load Balancer > Web Server > Application Server > Database Server

Then those five servers/services might all be in one rack, in one room, in one building, in one neighborhood, in one city, in one state, in one country, on one continent, on one planet. That is looking pretty vulnerable. In the grand scheme of things, the loss of one power supply in one machine could impact the entire planet’s capacity to retrieve whatever is on that database that is so globally important, like a picture of your kid making a funny face on his 2nd birthday.
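One way to see why that serialized chain is so fragile: if every tier is a single instance, the service is only as available as the product of its parts. A quick illustrative calculation, with availability figures that are entirely made up for the example:

    # Illustrative only: the availability numbers below are invented.
    # When DNS, load balancer, web, app, and database are each a single
    # instance in series, service availability is the product of the parts.

    tiers = {
        "dns":           0.9999,
        "load_balancer": 0.999,
        "web_server":    0.999,
        "app_server":    0.999,
        "database":      0.999,
    }

    service_availability = 1.0
    for availability in tiers.values():
        service_availability *= availability

    downtime_hours_per_year = (1 - service_availability) * 365 * 24
    print(f"Serial chain availability: {service_availability:.4%}")
    print(f"Expected downtime: ~{downtime_hours_per_year:.1f} hours/year")

With those invented numbers the whole chain lands around 99.6% availability, roughly a day and a half of downtime a year, even though every individual tier looks respectable on paper.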

Do you think it is Amazon AWS’s fault if you put that database on one server in one rack in one place with no reasonable SLA and it goes away forever? Not so much. You are accountable and responsible. You made that choice.

Now, how can we change this for the better? We can develop applications that are able to tolerate the loss of a single point of failure at a sufficient granularity (Earth is a bit extreme today) such that our applications keep running when bad things like the AWS outage occur. I call these Cloud Native Applications. They have certain traits that should look a little familiar to cloud folks.

You cannot create a cloud native application doing things the same way you always have before. It simply will not work. The necessary software and systems architectures have changed if you want your application to run in the cloud with no SPOFs.
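As one small, hedged sketch of what that change looks like in code: a cloud native application never assumes a single endpoint. The hostnames and the fetch function below are hypothetical placeholders; the point is that the application itself, not the infrastructure, owns the failover.

    # Minimal sketch of one cloud-native trait: never assume a single endpoint.
    # Hostnames and fetch_record are hypothetical; in a real application
    # fetch_record would be a database or HTTP client call.

    import random

    REPLICAS = [
        "db-replica-us-east-1a.example.internal",   # hypothetical endpoints,
        "db-replica-us-east-1b.example.internal",   # spread across zones/regions
        "db-replica-us-west-1a.example.internal",
    ]

    def fetch_record(host: str, key: str) -> str:
        """Placeholder for a real client call; raises when a host is unreachable."""
        raise ConnectionError(f"{host} is unreachable")

    def resilient_fetch(key: str) -> str:
        # Try replicas in a shuffled order so no single host is special.
        last_error = None
        for host in random.sample(REPLICAS, k=len(REPLICAS)):
            try:
                return fetch_record(host, key)
            except ConnectionError as err:
                last_error = err     # tolerate the loss and move on
        raise RuntimeError("All replicas failed") from last_error

If losing one zone (or one region) only costs you a retry instead of your whole service, you have pushed the SPOF down to a granularity you can live with.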

Just needed to get that off my chest. Some related links for good reading:

http://blog.basho.com/2011/04/21/Amazons-outage-proves-riaks-vision/

http://www.thestoragearchitect.com/2011/04/22/so-your-aws-based-application-is-down-dont-blame-amazon/

http://highscalability.com/blog/2011/4/22/stuff-the-internet-says-on-scalability-for-april-22-2011.html

http://www.infoq.com/news/2011/04/amazon-ec2-outage

And if you’re really keen to write some CNA’s (contact me) and read...

http://www.infoq.com/presentations/Actor-Thinking
http://www.infoq.com/presentations/1000-Year-old-Design-Patterns

 

Update on 2011-04-25 02:39 by Kent Langley

I follow George Reese on twitter and just ran across a tweet about this post. I thought it was worth noting:

The AWS Outage: The Clouds Shining Moment
http://broadcast.oreilly.com/2011/04/the-aws-outage-the-clouds-shining-moment.html

While I do not necessarily agree with everything posted there I do like George's way of thinking.  I would say that he said it all in the last sentence.

"These kinds of failures don't expose the weaknesses of the cloud—they expose why the cloud is so important."


Update on 2011-04-26 21:03 by Kent Langley

And, another good write-up, imho.

http://stu.mp/2011/04/the-cloud-is-not-a-silver-bullet.html

Key point?

"A lot more effort needs to go into documenting proper cloud architecture. The cloud has changed the game. There are new tools and, as a result, new ways of building systems in the cloud. Case studies, diagrams, approved tools, etc. should all be highlighted, documented, and preached about accordingly."

He says that, "The Cloud is not a silver bullet..."  I certainly agree.  But it is a gold mine of opportunity if you choose to avail yourself of its strengths and deploy cloud native applications, as Netflix, Amazon themselves, and SmugMug appear to have done so well.

Scalability as a Discipline

Just a very quick post to point out a brief blog entry regarding the concept of scalability as a discipline and the scalability architect as a formal role.

AKF Partners posted a blog entry discussing this issue that is near and dear to my heart.  Of particular impact, I thought, was the idea that the Scalability Architect is also a teacher and has the purview to educate others about scalability concepts.  I couldn't agree more!

Take a moment and check out their post.

http://akfpartners.com/techblog/2010/11/09/scalability-as-a-discipline

 

SPOF (Single Point of Failure) Analysis

When planning a system, or taking on the analysis of a system that is already in place to begin preparations for scaling, there are a few key things one must do.  One of those tasks is to perform a Single Point of Failure (SPOF) analysis.  SPOFs are the enemy of availability for any system.  This is an exercise done with the input of a cross-functional team of key people from operations, business, and development.  The goal of this analysis is not to actually do the work but to identify the work that needs to be done in the context of the business goals.

This analysis goes through phases similar to many technology projects.  They are usually something like:
  • Define the Goals of the Analysis
  • Design the Plan to Achieve the Goals
  • Execute the Plan
  • Analyze the Data
  • Produce Report of Results

These steps may vary a bit depending on the organization size, team, available resources, and the size of the environment.  But, in general, SPOF analysis will follow that pattern.

Some other important points to consider when thinking about SPOFs:

It’s can be better to have SPOF analysis done by a 3rd party who is actually less familiar with the systems but has proper relevant experience.  Those that are very close to a system have a tendancy not to see things because they are too near.

Having SPOFs does not necessarily mean that someone made a mistake.  SPOF analysis is not a weapon.  If you treat it that way, you can be sure people will not report SPOFs when they find them.  Oftentimes we live with SPOFs on purpose due to resource limitations or opportunity costs.  If fixing a SPOF will cost a million dollars, it might be better to accept the potential downtime if something goes wrong.  Whether this is or is not true in any given situation is a complex business question much more than it is a technical matter.

The report produced as the output of a SPOF audit should be reviewed by the entire technology and business team to determine the potential impact to the business, and then entered into whatever passes for a backlog and technology architecture review board so that the findings can be properly analyzed, ranked, and put in line to be fixed.
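One lightweight way to capture the findings so they can actually be ranked and pushed into a backlog is a simple register. The field names below are only a suggestion; adapt them to whatever your organization already tracks.

    # A suggested shape for SPOF findings so they can be ranked and fed into
    # a backlog or architecture review. Field names are illustrative only.

    from dataclasses import dataclass

    @dataclass
    class SpofFinding:
        component: str          # the CI or service at risk
        impact: int             # 1 (minor) .. 5 (business-stopping)
        likelihood: int         # 1 (rare) .. 5 (expected)
        remediation_cost: str   # rough order of magnitude, e.g. "$10k", "$1M"
        accepted: bool = False  # deliberately living with it is a valid outcome

        @property
        def risk_score(self) -> int:
            return self.impact * self.likelihood

    findings = [
        SpofFinding("single database server", impact=5, likelihood=3, remediation_cost="$50k"),
        SpofFinding("one person holds all DNS credentials", impact=4, likelihood=2, remediation_cost="$0"),
    ]

    for finding in sorted(findings, key=lambda f: f.risk_score, reverse=True):
        print(finding.risk_score, finding.component)

Note the accepted flag: a finding that the business consciously decides to live with is still worth keeping on the list, for exactly the reasons above.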

SPOF analysis should be done periodically throughout a product/service life cycle.  Things change every single day.  Last year's SPOF analysis is probably no longer valid.  Comparing SPOF analyses over time can also be very enlightening for finding endemic problems that consistently get swept under the rug.

SPOF analysis does not just apply to technology.  It also applies to business organizations.  One of the people that I've noted understands this far better than most is Warren Buffett.  I think he had a clearly articulated (albeit secretive) succession plan when I was still in diapers.  Even at Berkshire Hathaway, Warren Buffett himself has made sure that he is not a single point of failure; a true visionary.

Some Cloud Thoughts on a Clear and Sunny Day

Cloud Computing is a deployment model and cloud computing is a business model.  Cloud computing is not some silver bullet magical thing.  It's not even easy *gasp* sometimes.

As a deployment model, cloud computing is simply summed up as on-demand, self-service, reliable services with low to no capital cost for the consumer.

As a business model, it is summed up as, again, low to no long-term capital costs (and the associated depreciation) and pay-as-you-go service provider pricing models.  In reality these are mountains of micro transactions aggregated into monthly and yearly billing cycles.  For example, I spent $0.015 for a small compute instance with a cloud infrastructure provider because I just needed an hour of an Ubuntu 10.04 Linux machine to test a quick software install combination and update a piece of documentation.  I'll get a bill for that at the end of the month.  Get this...

An hour of compute time costs me 3.3 times LESS than a piece of Hubba Bubba chewing gum cost me at $0.05 (one-time use only) over 30 years ago. #cloud
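To make the billing model concrete, here is a toy aggregation of those micro transactions into a monthly line item. The $0.015/hour rate is the one from my example above; the usage numbers are invented.

    # Toy illustration of pay-as-you-go billing: many tiny metered charges
    # aggregated into one monthly line item. Usage hours are invented.

    HOURLY_RATE = 0.015                 # small compute instance, USD per hour

    usage_hours = [1, 3, 0.5, 24, 2]    # a month's worth of ad hoc test instances
    monthly_bill = sum(hours * HOURLY_RATE for hours in usage_hours)

    print(f"Micro transactions: {len(usage_hours)}")
    print(f"Monthly bill: ${monthly_bill:.2f}")                  # ~$0.46 for the month
    print(f"Gum-to-compute ratio: {0.05 / HOURLY_RATE:.1f}x")    # ~3.3x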

Enterprises and service providers are learning very quickly from the early public cloud vendors how to do things differently and often more efficiently.  It was well summed up in the Federal CTO's announcement of the government application cloud.  Basically: we saw that consumers could get IT services for orders of magnitude less than we could, so we're fixing that by emulating what the companies that serve those consumers are doing. Smart.  Bechtel did the exact same thing years ago when it noticed that Amazon's cost per GB of storage was orders of magnitude less than its own, asked the very important question of why, and then answered it very well.
A couple of years ago now I helped found a company called nScaled.  nScaled does business continuity as a service.  It has only been possible, with the resources we have, at the price we charge, and at the speed we have moved, because we followed cloud computing deployment and business models.  It would not have been possible for us to build this business when we did, and the way we have, without them.
In March 2008 I called cloud computing a renaissance.

It is my opinion that Cloud Computing is a technology architecture evolution that, when properly applied to business problems, can enable a business revolution. I've been saying this for a while but in recent weeks I have actually come to prefer the term renaissance over revolution.

Today, two years into a startup that uses the raw power of cloud computing deployment and business models across the board to enable new ways for companies to consume disaster recovery and business continuity solutions, I can say without a doubt that I believe cloud computing is a renaissance more than ever before!