Datafication

Datafication is the term used to describe our new ability to capture and quantify many aspects of the world as data.

The following chart shows usage of the term "Big Data" overlaid with "Cloud Computing" according to Google Trends.


This is important. You can see they have merged. Cloud Computing architectures are foundational for Big Data to exist in the forms we are seeing it take and for the purposes we are seeing it serve. Cloud Computing is an embodiment of the skills we have now begun to master for managing and marshaling computational resources in agile ways, letting us develop and deploy Information Technologies that solve business problems more quickly, more affordably, and more completely than ever before.

Vast quantities of computational resources are now relatively easy to access, at least relative to the cost and effort it took to marshal them only a few years ago. This is by no means self-evident to most, and it is still very difficult to do at scale. The necessary skills in software and systems engineering are still less common than some might like.

This article is inspired by a post by Irving Wladawsky-Berger on his blog. His post was in turn inspired by a book that I just purchased and will be reading as quickly as I can. In it, he summarizes three profound changes that have been occurring in the context of Big Data. I have personally run into all of these in my work building solutions to deal with big data. I'll summarize the definitions here and put them in a client-centric context, since that is what I deal with most often. Read the full post, and then the book, to dig into the nitty-gritty details. They are:

n = all

It used to be okay to say something like: I'll take 10% of the total data and use that to infer, in a statistically significant way, what is going on relative to your data. Now, this is no longer okay. We have all the data, we can store all the data, and it's not good enough to only look at 10% of it. We want n = all for our sample size. However, there are caveats.
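As a toy illustration of why this matters, here is a quick sketch in Python (the numbers are invented, not from the book): averages survive a 10% sample just fine, but rare events out in the tail are where n = all starts to pay off.

```python
import random

# Toy sketch with invented numbers: a 10% sample estimates an average just
# fine, but it gets noisy on rare events. That long tail is one reason
# n = all matters once storing everything is cheap.
random.seed(7)
data = [random.gauss(100, 25) for _ in range(1_000_000)]  # pretend "all the data"
sample = random.sample(data, len(data) // 10)             # the classic 10% sample

print(sum(data) / len(data), sum(sample) / len(sample))   # means: nearly identical

rare_all = sum(1 for x in data if x > 200)                # rare tail events, n = all
rare_est = 10 * sum(1 for x in sample if x > 200)         # scaled-up 10% estimate
print(rare_all, rare_est)                                 # the estimate can be far off
```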

Accepting Messiness

Sidebar

As an aside, one thing I'm seeing is that there is a LOT of data that has just been sitting around for quite some time now, waiting for its moment in the sun, so to speak. I think we are finally there. I'm working on some projects along these lines for a client now, and there certainly appear to be many more such projects on the horizon.


Unstructured data can be quite messy. What this means is that sometimes you are not going to be able to use standard structured query language (SQL) and relational database formats to get the asking-and-answering-of-questions job done. You might need something more like a faceted search or a multi-phase map-reduce style processing pipeline. Often, this means turning to some form of natural language processing, filtering, machine learning, search technology, or graph analysis to get the job done at scale.
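To give a feel for the multi-phase map-reduce style idea, here is a minimal sketch in plain Python (the documents are invented, and the "NLP" is just crude tokenization; a real deployment would use something like Hadoop or Spark):

```python
import re
from collections import defaultdict

# Minimal sketch of a two-phase map-reduce style pipeline over messy text.
documents = [
    "Big Data!! is messy,  messy data...",
    "cloud computing + big data = insight",
]

def map_phase(doc):
    # Emit (token, 1) pairs; lowercasing plus a regex stand in for real
    # normalization and filtering.
    for token in re.findall(r"[a-z]+", doc.lower()):
        yield token, 1

def reduce_phase(pairs):
    # Aggregate the mapped pairs into per-token counts.
    counts = defaultdict(int)
    for token, n in pairs:
        counts[token] += n
    return dict(counts)

pairs = (pair for doc in documents for pair in map_phase(doc))
print(reduce_phase(pairs))
```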

Causation to Correlation 

A client says something like: I just need to know what is going on now so I can figure out what to do to make it better. This is a squishy one, and I think I'll get more from it when I read the book. I expect to come back to this one later. Ultimately, the idea seems to be that, these days, the what often trumps the why when it comes to analyzing the data.

In summary, thanks again to Mr. Wladawsky-Berger for another thoughtful post on a book I now know I need to read ASAP.

Good times! Got a big data problem? I'd love to hear about it; drop me a line.


Cloud Powered Agility and Innovation

Robert Abbott of Norwest Venture Partners posted at VentureBeat saying that the market of "companies with 50 to 2,000 employees" is "a sleeping giant about to be energized by productivity-boosting software-as-a-service." He wrapped up with something I find particularly intriguing and have thought about a good bit: what he called a Bring Your Own Application (BYOA) IT culture.

At a macro scale, this sense of being energized is, to my mind, a point of maturation in the very broad migration to on-demand cloud infrastructure. This is not new, per se. No matter what you call it, it began to really find purchase in 1999 or so with LoudCloud. Interestingly enough, it didn't work out as planned then, and LoudCloud became something different. I still remember roaming through freezing data centers in Fremont in 1999/2000, seeing those big walled-off LoudCloud areas. Yes, there are things that go further back. But this was it for me.

By way of a real-life example of how the cloud is empowering companies in very interesting ways, at my new company, Ekho, we use:

Accounting = Xero (SaaS)
CRM = Zoho (SaaS)
Email, File Sharing, Intranet = Google Apps (SaaS)
Phones = RingCentral (PBX in the cloud; hybrid: phones on the desk, software in the cloud)
Time Tracking = FreshBooks (SaaS, integrates w/ Xero)
Email List Management = MadMimi (SaaS)
Data Center = Amazon Web Services (IaaS/PaaS)

Given what I am building at Ekho, reactive, cloud-native stream analytics systems, I'd say that's quite a virtuous cycle of cloud feeding cloud feeding cloud. At every turn we are using the cloud to build the cloud! This is one key part of what keeps us lean and agile, honed to a razor-sharp edge.

I wrote in a June 2008 post called "Is SaaS Cloud Computing" that, "The most interesting thing about Cloud Computing to me is that it enables entirely new types of business to exist and be economically viable that never could have persisted before. The economics changed and this is only the tip of the iceberg. Fun times!"

I think that gets at the heart of what Mr. Abbott is saying. I just went back to re-read some other articles I wrote a while back. They are relevant as well.

"Cloud Computing Impacts Technical Agility", Oct 2010

"Why Should Businesses Bother With Cloud Computing", Feb 2009 

It is now five years later and my how things have changed!

I started nScaled in 2008. I thought it was going to be a cloud services and brokerage company. I was wrong, or did not sell it well. We pivoted quickly to what was essentially a vertically integrated Disaster Recovery as a Service stack, building the IaaS (data centers), the PaaS (storage as a service), and the SaaS (the cloud control panel). This was very interesting, but not what I originally thought it would be, for sure. But that's for another post.

Ultimately, I could not agree more with Mr. Abbott. He is likely right, and I would only say that this change is already energized. It has been building up steam for 13 years.

Figure: a sigmoid curve (Source: TutorVista)

I think things are pretty energized at this point. I really hope we all have enough gas and fortitude to deliver on the maximum potential available! It's quite possible we are at the knee of the sigmoid curve, though, and we've only just scratched the surface. I hope that is true, because if so, then this is the time when some incredible companies can be born! But notice, it's still UPHILL. So there is a lot of good work to be done yet!

Part 2: When Is Big Data Actually Big?

I posted a few weeks ago discussing the concept of when big data really matters. I said it then, and I'll repeat it: Big Data is big when it provides big (meaningful) insights that can then be used to drive business decisions and, ultimately, business value. I'm speaking of "business" in very broad terms here, not just standard corporate business.

So, in my opinion, Big Data is much more than just data measured by velocity, variety, and volume. Those dimensions are relevant, and they will impact your technology choices.

If you can ask and then answer big, important questions with data, then it's big, no matter the volume, velocity, and variety.

Cloud Computing and Plato

My college philosophy is admittedly very rusty, but I was talking to a colleague recently and explaining why I'm doing, and planning to do more of, a particular kind of work in the AWS environment for Ekho that introduces instability on purpose. I was inspired in part, of course, by what Netflix did ages ago with Chaos Monkey.

Soon, the conversation strayed to configuration management. Systems automation tools like Chef, Puppet, AWS CloudFormation, and others allow you to express infrastructure as code. This is critical to a complex system's survival in an inherently unstable environment (like most data centers). Things in a cloud native deployment environment are always changing. How do you keep up or get a handle on it all? How do you even reason about this when you try to explain it to people? This is where Plato came into play. I wanted a non-technical analogy.

In college I studied some philosophy, but only enough to be dangerous at parties where the beer flowed freely and the pontification abounded even more so. However, one particular set of lessons from my studies of Plato, in a class on Gnosis at Bellarmine University, has stuck with me over the years: Plato's Theory of Forms. From Wikipedia today, the theory states:

Plato's theory of Forms or theory of Ideas asserts that non-material abstract (but substantial) forms (or ideas), and not the material world of change, possess the highest and most fundamental kind of reality.

Source : https://en.wikipedia.org/wiki/Theory_of_Forms

It turns out this applies very much to my thinking about how cloud infrastructure is instantiated using tools like Chef and Puppet. The code part of infrastructure as code is the abstract form, a representation of the idea of what could be. The deployment, once instantiated, is just a copy. Or, in the terms of Plato, the material. It is the form itself that possesses the most fundamental reality. Or, in this case, it is the configuration of record.
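To ground the analogy, here is a toy sketch (this is not Chef, Puppet, or CloudFormation syntax; the roles and instance types are invented) of the loop those tools embody: the code declares the form, the running deployment is the copy, and reconciliation pulls the copy back toward the form.

```python
# Toy sketch of the infrastructure-as-code idea: a declarative desired state
# plus a reconcile step. All names and values here are invented.

desired_state = {  # the abstract "form": what the infrastructure should be
    "web":    {"count": 3, "type": "m1.small"},
    "worker": {"count": 2, "type": "m1.large"},
}

actual_state = {   # the material "copy": what actually exists right now
    "web":    {"count": 2, "type": "m1.small"},  # one instance died overnight
    "worker": {"count": 2, "type": "m1.large"},
}

def reconcile(desired, actual):
    # Compare the form to the copy and report what must change so that
    # reality matches the code again.
    actions = []
    for role, spec in desired.items():
        have = actual.get(role, {}).get("count", 0)
        delta = spec["count"] - have
        if delta > 0:
            actions.append(f"launch {delta} x {spec['type']} for '{role}'")
        elif delta < 0:
            actions.append(f"terminate {-delta} x {spec['type']} for '{role}'")
    return actions

print(reconcile(desired_state, actual_state))  # -> ["launch 1 x m1.small for 'web'"]
```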

If you use tools like Chef to design, manage, build, and maintain your infrastructure, then perhaps you are espousing philosophy daily. What you instantiate on a cloud like AWS is certainly, by its very nature, ephemeral. Yes, even if you use EBS volumes. Eventually, it will all go away or be replaced by the next revision, leaving nothing but the code it was spawned from. But then... eventually, that will fade away as well. This is the part where beer is needed, because you then wonder what happens when the idea that spawned the code that spawned the server is forgotten by the original person. We'll probably need some heavy-duty graph analysis to go from there to all the possible permutations.

GraphLab Workshop 2013 - Follow On Thoughts

I attended the GraphLab Workshop 2013 on Monday this week in San Francisco. It was a good set of talks. I went to this event to get another solid point of view on the state of the art in graph databases and analytics in the context of data mining and machine learning. A few things really stood out, and I thought I'd outline them here.

One of the standouts for me was the work that has been done on GraphChi as a derivative of GraphLab. I think this looks to be a very important tool for ad hoc analysis, for the development workflow, and for the general push forward in knowledge about graph analysis and graph databases. Historically, it has not been easy to get access to the types of systems needed to learn and test graph analytics. GraphChi really makes a dent in that by letting you work on graphs of substantial size, in reasonable time frames, using very little hardware, at extremely low operational complexity. GraphLab itself looks to be planning some good work in this ease-of-use / getting-started area as well. A funny question from the audience was, "Did you (GraphChi) single-handedly kill big data?" Well, of course not, but thinking that through does illustrate that big data isn't all about big infrastructure. From my point of view, big data is about big insight in the most effective manner possible!
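The core trick, as I understand it, is computing on a graph much larger than RAM by streaming edges from disk. Here is a heavily simplified Python sketch of that idea (this is not the GraphChi API, just the concept, and "edges.txt" is an assumed input file):

```python
from collections import defaultdict

# Out-of-core sketch: stream a huge edge list from disk and keep only small
# per-vertex state in memory. A gross simplification of GraphChi's parallel
# sliding windows approach. Input: one "src dst" pair per line.

def degree_count(path):
    degree = defaultdict(int)   # per-vertex state: small relative to the edges
    with open(path) as f:
        for line in f:          # edges stream through; never all in RAM at once
            src, dst = line.split()
            degree[src] += 1
            degree[dst] += 1
    return degree

# degrees = degree_count("edges.txt")
# print(max(degrees.values()))
```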

I found myself a little surprised at the near complete absence (with one or two small exceptions) of any mention of other frameworks. This was a GraphLab event, so I guess that should not be too surprising. But given that much of my prior exploration in this area was with Spark/Shark, I was hoping for a little more comparative analysis.

There was a recurring theme, again and again, on the underlying technicalities of doing large-scale graph processing: at scale (from a size-of-graph perspective), it's very easy for communication to become the bottleneck. This makes sense, of course. However, what struck me in one of the talks in particular is just how big a graph you can store on a single machine today. For example, one slide from a Twitter speaker implied being able to store 40 billion edges on a single commodity server with 288 GB of memory. In other words, you can do some very sophisticated things with relatively little hardware. So, this might be one of those cases where you need a good reason to go out and marshal hundreds or thousands of servers when you might just need one or two.
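A quick back-of-the-envelope check on that claim (my arithmetic, not from the slide) shows why it implies compact, compressed adjacency structures rather than naive edge storage:

```python
# Back-of-the-envelope check on the 40-billion-edges-in-288-GB claim.
edges = 40e9
ram = 288 * 2**30          # 288 GB in bytes
print(ram / edges)         # ~7.7 bytes per edge available
print(edges * 16 / 2**30)  # ~596 GB if you stored two raw 8-byte IDs per edge
```

At roughly 7.7 bytes per edge, you simply cannot store two raw 8-byte vertex IDs per edge; something like sorted, delta-encoded adjacency lists has to be doing the work.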

The "IceCube" project is unbelievable. Wow! How did I never hear about this? This is a neutrino detector for stuff that passes through the earth. Meaning, they have optical sensors turns toward the interior of the earth that collect and generate a LOT of data. Graph analysis is then used to determine the good signals from the bad signals onsite (very remote site) and then only send off to the lab over the satellite link what really seems to matter.

I was also struck by how much graph processing could potentially benefit from the tools and techniques embodied in technologies like Scala and Actor Model implementations like Akka, along with features like Futures. Blocking is bad in graph analysis, and it just seems, although I need more data to back this up, that things like Actors and Futures could be very useful in this context.
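To illustrate the non-blocking flavor I mean, here is a sketch in Python's asyncio (standing in for the Akka/Scala machinery, which it does not reproduce): each vertex "actor" consumes its own mailbox and fans messages out without ever blocking the others.

```python
import asyncio

# Actor-ish sketch: per-vertex mailboxes, cooperative scheduling, no blocking.

async def vertex_actor(name, inbox, neighbors):
    while True:
        msg = await inbox.get()        # yields to other actors instead of blocking
        if msg is None:                # shutdown sentinel
            break
        print(f"{name} got: {msg}")
        for nb in neighbors:           # fan out without waiting for replies
            await nb.put(f"{msg} via {name}")

async def main():
    a_in, b_in = asyncio.Queue(), asyncio.Queue()
    a = asyncio.create_task(vertex_actor("A", a_in, []))      # A has no neighbors
    b = asyncio.create_task(vertex_actor("B", b_in, [a_in]))  # B forwards to A
    await b_in.put("hello")
    await asyncio.sleep(0.1)           # let the messages flow
    await a_in.put(None)
    await b_in.put(None)
    await asyncio.gather(a, b)

asyncio.run(main())
```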

Lexis-Nexis was doing some very interesting work analyzing data from mobile devices in airports and other areas in concert with technology from companies like Cisco to provide indoor geolocation and help them analyze airport traffic flow patterns over time. This made for some lovely graphs.

I was introduced to BrickStream as well at this talk. In short, imagine putting a Kinect-like sensor in your warehouse or store thereby essentially giving it eyes. The internet of things is very alive and we'll definitely be needing powerful data analytics technologies to make any sense out of it at all. Graph processing seems to be much at the heart of all of this effort.

Demographics. There were 553 people in attendance and a large percentage were women. This is great! You read a lot about the lack of women in technology but they certainly were in attendance at this event in force.

That's about it off the top of my head. If any of this is interesting to you and you'd like to chat about it, reach out to me on Twitter @kentlangley. For my part, I have some very interesting applications I'm being asked to build related to all of this, and I'm looking forward to building even more awesome software that is faster, bigger, and smarter than ever before!