Datafication - a term used to describe our newfound ability to capture and quantify many aspects of the world as data.

The following chart shows usage of the term "Big Data" overlaid with "Cloud Computing" according to Google Trends.

This is important. As you can see, the two have merged. Cloud Computing architectures are foundational for Big Data to exist in the forms, and for the purposes, in which we are seeing it evolve. Cloud Computing is an embodiment of the skills we have begun to master for managing and marshaling computational resources in agile ways, so that we can develop and deploy information technologies capable of solving business problems more quickly, more affordably, and more completely than ever before.

Vast quantities of computational resources are now relatively easy to access, at least compared to the cost and effort it took to marshal them only a few years ago. This is by no means self-evident to most, and it is still very difficult to do at scale. The necessary skills in software and systems engineering are still less common than some might like.

This article is inspired by an article by Irving Wladawsky-Berger on his blog. His article is in turn inspired by a book that I just purchased and will be reading as quickly as I can. In his article he summarizes three profound changes that have been occurring in the context of Big Data. I have personally run into all of these in my work building solutions to deal with big data. I'll summarize the definitions here and put them in a client-centric context, since I deal with that often. Read the full post and then the book to dig into the nitty-gritty details. They are:

n = all

It used to be okay to say something like, "I'll take 10% of the total data and use that to infer, in a statistically significant way, what is going on in your data." This is no longer okay. We have all the data, we can store all the data, and it's not good enough to look at only 10% of it. We want n = all for our sample size. However, there are caveats.
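To make the contrast concrete, here is a minimal sketch in Python comparing an estimate from a 10% random sample with the same statistic computed over every record. The order amounts are synthetic and the function name sample_vs_all is made up for illustration; nothing here comes from the book or from Mr. Wladawsky-Berger's post.

```python
import random
import statistics


def sample_vs_all(values, fraction=0.10, seed=0):
    """Return (mean of a random sample, mean of all values)."""
    rng = random.Random(seed)
    # Always keep at least one record in the sample.
    k = max(1, int(len(values) * fraction))
    sample = rng.sample(values, k)
    return statistics.mean(sample), statistics.mean(values)


# Hypothetical data: 10,000 synthetic order amounts.
data_rng = random.Random(42)
orders = [data_rng.gauss(100.0, 20.0) for _ in range(10_000)]

sample_mean, full_mean = sample_vs_all(orders)
print(f"10% sample mean: {sample_mean:.2f}")
print(f"n = all mean:    {full_mean:.2f}")
```

The sample gets you close, but the full pass removes the sampling error entirely. On a data set this small it costs almost nothing; at real scale, that full pass is exactly where the infrastructure comes in.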

Accepting Messiness

Sidebar

As an aside, one thing I'm seeing is that there is a LOT of data that's just been sitting around for quite some time now, waiting for its moment in the sun, so to speak. I think we are finally there. I'm working on some projects for a client now along these lines, and there certainly appear to be many more such projects in the future.

Unstructured data can be quite messy. This means that sometimes you will not be able to use a standard structured query language (SQL) and relational database formats to get the job of asking and answering questions done. Instead, you might need something more like a faceted search or a multi-phase map-reduce style processing pipeline. Increasingly, this means turning to some form of natural language processing, filtering, machine learning, search technology, or graph analysis to get the job done at scale.
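For readers who have not seen one, here is a rough single-machine sketch in Python of what a multi-phase map-reduce style pipeline looks like when applied to free text. The sample notes and the function names are hypothetical, and a real system would distribute each phase across many machines rather than run them in one process.

```python
import re
from collections import defaultdict


def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in re.findall(r"[a-z']+", doc.lower()):
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each group of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Chain the three phases into one pipeline."""
    return reduce_phase(shuffle_phase(map_phase(documents)))


# Hypothetical unstructured input: free-text support notes.
notes = [
    "Customer called about a late shipment.",
    "Shipment arrived late; customer unhappy.",
    "Customer praised the fast shipment!",
]

counts = word_count(notes)
for word in ("customer", "shipment", "late"):
    print(word, counts[word])
```

No schema was needed up front. The structure (here, word counts) is imposed by the pipeline itself, which is why this style suits messy text that would not fit neatly into relational tables.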

Causation to Correlation

The client says something like, "I just need to know what is going on now so I can figure out what to do now to make it better." This is a squishy one, and I think I'll understand it better once I read the book. I expect to come back to it later. Ultimately, the idea seems to be that, these days, the what often trumps the why when it comes to analyzing the data.
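As a small illustration of measuring the "what" without explaining the "why", here is a sketch in Python that computes a Pearson correlation coefficient between two series. The weekly figures are invented for the example and pearson_r is my own helper, not something taken from the book. A coefficient near 1 or -1 says the two series move together; it says nothing about which one, if either, causes the other.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly figures for one client.
support_tickets = [12, 15, 11, 20, 25, 23, 30]
cancellations = [3, 4, 3, 6, 7, 6, 9]

r = pearson_r(support_tickets, cancellations)
print(f"correlation between tickets and cancellations: {r:.2f}")
```

A client who sees the two numbers rising together can act on that now (say, by watching ticket volume as an early warning) without first settling why they are linked.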

In summary, thanks again to Mr. Wladawsky-Berger for another thoughtful post on a book I now know I need to read ASAP.

Good times! Got a big data problem? I'd love to hear about it. Drop me a line.