Moving at the Speed of Cloud
The majority of my work over the last three years or so has been about receiving, getting, pushing, pulling, and generally wrangling streams of data (mostly social data) for analytics, comparison, or storage, across a broad range of products and services for startups (one of my own) and Fortune 500 companies. It's been keeping me busy. All of this for the ultimate purpose of helping businesses make better, more well-informed decisions about products, services, and more.
During this time my colleagues and I have developed the relationships, partnerships, technology stacks, and processes necessary to deliver these types of applications quickly and at a high level of quality. All in all this has been fun, and it's something for which demand seems to be growing quickly.
To give a sense of the technology "stack" I've mostly settled on for solving these types of problems, here's what we are using:
Languages: Scala, Java, Node.js, PHP, Ruby
Frameworks: Symfony2, Play 2.0, Express.js, Twitter Bootstrap
Data Stores: MySQL, MongoDB, Riak, Redis
Infrastructure: Amazon Web Services
Orchestration: Chef, Custom Scripting, AWS CloudFormation
That's just a high-level snapshot, of course; there are a lot of details down inside each of those items, from favored libraries to DB clients and configuration management frameworks.
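To make the orchestration bit slightly more concrete, here's the skeleton of a CloudFormation template (a minimal hypothetical sketch; the AMI ID and instance type are placeholders, not what we actually run):

    {
      "AWSTemplateFormatVersion": "2010-09-09",
      "Description": "Sketch: a single application node",
      "Resources": {
        "AppNode": {
          "Type": "AWS::EC2::Instance",
          "Properties": {
            "ImageId": "ami-00000000",
            "InstanceType": "m1.large"
          }
        }
      }
    }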
The best part for me is that, for the first time in a long while, many businesses seem to understand and believe in the value of applying technology to business problems as a first-order task.
The drive for big data aggregation and analytics is a natural evolution of the maturation of cloud computing as both a technology and a service/process. The continued evolution of programming languages, application frameworks, and even the general understanding of distributed, service-oriented architectures and how to program REST APIs is improving at such an incredible rate that it's just an awesome time to be creating software.
So much of what we are doing now has been "around" in one form or another for a long time. The science in computer science laid the foundations quite some time ago. It's only now that so much is becoming so accessible and the information on how to use all these tools is readily available.
I read a recent article/survey posted to Forbes.com that said the cloud is still three years away from its full impact. The first CloudCamp, where I did a session on developing for the cloud, was in 2008. That's only four years ago, and look how much has changed! Awesome.
From where I sit, this is an exciting time with nearly unlimited possibilities. Ideas are critical. Execution is just as important. If you want to talk about any of these things, I'm usually found in either San Francisco or San Rafael, so let's chat! Good times!!
Building an Application upon Riak - Part 1
- Making the Decision
- Learning
- Operating
- Scaling
- Mistakes
In any event, I'll not bore you with all the details, but we chose Riak. We originally chose it because we felt it would be easy to manage as our data volume grew and because published benchmarks looked very promising. We also wanted something based on the Dynamo model, with adjustable CAP properties per "bucket", and it was a good fit for our speed requirements, our "schema", our data volume capacity plan, our data model, and a few other things.
The primary programming language for our project is Scala. There is no reasonable Scala client for Riak at the moment that is kept up to date, so we use the Java client.
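To give a flavor of what that looks like in practice, here's a minimal store/fetch sketch calling the Java client from Scala. This assumes the 1.x riak-java-client API; the host, bucket, and key names are made up:

    import com.basho.riak.client.{IRiakClient, RiakFactory}

    object RiakSketch extends App {
      // Protocol Buffers client; 8087 is Riak's default PB port
      val client: IRiakClient = RiakFactory.pbcClient("10.0.0.10", 8087)

      // Buckets are fetched by name (created lazily on first write)
      val bucket = client.fetchBucket("accounts").execute()

      // Store a value, then read it back by key
      bucket.store("account-42", """{"name":"Acme"}""").execute()
      val fetched = bucket.fetch("account-42").execute()
      println(Option(fetched).map(_.getValueAsString))

      client.shutdown()
    }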
We are running our application (a rather interesting business analytics platform, if I do say so myself) on AWS using Ubuntu images.
We do all of our configuration management, EC2 instance management, monitoring harnesses, maintenance, and much more with Opscode Chef. But that's a whole other story.
We are currently running Riak 1.0.1 and will get to 1.0.2 soon. We started on 0.12.0 I think it was... maybe 0.13.0. I’ll have to go back and check.
On to some of the learning (and mistakes)
Up and Running - Getting started with Riak is very easy, very affordable, and covered well in the documentation. Honestly, it couldn't be much easier. But then... things get a bit more interesting.
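To be concrete about "easy": with a default install listening on the stock HTTP port (8098), your first conversation with Riak can be as simple as:

    # Is the node alive?
    curl http://127.0.0.1:8098/ping
    # -> OK

    # Node stats, a handy first sanity check
    curl http://127.0.0.1:8098/stats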
REST ye not - Riak allows you to use a REST API over HTTP to interact with the data store. This is really nice for getting started, but it's really slow for actually building your applications. This was one of the first "easy buttons" we decommissioned; we had to move to the Protocol Buffers interface for everything. In hindsight this makes sense, but we genuinely expected to get more out of the REST interface. It was simply not usable in our case.
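In the Java client, at least, the swap itself was mercifully small. This assumes the 1.x riak-java-client's RiakFactory helpers; the host and URL are hypothetical:

    import com.basho.riak.client.RiakFactory

    object ClientSwapSketch extends App {
      // HTTP/REST client: fine for poking around (8098 is the default HTTP port)
      val httpClient = RiakFactory.httpClient("http://10.0.0.10:8098/riak")

      // Protocol Buffers client: what we moved everything to (default PB port 8087)
      val pbClient = RiakFactory.pbcClient("10.0.0.10", 8087)
    }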
Balancing the Load - Riak doesn't do much for you when it comes to load balancing your various types of requests. We settled, courtesy of our crafty operations team, on running HAProxy on each application node to shuttle requests to and from the various Riak nodes. Let me warn you: this has worked for us, but there be demons here! The configuration details of running HAProxy in front of Riak are about as clear as mud, and there isn't much help to be found at the moment. This was one of those moments when I really wished the client were a bit smarter.
Now, when nodes start dying, getting too busy, or whatever else might come up, you'll be relying on your proxy (HAProxy or otherwise) to handle it for you. We don't consider ourselves done on this point at all, but we'll get there.
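For what it's worth, the general shape of the thing looks something like this (a hypothetical fragment, not our actual config; the node addresses are invented, and you will want to tune balancing, timeouts, and health checks for your own environment):

    # /etc/haproxy/haproxy.cfg (fragment) -- runs on each application node
    listen riak_pb
        bind 127.0.0.1:8087          # the app talks to localhost; haproxy fans out
        mode tcp                     # protocol buffers is plain TCP, not HTTP
        balance leastconn
        server riak1 10.0.1.11:8087 check
        server riak2 10.0.1.12:8087 check
        server riak3 10.0.1.13:8087 check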
Link Walking (err... Ambling) - We modeled much of our early data relationships using link walking. The learning? S-L-O-W. We had to remove it completely. Play with it, but don't plan on using it in production out of the gate. I think there is much potential here, and we'll perhaps return to this feature for some less latency-sensitive work. Time will tell...
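For reference, link walking is driven over the REST interface with a traversal segment tacked onto the object URL, roughly like this (the bucket, key, and tag names are made up):

    # Follow links tagged "friend" from users/alice to other users
    # Pattern: /riak/<bucket>/<key>/<linked-bucket>,<tag>,<keep>
    curl http://127.0.0.1:8098/riak/users/alice/users,friend,1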
Watchoo Lookin' for?! Riak Search - When we started, search was a separate project, but we knew we would have a use for search in our application, so we did everything we could to plan ahead for that fact. By the time we were really getting hot and heavy (post-1.0.0 deployment), though, we were finding out a few very interesting things about search. It's VERY slow when you have a large result set; it's just the nature of the way it's implemented. If you think your search will return more than 2,000 items, think long and hard about using Riak's search functions as your primary search. This is, again, one of those things we've pulled back on quite a bit. The most important bits of learning (see the sketch after this list) were to:
- Keep result sets small
- Use inline fields (this helped us a lot)
- Realize that searches run on ONE physical node and one vnode and WILL block (we didn't really feel this until our data grew from hundreds of thousands of "facets" to millions)
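For a sense of what we were issuing, Riak Search exposes a Solr-flavored HTTP endpoint. The index and field names below are made up; rows is the knob that matters for keeping result sets small:

    # Query a hypothetical "messages" index, capping the result set at 100
    curl "http://127.0.0.1:8098/solr/messages/select?q=topic:riak&rows=100"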
OMG It's broken, what's wrong - The error codes in the early versions of Riak we used were useless to us, and because we did not start with an enterprise support contract it was sometimes difficult to get help. Thankfully, this has improved a lot over time.
Mailing List / IRC do-si-do - Dust off your IRC client and subscribe to the mailing list. Both are great, and the Basho team takes responding there very seriously. We got help countless times this way. Thanks, team Basho!
I/O - It's not easy to run Riak on AWS; it loves I/O. To be fair, Basho says this loud and clear, so that's my problem. We originally tried a fancy EBS setup to speed it up and make storage persistent. In the end we ditched all that and went with ephemeral drives, which were dramatically more stable for us overall.
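Concretely, "going ephemeral" mostly meant pointing Riak's data directory at the instance store instead of an EBS volume. A sketch of the relevant app.config fragment (the mount point is whatever your image uses; /mnt is a common default for ephemeral drives):

    %% /etc/riak/app.config (fragment)
    {riak_core, [
        %% keep the data directory on the ephemeral drive
        {platform_data_dir, "/mnt/riak/data"}
    ]},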
Search Indexes (aka Pain) - Want to re-index? Dump your data and reload. Ouch. Enough said. We are working around this in a variety of ways, but I have to believe this will change.
Basho Enterprise Support - Awesome. These guys know their shit. Once you become an enterprise customer they work very hard to help you. For a real-world production application you want enterprise support via the licensing model. Thanks again, Basho!
The learning curve - It is a significant change for people to think in terms of an eventually consistent, distributed key/value store or a distributed async application. Having Riak under the hood means you NEED to think this way. It requires a shifted mindset that, frankly, not a lot of people have today. Build this fact into your dev cycle time or prepare to spend a lot of late nights.
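A tiny example of the mindset shift: with allow_mult turned on, a fetch can hand you back several sibling values, and it's your application's job to merge them. A hypothetical Scala sketch of a domain-level resolver (the Tags type and the union strategy are invented for illustration):

    object SiblingSketch {
      // Two replicas accepted concurrent writes; both versions come back as siblings.
      case class Tags(values: Set[String])

      // The merge must be commutative and idempotent: sibling order cannot matter.
      def resolve(siblings: Seq[Tags]): Tags =
        Tags(siblings.flatMap(_.values).toSet)

      // resolve(Seq(Tags(Set("a", "b")), Tags(Set("b", "c")))) == Tags(Set("a", "b", "c"))
    }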
Epiphany - One of the developers at work recently had an epiphany (or maybe we all had a group epiphany). Riak is a distributed key/value data store, and a VERY good one. It's not a search engine. It's not a relational database. It's not a graph database. Etc., etc. Let me repeat: Riak is an EXCELLENT distributed key/value data store. Use it as such. Since we all had this revelation and adjusted things to take advantage of it, life has been getting nicer day by day. Performance is up. Throughput is up. Things are scaling as expected.
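In practice, "use it as a KV store" meant maintaining our own lookup objects instead of leaning on search or link walking. A hypothetical Scala sketch (the key scheme and the riakFetch/riakStore helpers are invented stand-ins for real client calls):

    object KvIndexSketch {
      // Stand-ins for real client calls (hypothetical signatures)
      def riakFetch(bucket: String, key: String): Option[String] = ???
      def riakStore(bucket: String, key: String, value: String): Unit = ???

      // Instead of searching for an account's report keys, keep them in a known object:
      // "indexes" / "<accountId>/reports" -> comma-separated list of report keys
      def reportKeysFor(accountId: String): Seq[String] =
        riakFetch("indexes", accountId + "/reports")
          .map(_.split(',').toSeq).getOrElse(Seq.empty)

      // On write, update the data AND its index object in application code
      def addReport(accountId: String, reportKey: String, body: String): Unit = {
        riakStore("reports", reportKey, body)
        val keys = reportKeysFor(accountId) :+ reportKey
        riakStore("indexes", accountId + "/reports", keys.mkString(","))
      }
    }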
In Summary - Reading back through this, I feel it comes off a bit negative. That's not really fair, though; we're talking about nearly a year of learning. I love Riak overall, and I would definitely use it again. It's not easy, and you really need to make sure the context is correct (as with any database). I think team Basho is just getting started, but they are off to a very strong start indeed. I still believe Riak will really show its stripes as we scale the application further. We have an excellent foundation upon which to build, and our application is currently humming along and growing nicely.
I could not have come close to getting where we are right now with this app without a good team, either. You need a good devops-like team to build complex distributed web applications.
Lastly, and this is the real summary: Riak is a very good key/value data store. The rest it can do is neat, but for now I'd recommend using it as a KV data store.
I'm pretty open about the fact that, even with several months of intense development and a near-ready product under our belts, we are only scratching the surface.
What I'll talk about next is the stack, the choices we've made for developing a distributed Scala-based app, and how those choices have played out.
The SaaS Aggregation Benefit Mirage
In this service-oriented, on-demand world, I've been running into something again and again lately that I find interesting and a bit annoying.
To start, imagine I'm going to build an application that uses two third-party on-demand services. We'll just call them Service A and Service B and say each has two features. For this example it doesn't really matter what the services do.
Service A
  Feature A-1
  Feature A-2
Service B
  Feature B-1
  Feature B-2
So, I create my application, and it first uses Service A to do something, using Features A-1 and A-2. Then, with the output of that, it uses Service B to do something else using Feature B-2.
Now, a few months down the line, when things are going great, I get a call from my account manager at Service A telling me I can now get all the features of Service B directly from them, included. What they are telling me is that my service structure now looks like this:
Service A
  Feature A-1
  Feature A-2
  Feature B-1
  Feature B-2
Service B
  Feature B-1
  Feature B-2
On the surface this looks really good. It's the same thing with less hassle, right? Maybe not.
This is where my annoyance surfaces. Dig in, and dig in well. What I find again and again is that it's simply not true, because of what I'll call the filter effect. What you really get with this new and improved Service A is more like:
Service A
  Feature A-1
  Feature A-2
  Feature B-1
Notice that Feature B-2 is missing, and that probably nobody mentioned it. Or it's more like:
Service A
  Feature A-1
  Feature A-2
  Feature C-1
  Feature C-2
  Feature C-3
  Feature C-n-OMG
Service B
  Feature B-1
  Feature B-2
And you don't care about any of that, because C isn't B, and all you need is A-1, A-2, and B-2. They say it's equivalent, but it is not; your app uses Feature B-2, if you'll recall. How much time did you just spend figuring that out?
So, by the time you get through all this and figure out that the new, improved Service A + B is pretty useless, and that all you really want is what you already have, you will have wasted a lot of time. You get fewer features, more complexity, less control, and likely much worse service and support for the aggregated features, since you have no direct relationship with the endpoint provider.
So, rambling aside, the point is that these service-provider mashup aggregators are often not what they seem on the surface. Any "savings" likely gets eaten up later in a variety of ways that are difficult to predict. In most cases, the best deal is to go straight to the source and get exactly what you want.