Stop Staring at my Polyglot!

I received an interesting comment/question via my blog recently.  It went a bit like this... 

I’m developing a distributed cloud application but my developers are pushing back on me for having a polyglot database strategy.  What should I do?

I won’t get into exactly what it is they are doing since that would take several more pages.  This is something of a stream of thought post so apologies if it is a little rough around the edges.  The easiest way to answer is in the context of an application I’ve been working on for a while that has some similarities to what this person wants to build.  Everything I’m describing is part of an app I’ve been building with a client since earlier this year.

Typically you'll need a few layers of "data storage" for any distributed batch or real-time application (a cloud native application), which is what I understand you are trying to build.

I consider anything that holds data for presentation, computation, or transformation to be part of the data storage architecture, and I like to break it down by time in storage, from the least amount of time to the most.

Short or Very Short Term: Single-node caches (like memcached) or volatile compute node memory
Mid Term: Queues and IMDGs (in-memory data grids)
Long Term Durable Storage: Dynamo and BigTable derivatives abound

There are numerous database products that live in or even between those tiers these days; more than ever before.  By no means is what follows even close to an exhaustive list.  A quick list of the ones I have worked with in the last few months personally looks like: 

Short Term: Memcached, Redis, RabbitMQ, ZeroMQ, DRAM, APC, MongoDB
Mid-Term: Redis, GridGain, RabbitMQ, ZeroMQ, MongoDB
Long-Term: Riak, MongoDB, S3, Ceph, Swift, CloudFiles, EBS, HBase

Short-Term storage is ALL in memory, not persisted to disk, and not intended to be used for long periods of time.  Your application also has to be able to deal with the fact that this type of storage is essentially ephemeral.  If the node gets a KILL signal from some source or another your app needs to know how to deal with this gracefully.  In other words, storage here is not durable at all.
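To make "deal with this gracefully" a bit more concrete, here is a minimal sketch of the cache-aside pattern: the app treats the cache as disposable and falls back to a durable copy when an entry is gone. Everything here is a stand-in I made up for illustration (EphemeralCache, load_profile, the plain dict playing the durable store); it is not the client API of memcached or any other product.

```python
class EphemeralCache:
    """Hypothetical stand-in for a single-node, in-memory cache."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def wipe(self):
        """Simulate the node being killed: every cached entry is lost."""
        self._data.clear()


def load_profile(user_id, cache, durable_store):
    """Read through the cache, falling back to the durable store on a miss."""
    value = cache.get(user_id)
    if value is None:
        # Cache miss (cold start or the node died): reload and repopulate.
        value = durable_store[user_id]
        cache.set(user_id, value)
    return value


# A plain dict plays the role of long-term storage in this sketch.
durable_store = {"u1": {"name": "Ada"}}
cache = EphemeralCache()

first = load_profile("u1", cache, durable_store)   # miss, then cached
cache.wipe()                                       # the node "gets killed"
second = load_profile("u1", cache, durable_store)  # miss again, still works
```

The point is only that losing the cache costs a slower read, never the data itself.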

Mid-Term storage is used for longer running processes.  It benefits greatly from being distributed and having a higher degree of durability.  This is generally still where most of the work is done in main system memory (no disk I/O), but it is also where you might do complex calculations or data transformations on your way to your goal.  You do it here because it’s fast, and because the work can be shared amongst lots of workers (like queue subscribers).
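As a rough illustration of work being shared amongst queue subscribers, here is a small single-process sketch using Python's standard queue and threading modules. It only mimics the shape of the idea; a real deployment would use a distributed queue or data grid, and the function name run_workers and its parameters are my own invention for this example.

```python
import queue
import threading


def run_workers(items, transform, worker_count=3):
    """Fan items out to several workers through a shared in-memory queue."""
    tasks = queue.Queue()
    results = queue.Queue()
    for item in items:
        tasks.put(item)

    def worker():
        # Each subscriber pulls work until the queue is drained.
        while True:
            try:
                item = tasks.get_nowait()
            except queue.Empty:
                return
            results.put(transform(item))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    collected = []
    while not results.empty():
        collected.append(results.get())
    # Workers finish in no particular order, so sort for a stable result.
    return sorted(collected)


squares = run_workers(range(5), lambda n: n * n)
```

All of the tasks are queued before any worker starts, which is why an empty queue can safely be read as "no more work" in this sketch.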

Long-Term storage is used for exactly that: long term durable storage of important data, with sufficient and reasonable interfaces from which to retrieve that data again when needed.  Preferably it’s possible to do things like map-reduce jobs, so that you can iterate over the data and retrieve what is necessary, which you may then operate on at one of the higher levels of this stack.
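For anyone who hasn't run a map-reduce job, the shape of it looks roughly like the sketch below: a map step turns each stored record into a key/value pair, and a reduce step combines the values for each key. The records and field names are invented sample data, and this runs in one process rather than across a cluster the way a real long-term store would do it.

```python
from collections import defaultdict

# Invented sample records standing in for rows held in long-term storage.
records = [
    {"region": "eu", "bytes": 120},
    {"region": "us", "bytes": 300},
    {"region": "eu", "bytes": 80},
]


def map_phase(record):
    """Emit one (key, value) pair per record."""
    return (record["region"], record["bytes"])


def reduce_phase(pairs):
    """Group values by key, then combine each group into a single total."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


totals = reduce_phase(map(map_phase, records))
```

The small result in totals is the kind of thing you would then hand up to the mid-term or short-term tier to work on.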

You’ll see that I’ve put some of them in multiple (or even all) categories, which might seem odd until you understand how they work and match the technology to whatever you are trying to achieve from a business perspective.
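If you want to see the overlap at a glance, the three lists above can be written down as sets and queried. This is just a small sketch over the product names already listed in this post; the helper name tiers_for is made up.

```python
# The three tiers exactly as listed earlier in this post.
tiers = {
    "short": {"Memcached", "Redis", "RabbitMQ", "ZeroMQ", "DRAM", "APC", "MongoDB"},
    "mid": {"Redis", "GridGain", "RabbitMQ", "ZeroMQ", "MongoDB"},
    "long": {"Riak", "MongoDB", "S3", "Ceph", "Swift", "CloudFiles", "EBS", "HBase"},
}


def tiers_for(product):
    """Return the names of every tier a product appears in."""
    return [name for name, products in tiers.items() if product in products]


all_products = set().union(*tiers.values())

# Keep only the products that show up in more than one tier.
multi_tier = {
    product: tiers_for(product)
    for product in sorted(all_products)
    if len(tiers_for(product)) > 1
}
```

With the lists as given, multi_tier ends up holding MongoDB (all three tiers) plus Redis, RabbitMQ, and ZeroMQ (short and mid).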

I tend to avoid anything that brings overly complex operational management when starting up projects, because I like to get my TCO (Total Cost of Ownership) over time (3-5 years) as low as possible while still achieving the project goals and SLAs.  There are a couple of exceptions on the list above that do have more operational overhead (MongoDB and HBase), but they are good enough in the right context that you might want to learn and use them anyway.

Now, back to the question at hand.  Should I use one type of DB or many for the needs at hand?  In this case, I’ve told them that they should use as few as possible, possibly only one.  The reason is that, in their case, they will value speed, consistency, and lower cost of operations at this early stage of their project.  They are developing an interesting distributed system for cool reasons.  I recommended a choice to them that I think will help them get to their goals fast and cost effectively, while allowing them to break off pieces of the application later, as and if needed.
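One way to keep that "break off pieces later" door open, sketched here under my own assumptions rather than as a description of what this team actually built, is to have the application talk to a small storage interface instead of a specific database. The class names (Store, InMemoryStore, App) and the checkout method are hypothetical, and the in-memory class simply stands in for whichever single database you start with.

```python
from abc import ABC, abstractmethod


class Store(ABC):
    """Minimal storage interface that the application codes against."""

    @abstractmethod
    def put(self, key, value):
        ...

    @abstractmethod
    def get(self, key):
        ...


class InMemoryStore(Store):
    """Stand-in backend; imagine the one database chosen early on."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)


class App:
    """Each kind of data gets its own Store handle, even if they are shared."""

    def __init__(self, sessions, orders):
        self.sessions = sessions
        self.orders = orders

    def checkout(self, session_id, order):
        self.sessions.put(("session", session_id), "active")
        self.orders.put(("order", session_id), order)
        return self.orders.get(("order", session_id))


# Early on: one backend serves every role.
shared = InMemoryStore()
early_app = App(sessions=shared, orders=shared)

# Later: sessions are broken off to their own backend, with no change to App.
later_app = App(sessions=InMemoryStore(), orders=shared)
```

The application code is identical in both cases; only the wiring at the bottom changes when a piece gets its own store.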

My parting words are that, over time, it will be nearly unavoidable that this application (like most applications of a distributed nature) ends up with a polyglot database strategy.  However, I do think that adds a lot of complexity and overhead, and in the early stages of a project it's not usually necessary unless what you are doing is of great complexity, in which case you might want to break it down into something more manageable anyway.