
Gluster 3.1 GA Release

Over the past couple of months I have been taking a really close look at GlusterFS for potential use on a virtualization project.  Today I saw the notice that version 3.1 was released.  That's good news.  They call it a scale-out NAS platform, which it is, but it's also a bit more than that.

I had the chance to speak at length with Anand Babu (AB) Periasamy and a few members of his team at VMworld recently about 3.1 prior to release, and it was genuinely interesting and exciting. I've been following the Gluster project for years and it really just seems to keep getting better and better.  Not only that, they seem pretty passionate about what they do, which is always a good thing.

Of particular interest in 3.1 is that you are now supposed to be able to add and remove nodes without impacting the applications using the cluster at all.  This is CRITICAL and was a major barrier to adoption previously: before this release, you actually had to restart the cluster to expand it.
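As a rough sketch of what online expansion looks like from the CLI (the volume name "vol0", host "server5", and brick path here are illustrative, not from the release notes):

```shell
# Add the new machine to the trusted storage pool:
gluster peer probe server5

# Attach its brick to the existing volume -- in 3.1 this no longer
# requires restarting the cluster or unmounting clients:
gluster volume add-brick vol0 server5:/export/brick1

# Kick off a rebalance so existing data spreads onto the new brick:
gluster volume rebalance vol0 start
```

Clients keep reading and writing the mounted volume while the rebalance runs in the background.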

One thing that can be challenging in large scale web environments is sharing files across many, and sometimes a varying number of, application servers.  I could see GlusterFS 3.1 being very useful in this scenario.  One recently published example is the way Acquia uses GlusterFS for scaling Drupal.

Of course, other options exist such as Swift from OpenStack, MongoDB with GridFS, Riak perhaps for smaller file sizes, and perhaps Ceph, which was just released.  The file / storage space is hot right now with change and even *gasp* innovation.  It is pretty exciting, and more choice over the last few years has been a very good thing.

I suspect I'll be writing more about this in the future assuming I can get some of the testing I want to do completed.  As usual, my lab in my secret lair is under powered and over utilized. *sigh*

Building43 GlusterFS Post Comments and Follow Up Thoughts

On the building43 site there was a post about running GlusterFS on Rackspace Cloud Servers (or Slicehost - same thing, really).  It is a good article that can help you get up and running.

I happen to really like GlusterFS and will almost certainly be blogging more about it in the near future as I have been doing some work with it lately.

However, in a Rackspace environment it seems that it's not very cost-efficient to use GlusterFS for storage there.  The reason is pretty clear: the cost per GB of storage is much too high, in my opinion.  It's so high because you can only get storage as part of a machine, and you can only get larger amounts of storage with the larger instances (which get quite expensive per month).  This is quite different from Amazon EC2, where you can provision EBS volumes to add storage to your storage nodes, which should help drive down storage costs on average.  Anyway, here were my comments from the post, and I encourage you to read the original post as well as check out the GlusterFS project.

The issue w/ running GlusterFS on Rackspace is that there is no way to add more block storage to an individual node. Also, according to the price list here:
http://www.rackspacecloud.com/cloud_hosting_products/servers/pricing

So, a 620GB node would be ~$700 /mo or $1.12 per GB. Then, of course, you need at least two or you haven’t really done anything useful. So, your price per GB will double to $2.24 / GB. That’s quite expensive.

Or, you’re limited to tiny little nodes. A 256MB instance with 20GB of storage * for about $11 / mo is $0.55 / GB * 2 for $1.10 / GB.

* You won’t be able to use all 20GB for your volumes.

Basically RackspaceCloud is missing an EC2 EBS-like analog for more affordable block storage.

The network bandwidth is severely limited for the smaller instances. This could be problematic. Or, is it not limited on the internal interfaces? That is unclear to me at the moment.

I love Rackspace Cloud and use it all the time. But I probably would not use it this way for anything very big. Still, the approach described in the article is a nice way to do active/active on a couple of nodes alongside applications that are already there and running in a load balanced way.
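For reference, a two-node active/active setup like the one the article describes boils down to a replicated volume. A minimal sketch, assuming hosts "web1" and "web2" with bricks at /export/brick1 (all names are illustrative):

```shell
# From web1, add web2 to the trusted pool:
gluster peer probe web2

# Create a 2-way replicated volume across both nodes and start it:
gluster volume create shared replica 2 \
    web1:/export/brick1 web2:/export/brick1
gluster volume start shared

# On each application server, mount via the GlusterFS client;
# either node can serve the mount, and writes replicate to both:
mount -t glusterfs web1:/shared /mnt/shared
```

With replica 2, losing either node leaves the data intact on the other, which is the active/active property the article is after.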