If you know me then you know I really like grids, clusters, and large scale computing technologies. So, I was happy to get the opportunity to attend the Hadoop Meeting up in San Francisco a little over a week ago. They said it filled up in 1-2 hours so I was lucky to get a spot it would seem. I'm a fan of the Hadoop project and I've been thinking about it for a while. What follows are some of my observations and thoughts about the event.
The Realities
I asked what the technical barriers were causing a 2000 node limit. I got back an answer that the namenode for the HDFS cluster simply can't scale beyond a couple of thousands nodes since it's a single point of failure non-horizontally scalable system. That's not exactly the way they put it but that's what they meant I do believe. That's a surprising weakness really. The namenode does two primary things. It manages file system namespace, including the replication factor, how many replicas of any single file there are in the total system, and regulates access to files by clients. For better or worse, I found myself wondering what needs to be done is to port FSImage and EditLog, the two main components of the namenode, to a distributed database format like CouchDB. Everything in the name node seems to be just a name/value pair that would benefit from using a distributed columnar DBMS.
I also asked the admittedly odd question of what is Hadoop very bad at doing. I really didn't get a good answer on that overall. But, I'll interpret the entire evening into saying that it's bad at being highly available and massively scalable due to currently existing single points of failure. This is one area where Hadoop and all it's related components shows it's youth. Of course, what I really meant was, what can Yahoo NOT port to Hadoop for competitive advantage because it's just not in the Hadoop map-reduce problem domain. Unfortunately, I didn't ask the question well enough. In general, there is as much or more sometimes to be learned from failed mistakes and experiments than from successes. But, nobody wants to talk about that much usually.
I also observed a trend in the statements made throughout the talk in that the code could be significantly optimized to run better on fewer nodes. In short, that would certainly be a money saver and help push the system with it's current architecture a little further.
They also mentioned that in no way are they trying to do the Globus like thing of distributing work over a WAN (Wide Area Network). The excellent Globus project is a long standing Grid framework well worth examining. This is a serious limitation of the system overall in terms of it's overall scalability in my opinion. It was also contrary to an earlier statement that said, "if you are going to be scalable then you must be distributed." But, to be fair, they reminded us many times that the software is young and there is still a lot left to do. In particular, it sounds like they are working very hard on their advanced job scheduler which will likely have some relevance in this area some day.
I learned that in no way does Yahoo consider Hadoop to be truly "real-time" production ready. Although they are using it and actively porting applications to use it, they seemed very hesitant about proclaiming it production ready. I think this reticence is primarily related to performance and scalability issues and the difficulty in job scheduling in complex multi-tenant environments.
Another piece of software and knowledge I picked up that is worth mentioning was regarding HBase and Varnish. I have no idea why I didn't think of this but you can front HBase with Varnish (or any reverse cache; aka any CDN probably) to scale read performance of the dataset/documents in document data stores. I need to email Jan from the CouchDB project and ask him if he's ever tried this for accelerating/caching CouchDB reads. I use varnish plenty for other projects but I hadn't thought to use varnish this way. So, I found myself wondering why not front HBase with Akamai for a globally distributed read farm? Anyway, those were the kinds of technical gems I was hoping for at the meetup.
I would have liked a little more tech and a case study type of information as opposed to the Hadoop history lesson. I knew the history going into the event so I was looking for more readily useful information.
The Take Away
Hadoop is a very nice piece of work and it, combined cloud computing initiatives it is well on the way to creating some extremely powerful tools for many uses. There is a great deal to learn from what the Hadoop project has accomplished thus far and it seems like they are on the path to greater things. I certainly got the message that they are working hard on the scalability and availability issues. I think this just adds more validity to my feeling that we live in an very exciting time technologically and that we have really only just scratched he surface of what is possible. I walked away with some new ideas and a good checkpoint on where the technology is today. Thanks to GigaOM for sponsoring.