Resources

Deployment with Capistrano via Webistrano

Where I work we frequently use Capistrano1 to deploy our PHP projects. The advent of Capistrano 2, and now a very interesting new web GUI for Capistrano 2 with multi-stage support2, looks to have just saved me a good bit of work. Or, in this case, work I was probably never going to get done anyway. Webistrano is a web-based GUI front-end to Capistrano, which is designed as a command-line tool. At its heart, Capistrano is a remote execution tool for running commands over SSH on one or many remote target machines. That makes deploying to, or gathering information from, many hosts at once trivial and quick. That's good for systems administrators.
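Capistrano's recipes themselves are written in Ruby, but the core idea is easy to sketch. The following is not Capistrano, just a rough PHP illustration (using the PECL ssh2 extension) of what "remote execution over SSH on many hosts" means; the host names, user, and key paths are placeholders.

<?php
// Run one command on several hosts over SSH -- the concept Capistrano
// automates with recipes, roles, and rollback handling.
$hosts   = array('web1.example.com', 'web2.example.com');
$command = 'uptime';

foreach ($hosts as $host) {
    $conn = ssh2_connect($host, 22);
    ssh2_auth_pubkey_file($conn, 'deploy',
        '/home/deploy/.ssh/id_rsa.pub', '/home/deploy/.ssh/id_rsa');
    $stream = ssh2_exec($conn, $command);
    stream_set_blocking($stream, true);
    echo "[$host] " . stream_get_contents($stream);
}
?>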

On NFS, ZFS, and OpenSolaris from Joyent

Storage is often on my mind these days. I was recently going back through some data and found a presentation from last February. It's quite relevant and well worth the time to watch. It is a 19-slide presentation given by Ben Rockwood5 of Joyent, called "NFS: A Customer Perspective, Bridging the Gap Between Research and Production."1 Nothing like a little light reading to clear the mind!

In summary, it is a bit Joyent-centric, which is fine because they are doing really cool stuff, but you'll learn about the strengths and weaknesses of OpenSolaris2, NFS, and iSCSI. You'll get some hints about the Joyent Accelerator3 platform. Finally, you'll get plenty of search bait4 to carry on searching and learning afterwards.

From a Business Perspective:

The crux of this is that if you choose and use your technology wisely, and manage it even more wisely, then you can indeed save time, and therefore money, when it comes to scaling your applications.

Resources:
1. NFS:  A Customer Perspective - http://www.cuddletech.com/Connectathon2007.pdf

2. OpenSolaris - http://www.opensolaris.org/os/ 

3. Joyent Accelerator Platform - http://www.joyent.com/accelerator

4. Search Bait - Information gleaned from a search, document, or target that provides information for further searches, information, or targets.

5. Ben Rockwood Blog - http://www.cuddletech.com/ 

Scalability and Performance: A Few Resources

I recently put together an email for some colleagues with a few links to resources related to some questions I had received about scalability vs. performance. These links are good resources for any developer or systems administrator interested in scalability, and I thought they might also be useful to others facing growing web applications.

Distributed Caching with Memcached - This is a good foundation article on the use and implementation of memcached.
http://www.linuxjournal.com/article/7451
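The pattern the article walks through is essentially a look-aside cache. A minimal sketch with the pecl/memcache extension, assuming a local memcached on the default port and a hypothetical loadUserFromDatabase() helper:

<?php
$memcache = new Memcache();
$memcache->addServer('127.0.0.1', 11211);

function get_user($id)
{
    global $memcache;
    $key  = 'user:' . $id;
    $user = $memcache->get($key);             // try the cache first
    if ($user === false) {
        $user = loadUserFromDatabase($id);    // hypothetical DB call
        $memcache->set($key, $user, 0, 300);  // cache for 5 minutes
    }
    return $user;
}
?>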

Good memcached FAQ - Very nice FAQ for memcached.  Touches on some items that are a little harder to find elsewhere.
http://www.socialtext.net/memcached/index.cgi?faq

memcached at Facebook - A short article about how memcached is used at Facebook.  A little dated, but useful.
http://lists.danga.com/pipermail/memcached/2007-May/004098.html

LiveJournal Architecture Links - Studying LiveJournal's architecture is worth your time.  They created memcached.
http://danga.com/words/2005_oscon/oscon-2005.pdf

http://www.slideshare.net/vishnu/livejournals-backend-a-history-of-scaling

A broader view of scalability, with some good information.
http://www.slideshare.net/techdude/scalable-web-architectures-common-patterns-and-approaches/

A quite good overall performance article, PHP-centric, in the context of serving JavaScript.
http://www.thinkvitamin.com/features/webapps/serving-javascript-fast

XDebug and KCachegrind - Learn to profile your applications.  Profile early and often to avoid problems later.
http://xdebug.org/

http://kcachegrind.sourceforge.net/cgi-bin/show.cgi

APC Opcode Cache - No well-dressed PHP application should be without a good opcode cache to save resources.
http://pecl.php.net/package/APC
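The opcode caching itself needs no code, just the extension, but APC also provides a small shared-memory user cache. A hedged sketch of that side of it, with an arbitrary key and TTL and a hypothetical build_site_settings() helper:

<?php
$settings = apc_fetch('site_settings');
if ($settings === false) {
    $settings = build_site_settings();          // hypothetical expensive call
    apc_store('site_settings', $settings, 600); // keep for 10 minutes
}
?>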

The Sphinx Search Engine

One of the tools I am very much looking forward to testing and working with is the Sphinx1 search engine.  I haven't managed to work up a post of my own yet, but the fine professionals at mysqlperformanceblog.com2 have written a nice one, and according to their republishing terms it's okay for me to post it here.  So, I'm doing just that.  One of the reasons I care about this is essentially data partitioning.  In most moderate-to-large applications, search should likely be handled separately from the main DB.  This seems obvious, but most out-of-the-box CMSes, open source or otherwise, bundle it in, and that can cause problems later as the searchable dataset grows.  In any event, on to the article!

Sphinx: Going Beyond Full-Text Search

I've already written a few times about various projects using Sphinx with MySQL for scalable full-text search applications. For example, on BoardReader we're using this combination to build search against over 1 billion forum posts totaling over 1.5TB of data, handling hundreds of thousands of search queries per day.

The count of forum posts, large as it is, is not the largest number we have to deal with in the project - the number of links originating from forum posts is a bit larger.

The task we had for links was to be able to search for links pointing to a given web site as well. The challenge in this case is that we do not only want to match links to "mysql.com" but also links to "www.mysql.com" or "dev.mysql.com/download/", as they are all considered to belong to the mysql.com domain, while searching for "dev.mysql.com/download" will only match the dev.mysql.com domain and files within the /download/ directory.

Initially we implemented it in MySQL using partitioning by the domain a link was pointing to, so "mysql.com" links were stored in one table group and "google.co.uk" links in another. We still had serious challenges, however. Because each link matches many search URLs - for example, "dev.mysql.com/download/mysql-5.1.html" matches "mysql.com", "dev.mysql.com", "dev.mysql.com/download/", and "dev.mysql.com/download/mysql-5.1.html" - we could not use a link=const WHERE clause but had to use link LIKE "prefix%", which means the index could not be used to get the 20 most recent links, and a filesort over the millions of links we had to youtube.com, wikipedia.org, and other top domains was extremely slow. Not to mention counting the number of links (and the number of distinct forum sites) pointing to a given URL, or graphs showing the number of links per day. To fight this problem we had to restrict the number of days we cover based on the number of links to the domain... but for some top domains it was slow even with just 3 days' worth of data.

You might point out that with a WHERE clause of the form link_date BETWEEN X AND Y AND link LIKE "prefix%" we would not be able to use the index past the link_date part. That is true, so we had to use link_date IN ( ) AND link LIKE "prefix%", which allows both keyparts to be used. That is much better, but still not good enough.
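To make the keypart issue concrete, here is a sketch of the two query shapes being described, written as they might be issued from PHP. The table and column names are stand-ins, assuming a composite index on (link_date, link):

<?php
// A range on the first keypart stops MySQL from using the second keypart,
// so the LIKE prefix cannot narrow the scan any further:
$slow = "SELECT * FROM links
         WHERE link_date BETWEEN '2007-06-01' AND '2007-06-03'
           AND link LIKE 'dev.mysql.com/download/%'
         ORDER BY link_date DESC LIMIT 20";

// Enumerating the dates with IN() keeps both keyparts usable -- the
// workaround described above; better, but still not fast enough:
$better = "SELECT * FROM links
           WHERE link_date IN ('2007-06-01', '2007-06-02', '2007-06-03')
             AND link LIKE 'dev.mysql.com/download/%'
           ORDER BY link_date DESC LIMIT 20";

mysql_query($better); // assumes an open MySQL connection
?>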

Caching is not good enough in such a case, as we do not want even a single user to wait for minutes, and the large variety of problematic search URLs does not allow us to use pre-caching, not to mention the general load such batch processing would put on the server.

The first alternative to this approach was to store duplicate data, storing a link to "dev.mysql.com/download/mysql-5.1.html" as links to the 4 URL prefixes I mentioned above. Unfortunately this would blow up the stored data quite significantly, requiring on average 6 rows for each link, and it does not solve all the problems - result counting and the number of distinct sites were still pretty slow, and we did not want to go into creating all of this data as summary tables.

Instead we decided to use Sphinx for this kind of task, which proved to be an extremely good idea. We converted all URLs to search keywords, and now those 6 rows become simply one row in the Sphinx index with 6 "keywords" - specially crafted strings which correspond to the URL prefixes. Of course we did not store these in the table, but instead used a UDF to convert each URL to its list of "keywords" on the fly.
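To illustrate the idea (this is not BoardReader's actual UDF, just a PHP sketch with an invented token format), a function that expands one URL into the prefix "keywords" described above might look like this:

<?php
function url_to_keywords($url)
{
    $url = preg_replace('#^https?://#', '', strtolower($url));
    $url = rtrim($url, '/');
    list($host, $path) = array_pad(explode('/', $url, 2), 2, '');

    $keywords = array();

    // Domain prefixes: dev.mysql.com -> mysql_com, dev_mysql_com
    $parts = explode('.', $host);
    for ($i = count($parts) - 2; $i >= 0; $i--) {
        $keywords[] = str_replace('.', '_', implode('.', array_slice($parts, $i)));
    }

    // Path prefixes: /download/mysql-5.1.html -> ..._download, ..._download_mysql_5_1_html
    if ($path !== '') {
        $prefix = end($keywords);
        foreach (explode('/', $path) as $segment) {
            $prefix .= '_' . preg_replace('/[^a-z0-9]+/', '_', $segment);
            $keywords[] = $prefix;
        }
    }

    return implode(' ', $keywords);
}

// Produces: "mysql_com dev_mysql_com dev_mysql_com_download dev_mysql_com_download_mysql_5_1_html"
echo url_to_keywords('http://dev.mysql.com/download/mysql-5.1.html');
?>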

As a result we can now pull up results even for youtube.com in fractions of a second, and we can show 3 months' worth of data for any URL. (We could use a longer time span, but we did not have enough memory for Sphinx attribute storage.) It is especially great as there is still room for optimization - Sphinx stores word positions in the index, while we do not need them in this case since we're doing a kind of "boolean full-text search". Plus, we can have the index built sorted by timestamp, which would let us save on the sorting that is currently still happening.

Using Sphinx in such a non-traditional way required implementing some features more typical of SQL databases than of full-text search applications. Group By was added to Sphinx so we could get the number of matches per day, or the number of matches per language.
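Through the standard PHP API that ships with Sphinx (sphinxapi.php), that group-by feature looks roughly like the sketch below; the index name, attribute name, port, and query keyword are assumptions on my part:

<?php
require 'sphinxapi.php';

$cl = new SphinxClient();
$cl->SetServer('localhost', 3312);
$cl->SetMatchMode(SPH_MATCH_EXTENDED);
// Count matches per day using a timestamp attribute on the index.
$cl->SetGroupBy('link_date', SPH_GROUPBY_DAY, '@group DESC');

$result = $cl->Query('dev_mysql_com_download', 'links');
if ($result !== false) {
    foreach ($result['matches'] as $match) {
        echo $match['attrs']['@groupby'], ': ', $match['attrs']['@count'], "\n";
    }
}
?>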

For the Domain Profile we had to use even more of those features, such as counting the number of distinct sites pointing to a given URL or domain. Honestly, this is where we cheated a bit - the distinct count is a bit approximate for large numbers, but it still works really well for our needs.

Sure, we could use summary tables for domains, but that would be a lot of work, rather inflexible if we wanted to add more features, and it would take a lot of resources to update or periodically rebuild for millions of domains.

As this part worked pretty well, we also decided to use Sphinx for another part of the web site - the Forum Site Profile. This uses some pre-generated data, such as the total number of posts for a forum or a thread, but most of the other data is built with Sphinx. It also uses a fair number of tricks, using fake full-text searches to retrieve all posts from a given web site or forum out of the global Sphinx index.

So in general we find parallel processing using Sphinx to be a pretty good solution for many data-crunching needs, especially when the lack of parallel abilities in MySQL makes its use rather inconvenient and pre-generating data is impractical or impossible.

If you're attending OSCON 2007 and would like to learn more about Sphinx, we have a BOF session on Thursday to talk about the topic.

1. Sphinx -  http://www.sphinxsearch.com/

2. An Excellent MySQL blog - SERIOUSLY! -  mysqlperformanceblog.com