What follows is a pretty basic checklist that I've recently factored out of many experiences in my day-to-day work as a "Scale Consultant." When I review sites and talk to developers who are just starting out, who have been in operation and have just started to grow, or who run well-established sites that are migrating to cloud services, I find that I talk about these same things over and over.

This isn't intended to be an exhaustive treatment of the subject matter by any means. It is quite literally a "Was this considered when I was deciding on my software and systems architecture?" checklist. So, without further ado.

ORM for Data Partitioning and Query Splitting - Most databases are easily overwhelmed by Slashdot or media effects. Having a way to split read queries (selects) from write queries (inserts, updates, and deletes) from the start is a very wise move. This is often easy to do now with the ORM layers that some frameworks use.
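As a rough illustration of query splitting, here is a minimal sketch in Python. The `QueryRouter` class and the connection names are hypothetical and not taken from any particular ORM. It assumes one primary database that takes all writes and a set of read replicas.

```python
import itertools

# Only plain SELECT statements are treated as safe to send to a replica.
READ_VERBS = {"SELECT"}


class QueryRouter:
    """Send reads to replicas (round-robin) and everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary when no replicas are configured.
        self._replicas = itertools.cycle(replicas or [primary])

    def target_for(self, sql):
        parts = sql.split(None, 1)
        verb = parts[0].upper() if parts else ""
        if verb in READ_VERBS:
            return next(self._replicas)
        # Writes, DDL, transactions, and anything unrecognized go to the primary.
        return self.primary


router = QueryRouter("primary-db", ["replica-1", "replica-2"])
print(router.target_for("SELECT * FROM users"))          # a replica
print(router.target_for("UPDATE users SET active = 1"))  # the primary
```

A real ORM layer would usually make this decision from the operation type rather than by inspecting SQL text, and it would also need to account for replication lag.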
Monitoring process, resources, and uptime - I could probably write an entire article just about this subject. Monitoring done right is generally divided into three parts.
1. Process Monitoring - This makes sure things are running and stay running within certain tolerances. Examples are God, Monit, SMF.
2. Resource Monitoring - This is fine-grained monitoring of CPU, memory, disk space, disk I/O, networking, etc. Examples are Nagios, Ganglia, Munin, and Zenoss. Choosing correctly depends on your specific situation.
3. Uptime Monitoring - This is usually the only monitoring people do, if they do any at all. It should be done by a disinterested third party to provide accountability, what I call a third-party eye in the sky, should any dispute about uptime arise. I like Webmetrics and Pingdom at the moment as reasonably priced services that add good value.
Performance Testing and Capacity Planning - You can't drive a car if you can't see the road (or at least some representation of it). It is the same with your internet application. You can't make good decisions without doing some degree of Performance Testing and Capacity planning.
Static vs. Dynamic Content Splitting / CDN - There is just no argument that this must be done at scale. There are many ways to do it, including a reverse proxy and splitting static and dynamic content in a variety of ways. Make sure your framework or application supports this feature. Some make it easy, some make it difficult.
Bundling and Compressing JS and CSS - Sites have a large number of CSS and JS files these days. It's critical that you learn to bundle, compress, and version them, and then properly cache those bundles. This can have a dramatic effect on page load time.
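The bundle, compress, and version steps can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `build_bundle` function is hypothetical, it takes file contents as strings, and it does no minification. The content hash in the file name is one common way to version a bundle so it can be cached for a long time.

```python
import gzip
import hashlib


def build_bundle(sources, extension):
    """Concatenate sources, name the bundle by content hash, and gzip it.

    sources: list of file contents (strings), in load order.
    Returns (filename, raw_bytes, gzipped_bytes).
    """
    raw = "\n".join(sources).encode("utf-8")
    # A changed file yields a new hash, so the bundle gets a new URL and
    # old cached copies are never served by mistake.
    digest = hashlib.sha256(raw).hexdigest()[:12]
    filename = f"bundle.{digest}.{extension}"
    return filename, raw, gzip.compress(raw)


name, raw, gz = build_bundle(["body { margin: 0; }", "h1 { color: red; }"], "css")
print(name, len(raw), "bytes raw,", len(gz), "bytes gzipped")
```

On very small inputs like this one, gzip can make the output larger. The savings show up on real bundles.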
Logging - Log appropriately and monitor those logs. I never tire of sending developers back to their desks to check the logs when they tell me the server is broken. It's fun. Check your logs for common errors. In fact, perhaps you should write a small script to monitor the log files for segfaults, 500s, 404s, and other types of errors. Being proactive rules.
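A small log-watching script along those lines might look like the sketch below. It assumes access-log lines in the common or combined log format, where the status code follows the quoted request, and it treats any line mentioning a segfault as a crash report. The `scan_log` name and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Status code directly after the quoted request, e.g. ..."GET / HTTP/1.1" 404 123
STATUS_RE = re.compile(r'"\s(\d{3})\s')
SEGFAULT_RE = re.compile(r"segfault|segmentation fault", re.IGNORECASE)
WATCHED_STATUSES = {"404", "500"}


def scan_log(lines):
    """Count segfaults and watched HTTP statuses in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        if SEGFAULT_RE.search(line):
            counts["segfault"] += 1
            continue
        match = STATUS_RE.search(line)
        if match and match.group(1) in WATCHED_STATUSES:
            counts[match.group(1)] += 1
    return counts


sample = [
    '10.0.0.1 - - [06/Nov/2008:17:23:00 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [06/Nov/2008:17:23:01 +0000] "GET /nope HTTP/1.1" 404 0',
    '10.0.0.3 - - [06/Nov/2008:17:23:02 +0000] "POST /save HTTP/1.1" 500 0',
    "kernel: app[123]: segfault at 0 ip 00000000 sp 00000000 error 4",
]
print(scan_log(sample))
```

Run from cron against the last few minutes of a log, the counts could feed an alert whenever they cross a threshold that makes sense for your site.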
Pragmatic Caching - There are many, many types and layers of caching. Most current web applications will have at least three to five layers of caching to maintain acceptable performance and scalability of critical services. Learn everything you can about caching at the various layers in your technology stack.
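To make one of those layers concrete, here is a minimal sketch of an in-process cache with a time-to-live. The `TTLCache` class is hypothetical, and it stands in for only one layer. Other layers (browser, CDN, reverse proxy, shared cache, database) work on the same idea with different trade-offs.

```python
import time


class TTLCache:
    """Cache computed values in memory for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}    # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive work
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value


cache = TTLCache(ttl_seconds=30)
# The lambda stands in for an expensive query or rendering step.
front_page = cache.get_or_compute("front_page", lambda: "<html>rendered</html>")
print(front_page)
```

This sketch never evicts expired entries and is not thread-safe. A production cache would need both, which is one reason a dedicated caching service is usually the better choice at scale.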
Functional Decomposition - This was once overly expensive for many people. Now, with virtualization and cloud computing you can easily decompose your entire application into functional silos that are independently scalable and speak to one another as required. For example, app servers, monitoring, log aggregation, databases, message queues, upload servers, video encoding servers, and many more. Don't shove everything into one box anymore. Break it down by function.
Deployment - At the very beginning of your development lifecycle you should integrate your deployment process. It should be efficient, it should have a rollback capability, and it should be almost entirely automated for development, staging, and production environments. Of course, in some cases a human should still gate the deploy.
Asynchronous Practices - Remember that functional decomposition? For every task where one function talks to another, ask yourself: does that REALLY have to happen in real time, or can it happen over time? Learn the CAP theorem. You will learn quickly that in most cases work can be queued and done by a separate process, apart from the event that caused the work to need to be done in the first place. A good example is logging. I saw an application framework that kept every single logged event in a relational database and did all those inserts in real time for reporting purposes. Is that really necessary? Probably not. Put them into a file or even a cache and process the file elsewhere as a batch job, out of band from the user experience.
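The logging example can be sketched with a queue and a background worker. This is an in-process stand-in using Python's standard library. In a decomposed architecture the queue would more likely be a message queue or a log file consumed by a separate process. The `start_log_worker` function and the batch size are assumptions made for illustration.

```python
import queue
import threading


def start_log_worker(sink, batch_size=100):
    """Start a background thread that writes queued events to sink in batches.

    sink: a callable that receives a list of events (for example a bulk insert).
    Returns (events_queue, thread). Put None on the queue to flush and stop.
    """
    events = queue.Queue()

    def worker():
        batch = []
        while True:
            item = events.get()
            if item is None:  # sentinel: stop after flushing what is left
                break
            batch.append(item)
            if len(batch) >= batch_size:
                sink(batch)
                batch = []
        if batch:
            sink(batch)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return events, thread


written = []
events, thread = start_log_worker(written.append, batch_size=2)
for i in range(5):
    events.put(f"event {i}")  # the request path only pays for an enqueue
events.put(None)
thread.join()
print(written)
```

The request handler returns as soon as the event is queued, and the slow write happens later, in bulk, off the user's critical path.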
Make sure your application processes are as lean as possible. I'll demonstrate by way of a bad example. If your application server requires 30-90MB for a single request thread to be processed and can only seem to hammer out 6-7 of these requests per second, then you're going to be hurting seriously in the wallet down the line. That's just way too expensive for most applications. On a reasonably sized application server you'd only be able to support a handful of concurrent requests. So you'd need thousands of servers to handle millions of requests. I don't care what service you use, thousands of servers are expensive!
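The arithmetic behind that bad example is easy to run for your own numbers. The sketch below uses the 90MB per request and 6 requests per second from the example, and reads the 6 requests per second as a per-server figure. The 4GB of RAM, the 1GB reserved for the OS, and the 10,000 requests per second target are assumptions chosen only for illustration.

```python
import math


def max_concurrent_requests(ram_mb, reserved_mb, mb_per_request):
    """How many request threads fit in memory on one server."""
    return max(0, (ram_mb - reserved_mb) // mb_per_request)


def servers_needed(target_rps, rps_per_server):
    """How many servers it takes to sustain a target request rate."""
    return math.ceil(target_rps / rps_per_server)


# Assumed server: 4096MB RAM with 1024MB reserved for the OS and other services.
print(max_concurrent_requests(4096, 1024, 90))  # 34
# Assumed target: 10,000 requests per second at 6 requests per second per server.
print(servers_needed(10_000, 6))                # 1667
```

Cutting the per-request memory or doubling the per-server throughput changes these numbers directly, which is the whole case for keeping processes lean.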

Since I do this daily, these things seem obvious to me. They don't all fit every situation because every situation is unique. I hope that by writing them down they help someone else out who is just starting to feel the pressure of growth. If you think about, and do, most or all of these things up front, things will be a little bit better for you down the line if things heat up.

Update on 2008-11-06 17:23 by Kent Langley

This article was translated to French. Here is the link:

http://www.haute-disponibilite.net/2008/10/28/fiabiliser-votre-architecture/

Update on 2008-12-08 00:04 by Kent Langley

This article was just republished by Sys-Con.com. So, here's a little link love back to them.

I found that it quickly became one of my most well-read articles. So, I've been working on a follow-up to flesh things out a bit more. It'll either be a single large document or a series. I'm not sure which I'd prefer to do yet. If there are any opinions or requests, please let me know.

Update on 2009-01-02 21:13 by Kent Langley

One of the blogs I read, AKF Partners, posted a top 10 things for 2009 that is quite related to this post. So, it's worth a read as well I think.

Develop the ability to rollback
Break changes into smaller pieces
Remove SPOFs
Remove synchronous calls
Incent a culture of excellence
Develop a disaster recovery plan
Develop quality into the product from the start
Split your application or database
Start Logging
Celebrate your success

Full Article: http://akfpartners.com/techblog/2009/01/02/new-year%E2%80%99s-tech-resolutions/

There is a fair bit of overlap between their list, my list in this article, and another article I published earlier about launching a website. Taken together, they make a nice guide.

10 Simple Rules for Launching a Web Site

http://blog.solutionset.com/wpmu/2008/07/23/10-simple-rules-for-how-to-launch-a-web-site-successfully/