Java Development News:

An Approach For Building Scalable, Available and Recoverable Applications

By Billy Newport

01 Jan 2000 | TheServerSide.com

This article is intended to provide approaches on building a scalable application. The application could be a web application, an e-trading system or any OLTP application that requires good availability and scalability. The basic idea is to split your application in to two portions. One that is not OLTP and can be replicated across boxes. The second piece is the OLTP portion and this must usually work on a single machine for consistency reasons. We're basically looking for a 'break' in that we expect the majority of operations to be sent to the replicated portion. The 'patterns' articulated here show how to make each of these scale well and operate in the presence of failures. There is no rocket science here and it's all old fashioned stuff. But, it applies to EJB application just as it applied to previous technologies.

Overview of what the site does.

The site has a main page on which users can either advertise an appartment for rent or they can search for all the listed appartments that match a query. The article will look at several approaches you could take for building such a site. There is no best approach. Each approach has merits and which one is best depends on project parameters, budget, availability importance, etc. It all depends.

First Iteration.

Here, we have a single box. It runs a web server, a servlet engine and a database. The application stores all appartments that are advertised in the database. This is obviously the simplest scenario. It's relatively cheap to develop this site. We could use Tomcat/apache or another low-end servlet solution such as jrun, WebLogic express, WebSphere Standard, or indeed just get Netscape enterprise server as it now comes with a servlet engine built-in.

We have a lot of choices for the database again. We could use free ones such as Interbase, mysql or postgres. We could also use a commercial database such as IBM DB2, Oracle, Informix or Sybase.

So, we could have a very low license cost if we choose depending on what products we choose. The down sides would be it wouldn't be so reliable as we only have a single box. If the box fails (CPU, memory, harddrive, LAN connection) then we would have to setup a second box. This could take some time. We might also need to use an old backup of the database. This means you may lose the days new transactions or the weeks if you do weekly backups. Acquiring and configuring the new box may take some time. So, from an availability point of view, this approach isn't great.

Scalability wise, we are also limited. We can scale up to the performance of the box. We could improve scalability by putting the database on to a second box. Both boxes could be the same power, but you'll probably find that the database box saturates before the web one does. So, a larger box for the database would make sense. We can ultimately only scale vertically using this approach. That means, we can go as fast as a single box allows. If we want a faster servlet engine then we buy a larger box. If we want a faster database, we also buy a bigger box. This is an expensive way to scale up. Larger boxes are usually a lot more expensive than their less powerful counterparts.

So to summarize, we could use two boxes. One for the servlet engine and another for the database. Scalability is vertical only, availability would be poor. This sort of arrangement should only be used for non critical sites or if it's all you can afford.

How-ever, a lot of sites operate just like this. Systems designed to scale vertically typically give very good single box performance numbers when compared with systems designed to scale horizontally. Usually a system thats is built to scale horizontally will trade some performance for availability and a significantly higher throughput due to the ability to use multiple boxes.

Second Iteration.

Here we will just extend option A. We will use multiple boxes for the servlet engine and a servlet engine that can be used in a fault tolerant configuration. The database is still on a single box. We will find here that whilst we can scale the servlets horizontally now across multiple boxes, the box hosting the database will still be only vertically scalable. This limits the usefulness of this arrangement. We'll still need an expensive machine for the database if we really want to let multiple servlet boxes perform well.

Availability wise, if a box hosting a servlet engine fails then we can potentially fail-over transparently to one of the remaining servlet boxes. This is probably where you'll see the difference between a free or open source app server and a commercial one. The commercial ones will support http session fail-over that when combined with http request load balancing will make the front end of your end pretty much bullet proof. It's worth mentioning here that PHP4 and mod_perl with extensions also support session fail-over.

How-ever, if the database box fails then everything comes to a stop until the database box can be restarted. We can't really fail the database over to one of the servlet boxes as the servlet boxes are not the same size. You could use the same type of box for both database and servlet engines. This may be expensive though but is arguably simpler to manage when all boxes are interchangeable.

It's worth noting that you can improve the situation by implementing a caching layer on top of the database in the servlet layer. EJB caching won't really work as every EJB container I know of always sends queries to the database engine. They can only take advantage of the container cache when you do a lookup using only the primary key. This is unlikely to be the case for this application.

Another possibility is Oracle 9i. The client for 9i includes queryable snapshots/caches of the main database. This off loads the database engine when the query can be run against the clients local snap shot.

If you're not using 9i, then you could try a product called Times Ten. This is an in-memory database that is very, very fast and can act as a queryable cache for a back end database. If you are thinking of relying on some memory caching scheme to satisfy your queries then make sure that the cache can survive a database failure. Make sure the connection pool can remove 'stale' database connections to the fail database and will sensibly handle the situation where a connection is needed but the database is down. I mean, of course, that the application is notified that a connection is unavailable after a timeout. The pool should also handle the situation when the database comes back up and start returning connections again. How-ever, I'll cover this type of approach more in the third iteration section.

It is when most projects get to here that most sites are probably going to get stuck and get expensive hardware wise. You may be tempted to get big iron in for the database machine. Some people argue that this is the cheapest way to scale the site. You may buy boxes such as IBM S/F80s, Sun E4500/E6500 boxes or lower down boxes like 8 way PIII boxes. Linux boxes aren't probably a good choice at this stage for running a large SMP box although this should get better with Linux 2.4 and later.

Third Iteration.

Our site allows clients to do searches against the appartment database. These searches can be quite complex and quickly saturate the database. This has limited our scalability so far. Once the database box saturates then it doesn't matter how many servlet boxes we use, we won't go any faster, indeed we may start to get slower.

Also, from an availability point of view, we're currently weak as once the database box fails then we can't do searches and we can't add new listings, so we're down until the database is restarted. We'll address both of these issues at least partially with this revision. We will install a database instance on every servlet box. We then setup database replication to replicate the database tables from the master database to all the databases on the servlet boxes. When a servlet box does a search, it now does it against the local database instance. Searching is now a lot better. The master database is no longer a bottleneck for searches. If the master database fails then searches can still continue as it is not used for searches. When a listing is added then it is added to the master database.

We have solved the searching problem from a performance and an availability perspective, how-ever, the master database is still a problem for adding listings. If the master database is down then we can't add listings. We can solve this a number of ways:

  • Buffer new listings locally.
    We store new listings in a special table in the servlet boxes database. A daemon on the box copies the entries over to the master database when it is up. When it's down, the daemon just waits until the database is up before resuming copying listings to the main database. We should use a two phase commit here to make this reliable.
  • Use messaging.
    Install a good quality messaging system on all servlet machines and the database machine. The servlet boxes just send the database box a message when a new listing is added. The database box has a daemon running to consume these messages and insert the listing in the database. The replication then pushes the new records out to the servlet boxes when scheduled. Once EJB servers start supporting message bean then you'll be able to run this messaging code inside the application server. The message bean containers will also handle pooling which will allow you to easily throttle the number of concurrent messages the VM will process. You will also be able to have transactions that span JMS and JDBC. This is a big improvement over today as you basically must use C++ today to get this transactional behaviour.
  • Database multi-master replication.
    It's possible that your database can replicate inserted rows on the local database to the 'master' database where they will later be replicated out to the other nodes. You could implement this using messaging but why bother. If you can get a pre-built solution that you know works then I'd stick with it unless you need something special. It's also worth noting that this type of replication can be problematic if multiple machines update the same records. This sort of contention is usually resolved using an automatic approach such as latest modification wins or worst case you use a manual approach. How-ever, it's best to just try to avoid these issues if possible.

The first option is a typical low cost, get it done approach. You could probably even use the databases replication mechanism to implement this. It will work but at the high end needs more work to scale up as we will shortly see. The messaging option depends greatly on the quality of the messaging product you have. If you use an RMI/IIOP based messaging product (this is what you get when you use the one that comes with an application server) then you have the problem that if the destination box is down then you can't send the message. This is no better than what we started with.

If you use a higher end product such as MQ series then if the remote box is down then the messages are buffered by MQ series on the servlet boxes until the database box restarts. It then sends the buffered messages.

It's important to realize that this is not a JMS problem, it is a problem with a particular JMS implementation. Remember, JMS is merely a collection of Java interfaces. The implementations of these interfaces and hence its characteristics depend utterly on the implementation.

So, we've improved a lot with this revision. Searching is pretty good now, we can scale horizontally and availability is as good as it gets. Each box has a local database for searches. If the box is up then we can search. Adding listings is also better now. I'd chose the messaging solution. This lets us continue to accept new listings even when the master database is down.

If you use Oracle 8 as the database then it's worth noting that it includes messaging. It can also be used for remote messaging. Given that we will have a database instance on the servlet boxes in any case then we could take advantage of this for our messaging needs.

Iteration 4.

Ok, we're good on the searching side. But, we are still using a single box for holding the master listings database and for accepting new listing messages from the servlet boxes. We need to partition the database to solve this issue. We want to be able to scale horizontally for this so that we can scale up simply by adding boxes.

We can statically partition the listings database. For example, if every listing has a contact telephone number, we could use the last digit to indicate which of ten databases to store the listing in. We have ten queues to receive new listing messages from the servlet boxes. We have ten boxes, each hosting a database instance and hosting its particular queue. This lets us spread the new listings over ten boxes. We could choose any partitioning mechanism we choose. This partitioning introduces several problems. The searches. We had a single database before. Now, we have ten databases that we need to replicate to the servlet boxes and the servlet boxes must run the searches over these ten databases (assuming we replicate the ten databases to ten local databases). Using the last digit of the phone number for the partition key now should look pretty dumb. This basically guarantees all searches must be run over all database partitions. There may be a data element that allows us to confine a high percentage of searches to a single partition, for example, number of rooms or state/zip code. etc. Choose your partition key carefully.

If we add a box then we need to redistribute the database using the new partitioning mechanism and then update the searching/partitioning code. This probably means down time while we repartition. We can simplify our lives here by using a database built to scale this way. DB2 Enterprise supports this approach out of the box. DB2EE can run on multiple boxes. When a table is created, we can specify a partitioning mechanism and DB2 will spread the table over the boxes.

Of course, running a join over multiple boxes is probably going to hurt. Joins are usually done between your reference data and your dynamic data. Reference data is data that is normally not modified as a result of your most frequent use cases. Dynamic data is data that does change with the most frequent of your use cases. Assign the reference data tables to a single box and then replicate those tables to all the nodes in your DB2 cluster. So replicate the static data to all boxes, assign the dynamic data to a single box. This should be a familiar pattern as this point. Now if a box fails then so long as we don't run a query that needs more than one partition we should still be fine. This is basically a fractal pattern here. We're replicating the static data to improve both performance (intra partition joins are faster than inter partition joins) and availability is better because if we lose a box holding data needed by most queries then we stay up as each partition box has the data replicated. The distribution can be equal or we can skew it. This is important as when the site first goes live then all the database boxes are probably equal. How-ever, when we replace a box after 6 months, it's probably twice as fast as the older boxes. DB2 can accommodate this as when we re-partition we can tell DB2 to put double the data on the new node as the old nodes.

The best part is that DB2 manages this partitioning. The servlet boxes still see a single logical database. We can send it queries and DB2 breaks the query up over the boxes. When we need to add another box to the system then DB2 can repartition the database to include the new box without stopping the database (with limitations, of course, nothing is perfect).

We could use Oracle 8i parallel server. This allows multiple boxes to share the disks for a single database. It implements a special locking mechanism that allows this arrangement. How-ever, due to the 'pinging' effect (only during database updates), you'll still probably need to partition the database at the application level to get really good performance for adding listings.

If you're in a company that insists on standard builds etc for production hardware and has selected a set of components, storage, clustering, operating system builds then stop. You can get a certified cluster system running DB2EE or OPS from IBM or Oracle (with Compaq/Sun etc). They can give you everything (storage, fibre, lan, cluster interconnect, server hardware, operating system etc). Use this configuration. Yes, it's not the 'standard' configuration but you know what, it'll work and the vendor will be able to support you. They will tell you what service packs etc you can apply and how to do it without trashing the system.

Don't try to 'certify' it your-self or make it work on your build/set of components. It's hard enough to get this sort of environment working without trying to make it work in a configuration that probably Oracle or IBM won't support in any case. Do your-self a favour and get a preconfigured, working solution that Oracle or IBM can install and then support you on afterwards. Microsoft are currently only selling the high end Windows 2000 software like this. I think it's a good move. If you want very reliable systems then everything has to be controlled. This is why mainframes are so reliable. It is a very well controlled environment. I expect this trend to continue as systems become more complex and companies start demanding higher availability from vendors. This stuff is complex and you're better off relying on prebuilt solutions when trying to deploy this sort of software.

So, we now have a master database that scales horizontally. I was recommending messaging as the mechanism for sending new listings to the database from the servlet boxes. The daemon processing these messages has now become a bottleneck given that we have all these boxes hosting the master database. How-ever, it's worth noting that we're at the high message rate (>500) before this becomes a problem. This is a high rate of knots when it comes to accepting new properties. MQ series does provide features to help here. We can install 2 or more MQ series in a cluster. MQ then load balances the requests over the boxes in the cluster round-robin. This spreads the load out and increases our maximum message throughput. It also increases the availability of the system as when a box fails then subsequent messages will be processed by the remaining boxes. The messages delivered to the failed box will be processed when it restarts.

If a database box is down and a message arrives requiring that box then the message can be requeued to be processed after the currently queued messages.

A note on availability.

Failures will happen. We try to design the database so that the failure of a box doesn't bring down the site. We should also design the system to degrade gracefully if we get multiple failures. A failure may be fixable with a simple reboot. How-ever, a hardware failure usually means you need to get a spare part or box to replace the failed unit. If you're using mirrored configurations, i.e. two boxes or pairs of disks for availability then you're taking risks. Once one half of the mirror fails then you're left with a single component with no backup running the show. If this goes down then you're out of action. If you're serious about availability then make sure you can replace failed components very quickly to remove the single point of failure or even better use triples or better so that you can survive a single failure without relying on a single box to carry on.

If you use an LDAP directory for configuration or user databases then replicate it to every box in the cluster. If the LDAP directory is down then no one can login. Not good... If you replicate the directory to all app server boxes then if the application server box is up then your users can login even if the network to the backend systems is down.

Look at your network infrastructure. If all the machines are on a single subnet on a single switch then if the network fails then having 20 boxes does you no good. Make sure that your LAN availability matches the same expectations your servers do, otherwise, you're wasting your money. Any system is only as good as its weakest component.

If you have a set of dependant applications running on a box then its also important to make sure that these applications are running. If an application fails then it's important that it restarts and that the other applications can recover from the failure. Some application servers such as WebSphere come with software to do this. WAS Generic servers implement this functionality. NT boxes usually use services to accomplish this. The service starts the application and then starts it again if it fails. Unix boxes usually have HA software that can do the same thing.

Above all, remember the rule, 'PAIR and SPARE'. There is no point in mirroring anything unless you have a spare that you can use to replace the failed part (thereby reestablishing your pair) until the failed part is repaired. If you run on just a pair then when a failure occurs, you have a single point of failure.

A note on the issues of replicating OLTP data.

We have no consistency problems here. All changes are only made to a 'master' database. These changes are replicated to a number of 'caches'. When a client does a query then the query runs against one of the caches. This means that the possibility exists that the data is slightly out of date because the current changes have not replicated to the system. The data will be consistent at all times. All databases will only replicate complete transactions to the cache boxes. You can use real time replication to reduce this latency to a minimum but there will always be a time window after a change is made to the master before the change appears in the cache boxes.

If your application can't work like this then you can't use cache boxes and you'll need a very serious box for the master database and you'll have availability problems if this box fails.

What can happen if the user sees out of date information? He could see a stock quote that is five minutes out of date and then enter a market order that gets executed at a different price. Workarounds for this would be timestamping the quote to let the user know the quote is stale. Limit orders are another option. If you trade online then you know this can happen anyway, your market orders can be executed after minutes sometimes when the market is busy.

Another workaround would be to include a confirmation step to let the client confirm what is going to happen with the actual data values concerning the transaction. A workflow system integrated with your backend system is a very useful tool to have when you start going down this route.

A more formal description.

We have a message bus. This needs to be very available and very scalable. Front end objects publish modification events to this bus. Core objects subscribe to these messages. As the core objects process these messages, they publish the data modifications to the bus. The front end objects subscribe to these events and cache the results. We scale up well on the front end because we can have many front end objects and with each caching all or a subset of the system state, we reduce the need to hit the bus and the core objects. We can scale the core objects by having a set of core objects and each is responsible for some independant partition of the system state. They could use selectors to only get messages that apply to the partition that they maintain. We could even have core objects that manage the same partition but with independant databases beneath them. When a messages comes in to change the system state for an item, they both do it independantly using completely seperate back end databases etc. These 'replicas' know about each other and because both are independant of each other but are basically always in synch, if one fails then the one can take over. This is an example of what is known in the business as a process pair. Jim Gray and Andreas Reuters book cover this is detail (see the book reviews section for a link).

This abstract architecture can be implemented many ways. I describe it above using a message bus comprised of (database replication for one way and a messaging product for the other). The front end objects are a combination of the cache database and the servlet engine. The core objects are our master database and the daemons consuming the modification messages coming from the front end. You could use a messaging product instead of database replication. But, you have more work to do. You can argue that publishing update messages is more generic than database replication, i.e. if we needed to feed the updates over a WAN to a third party then database replication is not so suitable. This is fine but understand also that there are messaging to database replication bridges available. One example is Sybase Event Broker. This is a message broker that can receive database replication events. These events (basically SQL commands representing the transaction being propagated) could be transformed in to an XML message and then republished using JMS. Of course, this may not be so simple and in fact the reverse process is easier. Our core objects could publish an XML event representing a complete transactional update on the core database. Event Broker could receive this using JMS and then convert it to a replication event for our front end database to process as before. Other systems that need the updates can now just receive the XML messages using JMS without worrying about SQL. This may be useful in our example as maybe we could republish these events to a news paper or a partner site rehosting our ads for example. It's unfair to say one approach is better but you need to examine what you're trying to achieve and then pick an approach that suits what you want to do, how much time you have and how much money you're got.

Recovery issues.

A failed cache node is no big deal. The users will move to another cache node automatically. Your applcation server should do this. When it restarts then the cache database will be brought up to back by the database replication.

Anywhere you have two phase commit, you should be very, very careful about your failure scenarios. Don't be mislead that 2 phase commit means that your data is safe. It is probably safe if nothing goes wrong but if something does goes wrong then you can be left with a loss of data. For example...

We are using two Oracle databases. They are on seperate machines. Customers orders are on box A, customer billing information on box B. When a customer places an order, both databases need to be updated. Easy, use a two phase commit to do it. If we get a failure then both databases rollback. All changes get applied or nothing gets applied.

Look safe? Let's get Murphy out. The logs for both databases are kept on mirrored disks. Box A has a pair and Box B has a pair. Lets say that box B's log disks fail. But they are mirrored, how! Lets look at some scenarios:

  • Both disks in the mirror are located in the server enclosure. The power supply feeding the disks is faulty and delivers too much juice to the drives. This out of spec condition causes the failure of one drive in the mirror. OK, the second disk is good and the data is still available. A spare disk is used to start rebuilding the failed mirror. But, now the second drive in the mirror also fails due to the voltage condition before the spare could be rebuilt. The data is now lost. This is a true story!
  • The machine raid controller becomes faulty and wipes out the data on both halfs of the mirror. An inexperienced staff member used one controller rather than two. This is also a true story.

Well, there are a lot of ways Murphy can get you. So, the outcome is that box B panics due to an IO error from its log disks. You restart the database, it rolls forward any available logs during it restart process but it only recovers to the last complete transaction available in the logs. The transactions logged to the failed drive are not available to the database and therefore are not recovered.

If you haven't spotted it yet, we're in a tight spot. Box B's database is now not consistent with the database on box A. Box A still has the changes from the last changes committed. Box B doesn't. We're in a tight spot. OK, Oracle is actually pretty good in this situation. When Oracle finishes recovery, you can obtain the id of the last transaction recovered from Box B. You can then stop A and tell it to recover until that transaction. A and B are now in sync but we're lost all the changes in the transactions lost on box B. Customers may have been billed for goods that we're now not delivering as we lost the orders.

Can we recover the lost messages to the master database using MQ series logs? It's possible but the difficulty is that you'll need to make MQ series play all the messages from the point box B recovered to until the end of the 'real' log. This will not be easy either. This problem will be associating the MQ transactions with the Oracle transactions.

Now imagine you've outsourced your order fulfillment and transmitted the orders to them using you're new B2B link with them. Remember how cool you thought it was that everything goes straight through in real time? Try asking them to roll back their system. Try asking your customers to enter the orders again. Try asking the credit card companies to refund all those orders that you have no information on because you lost the data. The picture should be clear. This is a very, very bad scenario to be in. I asked my mentor at the time, what could I do to recover this situation and he replied, "Don't let it happen in the first place". Notice he was emphasing avoidance rather than recovery, this is very good advice. If you're mirroring disks, put the disks in seperate external cabinets, seperate power and UPS, seperate cables, a controller for each half of the mirror.

Fine, a gloomy scenario. What could we do to help? We would do a few things. All messages sent from a cache node to the master nodes could be saved in a table in the cache db on the cache node. This would let us tell the cache db to simply send all the orders again to the back end from timestamp X onwards if this scenario took place. This allows us to bring the back end database up to date and we can still probably serve customers during this recovery. These messages could also be marked as 'replayed messages' when transmitted and this may help the master nodes reconcile any problems with external systems (your warehouse fulfillment operation, the banks etc).

When we have safely backed up the logs on the master database for the day then we could purge the saved messages on the cache nodes to avoid the disk filling. These type of replay is also very useful if you find a bug in your master node applications, purge the effected transactions and then replay them with the corrected software. Once you realise how useful a replay facility is then you'll want to keep those replay logs around for as long as you can as insurance.

The motto here is try not to rely too heavily on things like two phase commit etc. Build in a replay mechanism to your application. It could just save your job or even your company one day.

Summary

We started off with a simple site. We showed how we can architect to system until finally we have a site that can scale horizontally and has good availability characteristics. You may think that this is merely an appartment rental site but in fact these principles can be applied to most systems. For example:

  • e-trading system (instinet).
    The central system pushes the security prices to cache boxes (Persistence Power Tier) located close to the traders over a WAN. Traders get the prices from these caches using a multicast messaging product (Tibco RV). When a trader enters an order, the cache box forwards the message to the central system where the prices are updated and pushed back to the cache boxes.
  • A large PC makers online web site.
    We replicate the catalog from a central database to multiple boxes that host the actual shopping cart application. When an order is entered then the order is messaged to a central order management system. Inventory status can be replicated out to the cache boxes. Of course, the inventory level shown to the customer is not always correct, it may be slightly out of date but this may be acceptable (we could just display either 'in stock', 'low stock' or out of stock, for example).
  • An EJB portal.
    We replicate all the message boards, site content to application servers that run the actual portal application. When a new message is posted or when a new member registers then we message to a central database and the changes then get pushed out to the front end application servers.
  • Online banking application.
    We replicate bank statements and customer information. But, if the user enters bank transfers, debits, account changes, etc, then we message these to a OLTP back end accounts system. Most queries are probably viewing statements or current balance and can be spread over the cache boxes. Time stamp the balance or allow a real-time balance query as a step.
  • Travel Site.
    Replicate the schedules and last known ticket availability to cache box. Clients can then searches against the cache boxes to generate schedules. When they want to buy a ticket we sent the request to the master system and it then does the 'ultimate' availability check.

The basic lessons or patterns to draw from this article are:

  • Replicate 'static' data to your processor nodes.
    This improves horizontal scalability and improves availability. You can do this using memory based products or a replicated database. EJB container caching can offer this but usually only for primary key type queries. Memory based caching schemes will need time to warm up when they restart and can't start if the master database is down. Database replication based nodes don't have this weakness.
  • Use messaging for implementing OLTP use cases.
    This allows you to buffer work if the target processor node/database is not available and allows you to carry on working. Not all applications are suitable for this pattern. Cash withdrawals from an ATM for example, you don't want to keep giving a customer their last ten dollars over and over again because the banks accounts database is down and the last balance we cached was ten dollars.
  • Use horizontal partitioning for your OLTP portion.
    Use a partition enabled product or set up your OLTP section so that you can spread it out over multiple boxes without creating a coherency problem (each data record should be owned by one box and one box only).
  • Ruthlessly seek and eliminate single points of failure.
    The degree to which you do this depends on a cost/availability trade off. How far you go depends on how far you want to go.
  • Get something to manage the boxes.
    Lots of boxes equals lots of hassle. Spend the time to implement a good box management infrastructure. This should minimise the management of the boxes. It should be able to detect a hardware failure on a box and then potentially rebuild the box on a spare box (hardware failures) and then add the box in to the cluster without needed a manual step ideally. Ideally, you want your boxes to be like a disk storage system. When one half of a mirror fails then the disk is replaced ASAP using a spare then is rebuilt from the remaining. It should also potentially monitor the applications them-selves and restart them if they fail and if the application won't restart then the box should be marked as bad and removed from the cluster pending maintenance.

It's important to point out that this may all sound easy but the devil is in the details. Companies like Loudcloud are making money doing much simpler sites than this. So, keep it as simple as you can and buy in middleware suites designed to work together to minimize integration work.

But, at the end of the day, no matter what you buy, your people and their collective experience is what will make the difference between a success and a failure. If you think developing and managing all of this is hard then you're right. A lot of commercial sites choose to go down the 'big iron' approach is an effort to reduce the number of physical boxes, interconnections and instances of applications, hence the number of interconnections and dependancies thus reducing the complexity of the system. The more parts you have, the more complex the system becomes. The trick as always is managing it.

Sun's excellent Dot Com Builders site has a case study on eLance, a site that have employed similar patterns to its structure. Click here for the case study.