> [Lots of stuff deleted]
> Again, I am talking durability here.
Hi again (and probably for the last time on this thread). Yes, I understand you're talking about durability, but you need to think about why you need durability. It's a requirement for transactions to ensure that they can recover from failures: ignoring participants, the durability aspect is intended for the coordinator to write its transaction log (aka intentions list) to durable storage, so that, in the event of a failure and subsequent recovery, the failure-recovery subsystem can complete the transaction. Now, durability doesn't necessarily require a hard disk: NVRAM would do in some cases, and active replication with an "appropriate" number of replicas may also be used. As with anything (including hard disks), you never get a 100% guarantee of recoverability: the D only buys you a good chance of surviving failures, short of something catastrophic like losing the disk itself.
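If it helps, here's a minimal sketch of that durability point in Java (the class and method names are my own invention, so treat it purely as illustration): the coordinator appends its intentions list to the log and forces it to the medium before any participant is told to commit.

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    // Hypothetical coordinator fragment: the commit decision must
    // survive a crash, so the intentions list is forced to stable
    // storage *before* participants are told to commit.
    class CoordinatorLog {
        private final FileChannel channel;

        CoordinatorLog(File logFile) throws IOException {
            this.channel = new RandomAccessFile(logFile, "rw").getChannel();
        }

        // Append the intentions list and force it to the medium. Only
        // after force() returns can failure recovery be guaranteed to
        // see this record and complete the transaction.
        void logCommitDecision(String txId, List<String> participants) throws IOException {
            String record = "COMMIT " + txId + " " + String.join(",", participants) + "\n";
            channel.write(ByteBuffer.wrap(record.getBytes(StandardCharsets.UTF_8)));
            channel.force(true); // the durability point
        }
    }

Until force() returns, the decision isn't durable; swap the FileChannel for NVRAM or a replica group and the same principle holds.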
As I mentioned to Andreas, I made time to write down some basic thoughts on the subject. It's a little long, but it's still nowhere near sufficient to do this topic justice. Hopefully it'll whet your appetite to do some exploring: it really is an interesting topic, and although perhaps not "rocket science", it's not trivial either.
First thing to remember is that availability isn't necessarily proportional to the number of replicas: throwing more replicas at a problem can actually reduce availability and performance. Availability depends on the number of replicas, but also critically on the replication protocol you are using. For example, if we use an Available Copies protocol (active replication, where there's no primary), then you can read from any one replica, but you must write to all available replicas. So, increasing the number of replicas *may* improve availability, but it will certainly decrease performance for write operations.
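To put the read-one/write-all tradeoff in concrete terms, here's a rough Java sketch (the Replica interface and class names are invented for the example):

    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical Available Copies client: reads go to any single
    // available replica, writes must reach every available replica.
    class AvailableCopiesClient {
        private final List<Replica> available;

        AvailableCopiesClient(List<Replica> available) { this.available = available; }

        byte[] read(String key) {
            // Any one replica will do, so reads scale with the group.
            Replica r = available.get(ThreadLocalRandom.current().nextInt(available.size()));
            return r.read(key);
        }

        void write(String key, byte[] value) {
            // Every available replica must apply the write, so each
            // extra replica adds another trip round this loop.
            for (Replica r : available) {
                r.write(key, value);
            }
        }
    }

    interface Replica {
        byte[] read(String key);
        void write(String key, byte[] value);
    }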
In fact, the number of replicas and their placement is critical across all protocols. There's been a fair amount of work in this area over the past decade or so, and replica placement really does matter. Take another example: if all of my replicas reside on machines that draw power from the same source, then I'm not improving availability in the face of power failures. In some cases, I may actually be able to improve availability and performance by reducing the number of replicas but placing them on highly reliable machines. But this isn't a hard-and-fast rule, and anyone using replication needs to consider their own environment and the types (and number) of failures they want to tolerate. You never get something for nothing, so you're often trading off performance for availability.
As with transactions (where you trade off performance for fault tolerance and isolation), availability is probabilistic. Even if you took an Available Copies protocol (AC) and threw 10 replicas into it, there's still no 100% guarantee that they won't all fail at the same time: all you can do is try to reduce the likelihood. But anyone who knows their probability theory and/or statistical mechanics knows that the universe always loads the dice, and given enough time anything is possible (like all the air molecules in the room you're sitting in suddenly deciding to go and spend time together in one corner).
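To put a back-of-envelope number on that: if each of n replicas fails independently with probability p, the chance of losing all of them at once is p^n, so with p = 0.01 you go from 10^-4 for 2 replicas to 10^-20 for 10. The catch is the independence assumption: correlated failures (shared power, shared network, a common software bug) are exactly what makes the real figure much worse than p^n, which is why placement matters as much as the count.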
After all this, what I'm trying to get at is that saying 4 replicas is better than 2, or that 10 is better than 4, isn't necessarily true, and even when it is, the gain may be a tiny fraction of a percentage point of availability bought at a real cost in throughput (e.g., 99.95% availability for 4 versus 99.96% availability for 10, but 10 write operations per second versus 5 write operations per second - I use these figures as examples only, as they're taken from some work we did back in the 90s and are not representative of all protocols and deployments).
Next there's the protocol type: active replication, single-copy passive replication, coordinator-cohort (somewhere in between), strong consistency, weak consistency. And within these there are different flavours (AC, weighted voting, epidemic, leader-follower, lazy replication, viewstamped, psync, etc.). The reason there are so many is that they each make assumptions about the environment in which they'll operate (e.g., that it can't be partitioned) and about the objects/states that they'll be used to replicate (e.g., active replication assumes that all replicas are deterministic). Once again, there are tradeoffs to be made. For example, if you take a particular active replication protocol, you may get really good performance for read operations and perhaps even for some write operations, but it may not be able to tolerate partitions, and it does require you to ensure all replicas are deterministic (if they start in the same state and receive the same set of messages in the same order then they'll end up in the same final state); that's actually harder to ensure than you might expect, especially when you throw non-realtime operating systems into the equation. Happily, passive replication doesn't have this determinism limitation, since only one replica (the primary) receives requests and then checkpoints its state to the other replicas (the backups); a passive protocol may even be able to tolerate network partitions. The drawback is that performance in the event of primary failure is poorer than for active replication, because a new primary has to be elected by the backups.
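A caricature of the passive scheme in Java, just to fix the roles (all names invented):

    import java.util.List;

    // Hypothetical primary in a passive (primary-backup) scheme: only
    // the primary executes requests, so the backups never need to be
    // deterministic; they just install whatever state they're sent.
    class Primary {
        private byte[] state = new byte[0];
        private final List<Backup> backups;

        Primary(List<Backup> backups) { this.backups = backups; }

        void handleRequest(byte[] request) {
            state = execute(request);          // may be non-deterministic
            for (Backup b : backups) {
                b.installCheckpoint(state);    // ship state, not requests
            }
        }

        private byte[] execute(byte[] request) { return request; } // stand-in
    }

    interface Backup {
        void installCheckpoint(byte[] state);
        // On primary failure the backups run an election and one of
        // them takes over from the last checkpoint it installed.
    }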
So once again there are tradeoffs to be made, and these tradeoffs need to take into account the specifics of your requirements and environment.
With all of the above in mind (and it does gloss over a lot of issues), we can turn to the problem at hand, which is replicating a specific type of object or service (a JMS server). All I can describe here is *one* specific way in which it *could* be done, and this shouldn't be taken as definitive. I hope I've shown that it's just not possible (or safe) to say to all users that N replicas are what they need to achieve a certain level of availability; it needs to be done on a per-user basis and only after careful consultation with that user to determine their deployment environment.
Let's assume we pick passive replication (and for simplicity ignore how the primary is chosen, but we'll assume it's the same primary for all clients); we'll stick each JMS server on a different machine as well. Furthermore let's assume that we want to tolerate 2 simultaneous machine failures (maybe even prioritise this on the type of message) and still be able to gain access to the latest data. In this case we might decide that we want to have 5 replicas, and the rule for the primary is that when it gets an update it has to sync that update to its local disk and to one of the backups before it can return to the client. In the background (after the return to the client), the primary or the backup can then disseminate the updates to the remaining backups, which are out of date.
BTW, this assumes we're not using multicast protocols (we'll stick with point-to-point). However, if we were, then there are obviously advantages to this that we could capitalise on.
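For concreteness, here's roughly what that write path might look like in Java (the interfaces are invented for the example; a real implementation would also handle failures and acknowledgements):

    import java.io.IOException;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;

    // Hypothetical primary for the 5-replica example: one backup is
    // brought up to date synchronously, the rest are updated lazily.
    class JmsPrimary {
        private final DurableLog localLog;          // the primary's own disk
        private final BackupStub syncBackup;        // one backup on the critical path
        private final List<BackupStub> lazyBackups; // the remaining three

        JmsPrimary(DurableLog log, BackupStub sync, List<BackupStub> lazy) {
            this.localLog = log; this.syncBackup = sync; this.lazyBackups = lazy;
        }

        void onUpdate(byte[] update) throws IOException {
            localLog.append(update);   // 1. force to local disk
            syncBackup.apply(update);  // 2. sync to one backup
            // 3. the client can be answered here: the latest state now
            //    exists on two machines

            // 4. trickle the update to the remaining backups in the
            //    background, off the critical path
            for (BackupStub b : lazyBackups) {
                CompletableFuture.runAsync(() -> b.apply(update));
            }
        }
    }

    interface DurableLog { void append(byte[] update) throws IOException; }
    interface BackupStub { void apply(byte[] update); }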
Now you may ask what 5 replicas give you over 2, since only 2 are brought up to date initially. The answer is that in the event of more than 2 failures (though not simultaneous failures), the recovery time for the entire replica group can be better than having to start up a fresh replica and get it to obtain all of the state. Trickling updates to warm standbys is often better than starting from cold.
And then there are "tricks" we could play with heartbeats, along the lines of epidemic (gossip-driven) protocols: in many replication protocols (e.g., virtual synchrony) the replicas need to keep track of who's up (in the group) and who's down (not in the group). They do this by heartbeat messages (pings) exchanged between each other (or this may be entirely driven by a master/slave). Obviously there's no way to tell with certainty that a replica/machine has failed (network partitions or loss of messages may give the same appearance) until that replica/machine recovers and tells you, so all of this is really failure *suspicion*. But I digress - back to heartbeats. Most of the time these messages will just be "are you alive?" or "what's your crash count?". However, because they are exchanged at regular intervals, you could piggyback more information on them, such as the changes in state that have occurred between the last message and this one.
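In sketch form (again, all names invented), the piggybacking costs almost nothing because the heartbeat had to be sent anyway:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical heartbeat that carries state deltas for free: the
    // message goes out on a timer regardless, so updates ride along.
    class Heartbeat {
        final String sender;
        final long crashCount;                 // "what's your crash count?"
        final List<byte[]> piggybackedUpdates; // deltas since the last heartbeat

        Heartbeat(String sender, long crashCount, List<byte[]> updates) {
            this.sender = sender;
            this.crashCount = crashCount;
            this.piggybackedUpdates = updates;
        }
    }

    class ReplicaMember {
        private final List<byte[]> pendingDeltas = new ArrayList<>();
        private long crashCount;

        // Called on the regular heartbeat timer.
        Heartbeat nextHeartbeat(String myId) {
            List<byte[]> toSend = new ArrayList<>(pendingDeltas);
            pendingDeltas.clear();
            return new Heartbeat(myId, crashCount, toSend);
        }

        // A missed heartbeat only raises a *suspicion* of failure; the
        // peer may merely be partitioned or slow.
        void onHeartbeatTimeout(String peer) {
            suspect(peer);
        }

        private void suspect(String peer) { /* remove from view, etc. */ }
    }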
I've tried to give a flavour of what is possible using replication protocols and techniques, but it's really difficult to do the subject justice in such a short space (it took me 200 pages in my thesis back in 1991, and things have moved on since then). There's no single protocol (and its rules) that has to be used to replicate all types of object/state. Choices are made based on the characteristics of the environment and the thing that is being replicated, but in order to make the right choices, you need to understand what their ramifications are (e.g., active replication *requires* determinism). And it's important to remember that just throwing more replicas at a situation isn't necessarily going to improve availability.
BTW, another option might be to use a virtually synchronous reliable causal broadcast mechanism, so that only causally related messages need to be seen in the same order by everyone. The advantages of this should be fairly obvious.
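For a flavour of how causal ordering can be enforced, here's the textbook vector-clock delivery test in Java (a sketch of one standard technique, not of any particular product):

    // Sketch of the standard causal-delivery test using vector clocks:
    // a message from sender j is delivered only once everything it
    // causally depends on has been delivered; messages that are merely
    // concurrent are never forced to wait on each other.
    class CausalReceiver {
        private final int[] delivered; // delivered[k] = messages delivered from k

        CausalReceiver(int groupSize) {
            this.delivered = new int[groupSize];
        }

        boolean canDeliver(int sender, int[] msgClock) {
            if (msgClock[sender] != delivered[sender] + 1)
                return false; // not the next message from this sender
            for (int k = 0; k < delivered.length; k++) {
                if (k != sender && msgClock[k] > delivered[k])
                    return false; // a causal predecessor is still missing
            }
            return true;
        }

        void onDelivered(int sender) {
            delivered[sender]++;
        }
    }

Causally unrelated messages can be delivered in different orders at different replicas, which is exactly where the performance win over totally ordered (atomic) broadcast comes from.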
All the best,