XA Exposed: How 2 Phase Commit works in J2EE

Discussions

News: XA Exposed: How 2 Phase Commit works in J2EE

  1. XA Exposed: How 2 Phase Commit works in J2EE (107 messages)

    Mike Spille has spilled his guts out on distributed transactions. He starts off by looking at the history of 2 Phase Commit / XA and then delves into some of the innards of 2PC.

    "A critical item to understand about XA in the J2EE world is that your highest costs in terms of execution time are going to occur from "disk forcing". Most of the transactional work in XA is plain old work that you'll have to pay for one way or another - your RDBMS insert/update/selects, your JMS publishing, etc. Bits of the XA protocol around these pieces add no really noticable costs - in fact, in some implementations of XAResources, calls like XA start() and end() may be coded as fire-n-forget, so you're not even paying for a round trip network cost to the server.

    The real cost in XA is disk forcing the transaction logs at critical points in the protocol. Let me 'splain what I mean here, Lucy.

    In order to meet the failure/recovery requirements of XA, the Transaction Manager and all XAResource managers have to record transaction information "durably". In practical terms, this means they have to save the data to disk in some sort of transaction log. The XA protocol defines exactly when such transaction log disk-forces have to happen. This gives guarantees to the various components and allows them to make certain assumptions."


    Read Mike Spille on XA Exposed, Part I

    Threaded Messages (107)

  2. Monkies in the System?[ Go to top ]

    I personally hope that this statement was a Freudian slip:

    "As you might expect, these early enterprise integration efforts lived up to the "enterprise" monkier by causing the waste of uncounted millions of new dollars"

    I assume you meant moniker. But I like "monkier" better. Mental vision: A bunch of monkey-coders sitting around hammering out the XA code.

    Jason McKerr
  3. Monkies in the System?[ Go to top ]

    Thanks for pointing that out, it's fixed now in the Blog.

    And I think you may be right - "monkier" might have flitted into my subconscious and escaped onto the page. What with me talking about enterprise integration people wasting millions of dollars, I could easily envision a huge consulting company charging $200/hr for literal monkey XA coders. :-)

         -Mike
  4. Excellent writeup. I look forward to Part 2 and hopefully a Part 3.

    Out of curiosity, does WebSphere use Component Broker as its TM?

    Also, do you know of anything out there that will do a "distributed JMS transaction" (commit only when the message has reached the target queue manager, potentially involving hops) or whether a spec is in the works for this? (if this even makes sense or is seen as a requirement by many)

    Thanks
    Ghanshyam
  5. Thanks for the positive feedback. Since I got several hundred hits on this entry in just a few hours, it seems there's some small interest and doing a part III would be worthwhile. Hopefully the boss and the wife will understand.

    \Ghanshyam Patel\
    Out of curiosity, does WebSphere use Component Broker as its TM?
    \Ghanshyam Patel\

    I honestly can't recall precisely what WebSphere uses, even though I was told several times what was under the covers. Too many months ago and too much going on since then. I do know my work was under WebSphere 4.x, and that 5.0 apparently uses a different transaction manager (supposedly closer to the reference JTA than the 4.x one was). I do remember some deep stack traces with lots of transaction lifecycle hooks in it. I imagine Billy Newport could give you details if he's lurking about, and of course if this isn't some sort of devilish trade secret :-)

    \Ghanshyam Patel\
    Also, do you know of anything out there that will do a "distributed JMS transaction" (commit only when the message has reached the target queue manager, potentially involving hops) or whether a spec is in the works for this? (if this even makes sense or is seen as a requirement by many)
    \Ghanshyam Patel\

    Hmm, from your question I'm not clear what you're asking. Could be:

      - Transactions in a clustered implementation in general (Don't think so, but could be...)
      - Transactional semantics extended to cover receivers of messages? For example....a transaction is live until consumed, and the consumer can cause a rollback that goes all the way back to the original publisher?

    Could you clarify with a quick example?

        -Mike
  6. Distributed JMS transactions[ Go to top ]

    It is exactly the second case:

    "Transactional semantics extended to cover receivers of messages? For example....a transaction is live until consumed, and the consumer can cause a rollback that goes all the way back to the original publisher?"
  7. Distributed JMS transactions[ Go to top ]

    To add to the previous post, I know that such a thing would appear non-sensical since the very nature of messaging implies asynchrony and transactions in such an environment do not make sense (a message could be delivered days later). However, almost all the instances I've seen messaging being used have been in near-synchronous environments where a response is expected quickly (within a minute or two, usually within a few seconds). In such an instance, it would be nice to know if the receiver failed in its task. The receiver and sender each do their local JMS txns but it would be nice to have them participate in a global txn...perhaps send the Xid in the message?
  8. Distributed JMS transactions[ Go to top ]

    If you look at such a beasty strictly from a business use-case point of view, it makes a tremendous amount of sense. You want the decoupling of messaging, but at the same time the far-end may be involved in critical processing and you may want to know if the far-end fails, and take some action if that occurs. At an intellectual and abstract level, it seems very attractive (to me, at least).

    The problem though is how to implement the nuts and bolts of something like this, and how it compares to other techniques. If you consider a non-XA transaction and JMS queues, then you've narrowed things down to the sender and the receiver. Here things aren't too bad, and are maybe manageable. The only problem is that you need to keep extra transaction state open until the receiver gets the message and you can propagate a response back. This can cause throughput problems and take up extra memory - and potentially also things could be locked while this occurs, reducing concurrent processing capability even further.

    Even if you can accept those, you have to consider failure modes - the far end may not be up, or there may be a network hiccup going on. Then you can be piling up open transactions for quite a while. Or what if the sender goes down while the receiver is processing? What are the semantics then?

    That's the view for the simple case. If you're talking JMS pub/sub, it's much more complex since you can have N receivers. Add in XA transactions and things really get nasty - you don't want to hold something like a table lock in an RDBMS waiting for a potential rollback from a receiver.

    Some people have done some research in this area, and IIRC some IBM researchers have put together some systems that have this sort of distributed transaction semantics for messaging. Their solution, if my memory holds, involves a combination of regular transaction stuff and "compensating transactions" - which means you commit fast in local nodes, and a rollback consists of doing a second transaction which undoes what the original transaction did.

    Ultimately, doing this sort of thing in a generic manner gets very messy very fast, and in all cases I've seen so far you either end up with a big performance hit and some system-wide fragility (you're relying on a lot of components over a wide area to all get some tricky semantics right), or the app developer has to do a lot of extra work to specify something like compensating transactions.

    In the end, trying to make this work for the general case seems to be more trouble than it's worth. Most of the time, it's far easier to have the application coordinate with asynchronous notifications back to the sender, and to thereby force your whole architecture (at least any piece involving messaging) to be all-asynchronous.

    A better way to put all of this is that you don't want too many "smarts" in your basic protocols. You're almost always better off using the simplest protocols possible, and layering behavior on top of them. Just like TCP layers on top of IP to add reliable message delivery and ordering, for something like distributed messaging transactions you'd be better off using the basic JMS building blocks already there and layering an async notification system on top of it.
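
    A minimal sketch of that layered approach using plain JMS (the queue names, correlation ID and timeout are made up, and the connection/session setup is the usual JMS boilerplate): the sender attaches a reply destination, commits its local send, and then treats the far end's confirmation as ordinary application-level messaging.

    import javax.jms.*;

    // Layering an application-level acknowledgement on plain JMS instead of stretching
    // one transaction across sender and receiver.
    class RequestWithConfirmation {
        void sendAndAwait(Session session, Queue workQueue, Queue replyQueue) throws JMSException {
            MessageProducer producer = session.createProducer(workQueue);
            TextMessage request = session.createTextMessage("do-the-work");
            request.setJMSReplyTo(replyQueue);           // where the far end reports back
            request.setJMSCorrelationID("order-42");     // ties the confirmation to this request
            producer.send(request);
            if (session.getTransacted()) {
                session.commit();                        // make the request visible before waiting
            }

            // The far end runs its own local transaction, then publishes success/failure here.
            MessageConsumer consumer =
                    session.createConsumer(replyQueue, "JMSCorrelationID = 'order-42'");
            Message reply = consumer.receive(30000);     // application-level timeout
            if (reply == null) {
                // No confirmation: compensate, retry, or escalate - an application decision.
            }
        }
    }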

         -Mike
  9. And I always thought ...[ Go to top ]

    ... Mike Spille disappeared from theserverside.com. Well, it seems this was true last year only;).

    Thanks for the article,

    Jens
  10. Distributed JMS transactions[ Go to top ]

    Compensating transactions are exactly what's needed for this case. I first heard of the idea in terms of workflow and BPM, where the steps are a series of asynchronous actions. Basically you just need to back out the changes that have been made. It is, unfortunately, more complex to do than to say, as the data affected may have been modified in the meantime... For these cases, workflows may need to be routed to a user for resolution.

    If you really want a single transaction and decoupled objects, look at something like IoC or abstract factories to decouple your classes and do a regular synchronous method call within a local transaction. Trying to fit synchronous transaction semantics around asynchronous message architectures is going to be painful.
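
    A small sketch of the compensation pattern, purely illustrative (the step interface and workflow class are hypothetical): each step commits fast as its own local transaction and knows how to undo itself, and a failure triggers the undo chain in reverse.

    import java.util.*;

    // Compensation sketch. Detecting that the data changed in the meantime (and
    // routing to a human), as described above, is deliberately left out.
    interface CompensatableStep {
        void execute() throws Exception;
        void compensate();   // best-effort "undo", committed as its own local transaction
    }

    class CompensatingWorkflow {
        void run(List<CompensatableStep> steps) throws Exception {
            Deque<CompensatableStep> done = new ArrayDeque<>();
            for (CompensatableStep step : steps) {
                try {
                    step.execute();              // each step commits locally and fast
                    done.push(step);
                } catch (Exception failure) {
                    while (!done.isEmpty()) {
                        done.pop().compensate(); // undo what already committed, newest first
                    }
                    throw failure;
                }
            }
        }
    }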
  11. Some gotchas????[ Go to top ]

    Mike:

    Some areas in the article that I found a bit suspicious or tough to believe!

    (1) In the example, PP-05 and PP-06 you mention that the end() call occurs only when the connection close / session close occurs, though you have used the word "typically" to qualify it :-). I doubt this - I believe even if your connection closure is outside the scope of the txn boundary, XA will work just fine. I believe it is because txn.commit() is "overloaded" to take care of this scenario. Your thoughts???

    (2) 2pc-10 you say that WebSphere writes prepare results only if the result is a commit. Assume XAResource-A voted rollback, XAResource-B voted commit. Assume this is not recorded, because you say <quote>If the first resource voted rollback, then this action is never recorded.</quote>. And before issuing commit the WebSphere txn manager crashes. Will XAResource-B commit or rollback when WebSphere comes back up?? How will you know the fate of that txn, as prepare() results are not recorded (as you said) in this scenario? I deduce this is because prepare() can be called N number of times. So the txn manager, when it comes back up, can call the XA Resource Managers again to ask the fate of the txn. In that case, why "disk force" prepare() at all - even if the result is commit() or rollback() - because even if the txn manager crashes, when it comes back it can call prepare() again to get the results?? Very confusing :-))

    -Sanjay.
  12. Some gotchas????[ Go to top ]

    \Sanjaya Ganesh\
    (1) In the example, PP-05 and PP-06 you mention that the end() call occurs only when the connection close / session close occurs, though you have used the word "typically" to qualify it :-). I doubt this - I believe even if your connection closure is outside the scope of the txn boundary, XA will work just fine. I believe it is because txn.commit() is "overloaded" to take care of this scenario. Your thoughts???
    \Sanjaya Ganesh\

    The TM has pretty wide latitude in deciding when the end() happens. It can keep the XAResource "open" until txn.commit() happens if it wishes, but it can also call end() at other points as well, such as when close() is called on the connection/session. In fact, a TM can call start(), end(), start(), end() ... repeatedly on the same resource if it wants to.

    You are correct though that end() isn't really tied to closure of the connection/session. The only real "requirement" is that it happens at txn.commit() time, but it _can_ happen earlier.
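
    As an illustration, here is a minimal sketch of the kind of call sequence the spec allows, using the standard javax.transaction.xa API; how the XAResource and the Xid are obtained, and the work done between the calls, are placeholders rather than anything from the article.

    import javax.transaction.xa.XAResource;
    import javax.transaction.xa.Xid;

    // Sketch of the latitude a TM has with start()/end() on one resource.
    class StartEndSketch {
        void drive(XAResource xaRes, Xid xid) throws Exception {
            xaRes.start(xid, XAResource.TMNOFLAGS);  // associate the branch with this work
            // ... SQL/JMS work on the underlying connection ...
            xaRes.end(xid, XAResource.TMSUCCESS);    // e.g. when close() is called on the connection

            xaRes.start(xid, XAResource.TMJOIN);     // re-associate the same branch later
            // ... more work in the same global transaction ...
            xaRes.end(xid, XAResource.TMSUCCESS);    // must be ended before prepare()

            int vote = xaRes.prepare(xid);           // phase 1
            if (vote == XAResource.XA_OK) {
                xaRes.commit(xid, false);            // phase 2
            }
        }
    }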

    \Sanjaya Ganesh\
    (2) 2pc-10 you say that WebSphere writes prepare results only if the result is a commit. Assume XAResource-A voted rollback, XAResource-B voted commit. Assume this is not recorded, because you say <quote>If the first resource voted rollback, then this action is never recorded.</quote>. And before issuing commit the WebSphere txn manager crashes. Will XAResource-B commit or rollback when WebSphere comes back up?? How will you know the fate of that txn, as prepare() results are not recorded (as you said) in this scenario? I deduce this is because prepare() can be called N number of times. So the txn manager, when it comes back up, can call the XA Resource Managers again to ask the fate of the txn. In that case, why "disk force" prepare() at all - even if the result is commit() or rollback() - because even if the txn manager crashes, when it comes back it can call prepare() again to get the results?? Very confusing :-))
    \Sanjaya Ganesh\

    I was going to get into these details in Part II, but I suppose I can spoil it all and answer here :-)

    If a TM forces only "Committing..." records, and it crashes with transactions in-flight, then this means transactions within the XAResources can be in one of these states:

       1) XAResource has transaction(s) in flight, pre prepare() call.
       2) XAResource has transaction(s) in flight, voted yes to prepare()
       3) XAResource had commit called
       4) No in flight trans, no problem ;-)
     
    State #1 we ignore - if we haven't called prepare() on a Resource and crash, by definition we assume rollback. Now we just have to make sure everyone is called back. For state #1 itself, if a given resource hasn't been prepare()'d, we let it die within the resource. This happens within the XAResource either by detecting that its connection from the TM was dropped, in which case rollback happens, or by noting that no activity has happened to a transaction for X seconds, and thereby timing out the transaction (timeouts are _very_ important in XA if you don't want resources hanging during failure scenarios).

    For state #2, this could be a global transaction which we should commit, or it could be one we should rollback (for instance, if another resource is in state #1 above). See below on how to resolve this.

    For state #3, the resource won't report this back to the TM, so we won't see it in the TM at recovery time :-) See below for this.

    State #4 is of course the ideal state if a failure occurs, no transactions were in flight and there's nothing to do.

    The above states per XAResource are interpreted by checking "Committing..." records in the TM vs. what the XAResources report, and figuring out the correct resolution of the transaction. In practice, when a TM goes down "abby-normal", it scans its logs for transactions marked "Committing..." but not "Done". It then reconstitutes its XAResources from info in the logs, and calls recover() on each. Each recover call returns an array of Xids (Xid[]). Each entry in this array, if any, is by definition in the prepare() state (has been prepared), but not committed. This is called an in-doubt transaction. The TM gathers up these in-doubt Xids, tries to match them against the global transaction IDs in the log in "Committing..." state, and uses this algorithm:

      1) If there's a "Committing..." record in the TM logs w/ no "Done", we _must_ ensure this transaction commits in all resources. We check all resources recover() responses, and if there are any Xid's in there that match these committing records, we call commit() on them. Then we write "Done" in the TM log for each of those transactions.

      2) If there are no Xids from recover() responses which match a "Committing..." record, that's OK - that just means that everything was committed but we didn't get a chance to write a "Done" record. So we write "Done" in the TM logs for these.

      3) If we find Xids left over from the above, where some resources responded to recover() with Xid's that don't match a dangling "Committing..." record, we roll 'em back. It doesn't matter if all the resources voted "yes" to prepare and are just aching to commit, and the TM just didn't get a chance to write "Committing". We just roll these suckers back.

    The key is that a "Committing" record lets the TM distinguish between in-doubt transactions in the XAResources which should be committed vs. those that should be rolled back.

    As an aside, "Done" records don't need to be disk forced. The above algorithm makes it so you don't have to.
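
    To make that concrete, here is a hedged sketch of the recovery pass in plain JTA terms. The log abstraction and the ownership check are hypothetical helpers, not any particular TM's code, and (as comes up later in the thread) a real TM should only consider Xids it originated.

    import javax.transaction.xa.XAResource;
    import javax.transaction.xa.Xid;
    import java.util.*;

    // Recovery sketch: commit in-doubt branches that match a "Committing..." record,
    // roll back the rest, then lazily mark the committing records "Done".
    abstract class RecoverySketch {
        abstract Set<byte[]> committingWithoutDone();   // gtrids logged "Committing..." but not "Done"
        abstract boolean isOurs(Xid xid);               // did this TM create the Xid?
        abstract void writeDone(byte[] gtrid);          // lazy, non-forced "Done" record

        void recover(List<XAResource> resources) throws Exception {
            Set<byte[]> committing = committingWithoutDone();
            for (XAResource res : resources) {
                Xid[] inDoubt = res.recover(XAResource.TMSTARTRSCAN | XAResource.TMENDRSCAN);
                for (Xid xid : inDoubt) {
                    if (!isOurs(xid)) continue;                               // someone else's branch
                    if (contains(committing, xid.getGlobalTransactionId())) {
                        res.commit(xid, false);                               // rule 1: must commit
                    } else {
                        res.rollback(xid);                                    // rule 3: presumed abort
                    }
                }
            }
            for (byte[] gtrid : committing) writeDone(gtrid);                 // rules 1 & 2: mark resolved
        }

        private boolean contains(Set<byte[]> set, byte[] gtrid) {
            for (byte[] b : set) if (Arrays.equals(b, gtrid)) return true;
            return false;
        }
    }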

    The XAResources _must_ force at prepare() time because the XAResources don't know jack about the _global_ transaction state. Even if it votes yes, the tran could still be rolled back. Or it could be committed. But the point is that the XAResource can't know, only the TM can know.

    In any case, the piece you may have been missing was the recover() method in XAResource. And now you've forced me to write at least a quarter of my Part II article as a TSS response!! How will my blog ever get hits and achieve world-wide fame and acclamation now?!?!

       -Mike
  13. Some gotchas????[ Go to top ]

    Mike:

    Thanks and sorry :-) I did not intend to reduce hits on your weblog. 2 more "short" questions - probably having "short" answers :)

    (1) I hope you will cover scenarios involving Heuristic decisions in Part II / Part III :). That would be a killer, I would say. I know that there are cases where one would need to "manually" check in-doubt txns and take decisions. I always felt most heuristic decisions will plant chaos if you look at it from a "distributed data integrity" perspective. What do you think?

    (2) You mentioned that the TM (after coming back up from a crash) will use its TLog to get the list of XAResources that need to be called with a recover(). I think you mean this is actually the "Resource Enlistment" log.

    Thanks and if you would, please let me know an approx. time by which you plan to get Part II of this article out?

    -Sanjay
  14. Recovery procedure[ Go to top ]

    Thank you for your papers on Transaction managers and XA. They are proving to be a very good resource.

    In the interest of correctness of detail I would suggest one change to the paragraph below:

    > 3) If we find Xids left over from the above, where some resources responded to recover() with Xid's that don't match a dangling "Committing..." record, we roll 'em back. It doesn't matter if all the resources voted "yes" to prepare and are just aching to commit, and the TM just didn't get a chance to write "Committing". We just roll these suckers back.

    The Xid must have originated from this transaction manager. The XAResource may have hanging transactions from several servers outstanding at any given time.

    Along that line, as we implement our recovery in JOTM, I was considering keeping track of which Xid's in our system were "ours"; i.e., our instance of JOTM is the transaction coordinator for the Xid. If we have to respond to a RECOVER request, I see no reason to supply Xid's which we originated.

    If XA ever includes the "last agent" optimization of XCP2, then we would also have to respond with the Xid's for transactions for which we have sent the "last agent" flag.

    Do you agree with this design decision?
  15. Recovery procedure[ Go to top ]

    \David Egolf\
    Thank you for your papers on Transaction managers and XA. They are proving to be a very good resource.
    \David Egolf\

    Great!

    \David Egolf\
    The Xid must have originated from this transaction manager. The XARresource may have hanging transactions from several servers outstanding at any given time.
    \David Egolf\

    Yes, this oversight from the post here was pointed out by a few people, and the articles have this change in them (part II and III, IIRC).

    \David Egolf\
    Along that line, as we implement our recovery in JOTM, I was considering keeping track of which Xid's in our system were "ours"; i.e., our instance of JOTM is the transaction coordinator for the Xid. If we have to respond to a RECOVER request, I see no reason to supply Xid's which we originated.
    \David Egolf\

    I'm not entirely following you. TMs definitely have to track their own Xid's, generally by encoding them in a unique manner - use formatID to identify the "vendor", and encode some sort of server ID into the global tran ID and branch qualifier. What I don't understand is the "response to a RECOVER request" comment. Can you expand upon that?

    \David Egolf\
    If XA ever includes the "last agent" optimization of XCP2, then we would also have to respond with the Xid's for transactions for which we have sent the "last agent" flag.
    \David Egolf\

    I'm not familiar with XCP2....

        -Mike
  16. Recovery procedure[ Go to top ]

    /Mike Spille/
    What I don't understand is the "response to a RECOVER request" comment. Can you expand upon that?
    /Mike Spille/

    As soon as we implement J2EE Connector Architecture 1.5 we will be presented with inbound transactions which include Xid's created by other TMs. These other TMs are the coordinators of these transactions. We are behaving as a subordinate/intermediate node in the global transaction tree.

    If there is a failure, then recovery will be negotiated between the TMs. We expect the other TM to send us a RECOVER request. At that point, I see no reason to respond with Xid's which we know that we created locally. If we respond with all the externally generated Xid's then we know that this TM's Xid's must be on that list.

    In a sense, it's a shame that XA does not call for a list of input Xid's on the RECOVER interface so that bookkeeping can be kept to a minimum.
  17. Recovery procedure[ Go to top ]

    \David Egolf\
    As soon as we implement J2EE Connector Architecture 1.5 we will be presented with inbound transactions which include Xid's created by other TMs. These other TMs are the coordinators of these transactions. We are behaving as a subordinate/intermediate node in the global transaction tree.
    \David Egolf\

    Ah, the JCA inbound transaction thing. Now I understand.

    \David Egolf\
    If there is a failure, then recovery will be negotiated between the TMs. We expect the other TM to send us a RECOVER request. At that point, I see no reason to respond with Xid's which we know that we created locally. If we respond with all the externally generated Xid's then we know that this TM's Xid's must be on that list.
    \David Egolf\

    Yes, that makes perfect sense. In that case, you're effectively reversing the existing JTA XA approach. Normally, you filter the list from XAResources to include only your own Xids when you're recovering for yourself. In the case where transactions are injected by another party via JCA, then when you see recover called on you, and you call your XAResources in turn, in that case you'd want to filter the list to include all Xids which are _not_ your own.
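
    For illustration only, a sketch of both filtering directions. The formatId value and the node-ID prefix are made-up conventions for a hypothetical TM, since the spec leaves the encoding entirely up to the implementation.

    import javax.transaction.xa.Xid;
    import java.util.*;

    // Decide ownership of recovered Xids by the formatId plus a node-ID prefix
    // this hypothetical TM stamps onto every gtrid it creates.
    class XidFilter {
        static final int OUR_FORMAT_ID = 0xC0DE;          // made-up vendor format ID
        static final byte[] OUR_NODE_ID = {0x01, 0x02};   // made-up server/instance ID

        static boolean isOurs(Xid xid) {
            byte[] gtrid = xid.getGlobalTransactionId();
            return xid.getFormatId() == OUR_FORMAT_ID
                    && gtrid.length >= OUR_NODE_ID.length
                    && Arrays.equals(Arrays.copyOf(gtrid, OUR_NODE_ID.length), OUR_NODE_ID);
        }

        // Recovering for ourselves: keep only the Xids we coordinated.
        static List<Xid> forSelfRecovery(Xid[] inDoubt) {
            List<Xid> out = new ArrayList<>();
            for (Xid x : inDoubt) if (isOurs(x)) out.add(x);
            return out;
        }

        // Answering an inbound (JCA-style) recover request: report only foreign Xids.
        static List<Xid> forInboundRecover(Xid[] inDoubt) {
            List<Xid> out = new ArrayList<>();
            for (Xid x : inDoubt) if (!isOurs(x)) out.add(x);
            return out;
        }
    }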

    \David Egolf\
    In a sense, it's a shame that XA does not call for a list of input Xid's on the RECOVER interface so that bookkeeping can be kept to a minimum.
    \David Egolf\

    I don't see how that would work. Given presumed abort, and the fact therefore that TMs are only going to durably record transactions that must commit, at recovery time the TM might not know all of the transactions which are in-flight that it was responsible for. So the TM could tell XAResources what to commit, but that would leave transactions which should be rolled back potentially dangling forever in the prepared state.

    Of course, you could make it work if Xid was expanded to be a bit more than opaque byte arrays. If an explicit "transaction manager ID" field was supported in Xid, then you could use the interface you mention. In that case, resources would commit the transactions on the list, and then find in-doubt Xid's of its own which are not on the list via the "transaction manager ID" and roll them back.

    But, unfortunately, in their zeal to make JTA XA look as close as possible to X/Open XA (and thus make it easier to interoperate), the JTA guys left us with nothing but two opaque byte arrays and a mostly useless format ID.

         -Mike
  18. XA is from the X/Open group, but what does XA stand for? Is it an abbreviation?
  19. Distributed JMS transactions[ Go to top ]

    To add to the previous post, I know that such a thing would appear non-sensical since the very nature of messaging implies asynchrony and transactions in such an environment do not make sense (a message could be delivered days later). However, almost all the instances I've seen messaging being used have been in near-synchronous environments where a response is expected quickly (within a minute or two, usually within a few seconds). In such an instance, it would be nice to know if the receiver failed in its task. The receiver and sender each do their local JMS txns but it would be nice to have them participate in a global txn...perhaps send the Xid in the message?

    JMS isn't suited for this because it intentionally *decouples* producers and consumers of messages. That is, the JMS system acts as the intermediary between them and ensures guaranteed delivery, even if it occurs later.

    JMS producers and consumers of the same message(s) can't be part of the same transaction.

    XA is no more and no less than a protocol to perform a distributed transaction across more than one data source, e.g. two different databases, or a messaging system and a database, or a database and an XA-capable EIS, CICS, whatever. Basically it is a protocol between a transaction manager and its involved data sources.
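
    For anyone new to the API, a minimal sketch of what that looks like at the JTA level. Where the TransactionManager and the two XAResources come from is container/driver specific, and in a real app server the enlistment normally happens for you behind getConnection()/createSession().

    import javax.transaction.Transaction;
    import javax.transaction.TransactionManager;
    import javax.transaction.xa.XAResource;

    // One global transaction spanning two XA-capable resources.
    class TwoResourceSketch {
        void doWork(TransactionManager tm, XAResource dbResource, XAResource jmsResource)
                throws Exception {
            tm.begin();
            Transaction tx = tm.getTransaction();
            tx.enlistResource(dbResource);   // TM calls start() on this branch
            tx.enlistResource(jmsResource);
            // ... JDBC update on one connection, JMS publish on the other ...
            tm.commit();                     // TM drives end(), prepare(), then commit() on both
        }
    }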

    -- Andreas
  20. Distributed JMS transactions[ Go to top ]

    I agree, in the request-response scenario. Some other application needs to process the message. If the message is never committed onto the destination, how can the other end process the message in order to return the correct reply? This scenario is described in my article here: http://www.theserverside.com/resources/article.jsp?l=JMSArchitecture

    Workflow systems based on asynchronous architectures use compensating transactions, as stated in an earlier post.
  21. XA Exposed: How 2 Phase Commit works in J2EE[ Go to top ]

    "a way that adheres to the ACID tenents."

    Do you mean tenets? I'm thinking of the loud house-music listening tenants who used to live next door to me in a duplex

    Very good article - looking forward to subsequent parts!
  22. XA Exposed: How 2 Phase Commit works in J2EE[ Go to top ]

    Damn, word juxtapositions and homonyms and the like are going to be the death of me.....there are a handful of words and phrases that I seem to have a mental block about and _always_ get wrong, and "tenets" is one of them (along with similiar, seperately, and some others).

    Sorry if grammatical errors are making the article hard to read, although on the plus side they all seem to, well - juxtapose interestingly with the rest of the text. I'll try to adjust my tinfoil hat and see if I can block those CIA mind control emissions a little better for the next article.

        -Mike
  23. "Disk forcing" occurs at resource managers that are tx aware anyway. Are you sure that one more "disk forcing" at TM is such a performance killer?

    One could use a TM that doesn't do tx log or automatic recovery (JBOSS or JOTM). Manually resolving in doubt tx over resource managers could be done with no hassle considering real scenario crashes.

    Regards,
    Horia
  24. "Disk forcing" occurs at resource managers that are tx aware anyway. Are you sure that one more "disk forcing" at TM is such a performance killer?

    >

    It depends on the size of the log, which depends on lots of factors such as what's saved in the log (before/after images?, do/undo operations? participant references? etc.) and the size of the stable-storage block.

    There's also the effect of whether presumed-abort/presumed-nothing/presumed-commit is used by the transaction service to consider.

    > One could use a TM that doesn't do tx log or automatic recovery (JBOSS or JOTM). Manually resolving in doubt tx over resource managers could be done with no hassle considering real scenario crashes.
    >

    If the TM doesn't do logging then it can't be said to be conforming to the ACID properties. The Atomicity part of that means that you get all or nothing semantics irrespective of failures (ignoring catastrophic disk failures, but you could always replicate to improve failure resiliency anyway). If you don't log, then a "manual recovery" can't tell what to do: logging tells you (whether "you" is the transaction manager or a sys. admin) which way to jump when you do recovery. It also helps you to determine which participants were in the transaction so that you can contact them in the first place.

    If you think about it, without *something* keeping information on the status of the transaction (whether it will commit or rollback) and who's in the transaction, there's very little you can do for recovery, especially in a distributed system. You could argue that some of this information may be maintained at the application level, but it still needs to be force-written to disk, and once you start getting into splitting this information across multiple writes, it gets to be a performance killer.

    Many companies, IBM, Transarc, IONA, HP to name a few, have spent/continue to spend a lot of effort on optimising the log service. Back in the early 90's for example, Transarc (which produced Encina/Encina++) had almost 50% of their developers working on the log service.
  25. It depends on the size of the log, which depends on lots of factors such as what's saved in the log (before/after images?, do/undo operations? participant references? etc.) and the size of the stable-storage block.

    >
    > There's also the effect of whether presumed-abort/presumed-nothing/presumed-commit is used by the transaction service to consider.

    Yes, but what performance penalty is introduced by XA? Log performance at resource manager level is an issue that exists anyway.

    > If the TM doesn't do logging then it can't be said to be conforming to the ACID properties. The Atomicity part of that means that you get all or nothing semantics irrespective of failures (ignoring catastrophic disk failures, but you could always replicate to improve failure resiliency anyway). If you don't log, then a "manual recovery" can't tell what to do: logging tells you (whether "you" is the transaction manager or a sys. admin) which way to jump when you do recovery. It also helps you to determine which participants were in the transaction so that you can contact them in the first place.

    Ok, but this can raise another issue: what to do in "cannot commit / cannot rollback / heuristic rollback" scenarios? Stop the system? Initiate an automatic online/offline recovery? Recovery with compensation tx?

    > If you think about it, without *something* keeping information on the status of the transaction (whether it will commit or rollback) and who's in the transaction, there's very little you can do for recovery, especially in a distributed system. You could argue that some of this information may be maintained at the application level, but it still needs to be force-written to disk, and once you start getting into splitting this information across multiple writes, it gets to be a performance killer.

    Agree. Manual "blind" recovery can't be done. You need "some" info about the crashed tx. I forgot that in my case I have this info at application level.

    I've never said that XA is faster than non-XA tx. Can anybody provide any numbers about the performance penalty XA introduces? IMHO, only then can we tag XA as being a performance KILLER.

    Regards,
    Horia
  26. It depends on the size of the log, which depends on lots of factors such as what's saved in the log (before/after images?, do/undo operations? participant references? etc.) and the size of the stable-storage block.

    > >
    > > There's also the effect of whether presumed-abort/presumed-nothing/presumed-commit is used by the transaction service to consider.
    >
    > Yes, but what performance penalty is introduced by XA? Log performance at resource manager level is an issue that exists anyway.
    >

    In general it's wrong to tag this as an XA problem. The principles behind XA (basic 2PC transactions) have existed for a long time and were used in products before XA came onto the scene. What I've been talking about is fundamental to the 2PC transaction model, of which XA is one possible implementation.

    You're right that logging at the RM level exists anyway, whether you're using XA or some other standard/proprietary implementation of transactions. Just the same as the logging at the coordinator isn't specific to XA.

    > > If the TM doesn't do logging then it can't be said to be conforming to the ACID properties. The Atomicity part of that means that you get all or nothing semantics irrespective of failures (ignoring catastrophic disk failures, but you could always replicate to improve failure resiliency anyway). If you don't log, then a "manual recovery" can't tell what to do: logging tells you (whether "you" is the transaction manager or a sys. admin) which way to jump when you do recovery. It also helps you to determine which participants were in the transaction so that you can contact them in the first place.
    >
    > Ok, but this can raise another issue: what to do in "cannot commit / cannot rollback / heuristic rollback" scenarios? Stop the system? Initiate an automatic online/offline recovery? Recovery with compensation tx?
    >

    In the original strict 2PC model there aren't heuristics and you've got to do what you promised. So, for example, if a participant said it could commit during prepare, it has to commit when told to later by the coordinator. However, 2PC is a blocking protocol, which could result in participants remaining in an indeterminate state for long periods of time if the coordinator crashes. Hence heuristics were introduced to implementations (and ultimately specifications) to try to break that blocking situation in a controlled manner.

    Unfortunately there is no hard and fast rule as to what to do in a heuristic situation (and not being able to commit when told to is a heuristic). So you'll not find any specification that actually says "this is what you do to fix it". Some standards/implementations are fairly poor on heuristic handling, whereas others try to log as much information as possible to help. But the end result is that someone somewhere (usually a sys. admin) has to resolve the heuristic outside of the system, simply because doing this resolution typically requires intimate semantic/syntactic knowledge of what the application was doing and hence what it actually means to undo the heuristic. In some circumstances it may be better to do nothing.

    There are several large multi-national companies that employ tens/hundreds of people worldwide whose only role is to resolve heuristics as quickly as possible.
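
    For reference, this is roughly what a heuristic outcome looks like at the XAResource API level. The operator-notification part is deliberately a stub, since, as noted above, no specification tells you how to repair the damage.

    import javax.transaction.xa.XAException;
    import javax.transaction.xa.XAResource;
    import javax.transaction.xa.Xid;

    // Phase-two commit meeting a heuristic.
    class HeuristicSketch {
        void commitBranch(XAResource res, Xid xid) throws XAException {
            try {
                res.commit(xid, false);
            } catch (XAException e) {
                switch (e.errorCode) {
                    case XAException.XA_HEURCOM:   // RM committed on its own - harmless on the commit path
                    case XAException.XA_HEURRB:    // RM rolled back after voting yes
                    case XAException.XA_HEURMIX:   // partly committed, partly rolled back
                    case XAException.XA_HEURHAZ:   // outcome unknown
                        reportHeuristic(xid, e.errorCode);  // log/alert for manual resolution
                        res.forget(xid);                    // let the RM discard its heuristic state
                        break;
                    default:
                        throw e;                            // transient failure: retried/recovered elsewhere
                }
            }
        }

        void reportHeuristic(Xid xid, int code) { /* operator notification omitted */ }
    }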

    > > If you think about it, without *something* keeping information on the status of the transaction (whether it will commit or rollback) and who's in the transaction, there's very little you can do for recovery, especially in a distributed system. You could argue that some of this information may be maintained at the application level, but it still needs to be force-written to disk, and once you start getting into splitting this information across multiple writes, it gets to be a performance killer.
    >
    > Agree. Manual "blind" recovery can't be done. You need "some" info about the crashed tx. I forgot that in my case I have this info at application level.
    >
    > I've never said that XA is faster than non-XA tx. Can anybody provide any numbers about the performance penalty XA introduces? IMHO, only then can we tag XA as being a performance KILLER.
    >

    As I said above, I think the issues that have been discussed are not strictly XA-specific. From everything I've seen so far, it's applicable to core TP technology. Unfortunately it is true that there's no such thing as a free lunch. If you want reliability in the face of failures, concurrent access, multiple datasources etc. then you have to do the sorts of things that transaction systems have been doing (and doing well) for decades. These TP implementations form the backbone of current enterprise systems and that isn't going to change.
  27. Horia: I've never said that XA is faster than non-XA tx. Can anybody provide any numbers about the performance penalty XA introduces? IMHO, only then can we tag XA as being a performance KILLER.

    Yes. I would say, from my experience, that for short-duration transactions (i.e. little tiny updates etc.) XA slows things down by about an order of magnitude. So if your system processes 1000 tx/sec right now, add XA and that number will be around 100 tx/sec.

    I've never seen a case where it didn't adversely impact performance in a big way.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  28. Horia: I've never said that XA is faster than non-XA tx. Can anybody provide any numbers about the performance penalty XA introduces? IMHO, only then can we tag XA as being a performance KILLER.

    >
    > Yes. I would say, from my experience, that for short-duration transactions (i.e. little tiny updates etc.) XA slows things down by about an order of magnitude. So if your system processes 1000 tx/sec right now, add XA and that number will be around 100 tx/sec.
    >
    > I've never seen a case where it didn't adversely impact performance in a big way.
    >

    When you quote 1000 tx/sec, what engine was doing the transactions before XA got in the loop? What was the protocol/standard/implementation, for example? Was it really doing all of the necessary transaction work?
  29. Mark Little: When you quote 1000 tx/sec, what engine was doing the transactions before XA got in the loop? What was the protocol/standard/implementation, for example? Was it really doing all of the necessary transaction work?

    When there is only one resource, that resource can shoulder the transactional requirements, and the application server need only indicate boundaries, and may even do so using single-PC processing (i.e. begin followed by either commit or rollback.)

    In other words, the application server doesn't need an "engine" if it's not using XA -- all it needs is a gas pedal (commit) and a brake (rollback). Sorry for the crappy analogy ;-)

    When more than one resource is employed within a transaction, and the resources need to have their states coordinated, then XA can be used. Now the application server does need its own "engine" to use your terms, and the engine sucks a lot of gas and responds very slowly (if it is recoverable) because it has to ensure that certain steps in the protocol are logged in a recoverable manner (i.e. forced to disk.)

    If the prepares and commits are both forced, then the overhead of 2PC XA to single-PC non-XA is at least 4:1, and in some implementations is 6:1 (intent, resource log, result, command, resource log, result.) In both cases, I'm assuming that the resource will always force its log, just to be fair (b/c if it doesn't force its log, then all other recoverability bets are off.) And those ratios are for the efficient implementations that batch and parallelize their resource operations -- it could be much worse for 2PC XA if resources are not invoked in parallel and if logs are not segment-batched. Add the additional network traffic and associated latencies, and it's quite easy to get close to a 10:1 performance difference. However, I think that a good chunk of the latency is just due to poorly written resource drivers and even more poorly tuned XA implementations.

    I'm not an expert on XA, but I did author an entire OTS/XA+ TM with single- and 2PC commit options (including full nested transaction support as provided for by OTS) and optimizations for resource ordering particularly for expected read-only resources etc.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  30. <Cameron Purdy>
    If the prepares and commits are both forced, then the overhead of 2PC XA to single-PC non-XA is at least 4:1, and in some implementations is 6:1 (intent, resource log, result, command, resource log, result.) In both cases, I'm assuming that the resource will always force its log, just to be fair (b/c if it doesn't force its log, then all other recoverability bets are off.) And those ratios are for the efficient implementations that batch and parallelize their resource operations -- it could be much worse for 2PC XA if resources are not invoked in parallel and if logs are not segment-batched. Add the additional network traffic and associated latencies, and it's quite easy to get close to a 10:1 performance difference.
    </Cameron Purdy>

    If we compare 2PC XA with 2 resources against 1PC non-XA with one resource, of course the ratio is big. But let's compare 2PC XA with 2 resources against 1PC non-XA with 2 resources. Here the network latencies are out of the picture 'cause the delay affects both scenarios. And let's say we have a good TM that can parallelize RM operations. In this case the ratio can be brought to 3:1 (1 forced prepare + TM forced record + forced commit : 1 forced commit).

    <Cameron Purdy>
    However, I think that a good chunk of the latency is just due to poorly written resource drivers and even more poorly tuned XA implementations.
    </Cameron Purdy>

    Good thing, isn't it? :)

    Regards,
    Horia
  31. Horia,

    I didn't realize what you were asking. Yes, if you have multiple RMs in a transaction then I think you should use XA, except for very rare and special circumstances. I would suggest that, other than bad implementations, there is almost zero overhead to XA beyond the absolute minimum of what is required to coordinate multiple RMs in a recoverable manner. In other words, there's nothing more efficient when you have multiple black box RMs than the architecture provided by XA.

    The thing to avoid is multiple RMs in a tx. If you have to have them, then use XA. If you can find ways to avoid them (obviously without losing consistency of data) then you want to avoid them.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  32. What is the Bottleneck in XA?[ Go to top ]

    [..] the engine sucks a lot of gas and responds very slowly (if it is recoverable) because it has to ensure that certain steps in the protocol are logged in a recoverable manner (i.e. forced to disk.)


    I would like to understand if the bulk of the overhead is from the disk access itself (the actual disk write) or just the fact that the resource manager has to wait for the confirmation, and therefore must block, thus adding a context switch. Does the loss of performance come from the context switches, or from the disk write?
  33. What is the Bottleneck in XA?[ Go to top ]

    I would like to understand if the bulk of the overhead is from the disk access itself (the actual disk write) or just the fact that the resource manager has to wait for the confirmation, and therefore must block, thus adding a context switch. Does the loss of performance come from the context switches, or from the disk write?

    Context switches can be as low as hundreds or low thousands of CPU cycles. On a 1GHz CPU that is a small fraction of a millisecond (perhaps even less than one microsecond.)

    Disk forces, when well optimized, can be brought down below 10ms. So a single disk force is probably more expensive than 1000 context switches.
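
    For anyone curious about the raw number on their own hardware, a rough, self-contained measurement sketch (file name, record size and iteration count are arbitrary) looks something like this:

    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // How long does a forced log write take on this box? Real numbers depend
    // heavily on the disk and any write cache in front of it.
    public class ForceTimer {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile("tx.log", "rw")) {
                FileChannel log = raf.getChannel();
                ByteBuffer record = ByteBuffer.allocate(256);   // a typical small log record
                int iterations = 100;
                long start = System.nanoTime();
                for (int i = 0; i < iterations; i++) {
                    record.clear();
                    log.write(record);
                    log.force(false);                           // the disk force is the expensive part
                }
                long perForceMicros = (System.nanoTime() - start) / iterations / 1000;
                System.out.println("avg forced write: " + perForceMicros + " us");
            }
        }
    }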

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  34. What is the Bottleneck in XA?[ Go to top ]

    Cameron, what you're saying matches my experience as well.

    In my own XA work, my transaction logging is pretty well optimized. It uses all NIO and dual fixed-size logs, and a typical transaction only takes up a few hundred bytes of disk space. The logging code itself is all hand-tuned and hand written to get bytes out fast - no serialization or other automation. Forces are batched together in a tunable manner to amortize the disk force cost over as many concurrent transactions as possible. The tunable bit allows me to control how many transactions can get amortized at once and optional delays, which lets me throttle access to the disk so that I don't have one process dominate it too much.

    With that kind of code, non-forcing I/O has shown to be inconsequential - it's well under a millisecond. But the forces, well, they carry a lot of force. I've tried out my system on a variety of disks, from garden variety disks that come with workstations all the way up to massive EMC RAID arrays with mega (actually, giga :-) RAM caches. On the worst disks, I see forced I/O times around 30-50 milliseconds. On the massive and hideously expensive (but worth every penny!) EMC arrays, I see disk force times between 5 and 10 milliseconds. Variations happen on the EMC side because these mothers are big enough that they have to be shared among systems and I can never test in 100% isolation.

    With really small message payloads and reasonable concurrent publishing (say, 20 publishers going full out) under the best conditions I can concoct, a single server can get about 250 transactions/second. CPU utilization in these tests stays under 50%, while disk I/O's show around 175 IO ops/sec. As you can see, the EMC I/O is fast enough that it's hard to pull much concurrency out of it - it's tough to find multiple transactions to share a disk commit, but I might be able to get better throughput if I look at many tests carefully.

    I also have a magic hidden option, which I refuse to divulge to outsiders in detail, which lets me turn off disk forcing. This is used mainly to take the disk out of the equation when we need to for certain types of test, and also to debug some capacity sharing issues we've had with the EMC getting hit by other systems and cache thrashing a bit. Anyway, turn off disk forcing and throughput jumps up to around 800 trans/second - with most of this time being damnable serialization on both ends of the network sucking up almost all of the CPU. Get rid of serialization, and you can easily get at least 1,500 trans/second without trying too hard. Getting rid of serialization while still disk forcing drops CPU utilization from my previous "under 50%" to practically non-existent, but throughput stays pegged around 250 trans/second.

    Note my JMS stuff also makes fairly aggressive protocol optimizations whenever it can. For example, when XA is switched on for a session, _all_ JMS client calls become fire and forget except for prepare(), commit(), and rollback(). This means I pay no penalty in the client (e.g. the app server) for start(), end(), and regular JMS activities. The downside is any errors occurring on the server side get deferred until prepare() time, which sucks a little for app developers trying to debug problems.
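
    Not Mike's actual code, but a simplified sketch of the batching idea he describes: committers append their records under one lock and force under another, so a single force() covers everyone who queued up while the previous force was in progress.

    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // Group-commit sketch: the one expensive disk sync is amortized over all
    // records appended before it.
    class GroupCommitLog {
        private final FileChannel channel;
        private final Object writeLock = new Object();
        private final Object forceLock = new Object();
        private volatile long writtenUpTo = 0;   // bytes appended so far
        private volatile long forcedUpTo = 0;    // bytes known to be on disk

        GroupCommitLog(FileChannel channel) { this.channel = channel; }

        void logAndForce(ByteBuffer record) throws Exception {
            long myEnd;
            synchronized (writeLock) {
                writtenUpTo += channel.write(record);   // append: cheap, not yet durable
                myEnd = writtenUpTo;
            }
            synchronized (forceLock) {
                if (forcedUpTo >= myEnd) {
                    return;                             // an earlier caller's force already covered us
                }
                long target = writtenUpTo;              // everything appended so far rides along
                channel.force(false);
                forcedUpTo = target;
            }
        }
    }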

         -Mike
  35. What is the Bottleneck in XA?[ Go to top ]

    I also have a magic hidden option, which I refuse to divulge to outsiders in detail, which lets me turn off disk forcing. This is used mainly to take the disk out of the equation when we need to for certain types of test, and also to debug some capacity sharing issues we've had with the EMC getting hit by other systems and cache thrashing a bit. Anyway, turn off disk forcing and throughput jumps up to around 800 trans/second - with most of this time being damnable serialization on both ends of the network sucking up almost all of the CPU. Get rid of serialization, and you can easily get at least 1,500 trans/second


    If you had a "grace period" that allowed you to actually defer the disk writes and still have durability, do you think you could still sustain 1,500/s?
  36. What is the Bottleneck in XA?[ Go to top ]

    Disk forces, when well optimized, can be brought down below 10ms. So a single disk force is probably more expensive than 1000 context switches.


    That's what I wanted to hear :) So, let's see, how would you achieve durability without using a disk ...
  37. What is the Bottleneck in XA?[ Go to top ]

    Guglielmo: That's what I wanted to hear :) So, let's see, how would you achieve durability without using a disk ...

    There's two systems I know of that accomplish it. One is the Z Series from IBM, which (unlike most parts of J2EE) apparently does XA (basically, all TM) very quickly, and with however many "nines" it provides. The other system is Coherence, which incidentally can run on the Z Series, or on a platform that is three orders of magnitude cheaper per MIP ;-)

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  38. What is the Bottleneck in XA?[ Go to top ]

    There's two systems I know of that accomplish it. One is the Z Series from IBM, which (unlike most parts of J2EE) apparently does XA (basically, all TM) very quickly, and with however many "nines" it provides. The other system is Coherence, which incidentally can run on the Z Series, or on a platform that is three orders of magnitude cheaper per MIP ;-)


    How does the Z accomplish it?

    And I assume Coherence accomplishes it by keeping replicas?
  39. What is the Bottleneck in XA?[ Go to top ]

    Guglielmo: How does the Z accomplish it?

    Everything in the Z Series is redundant. Even the acronyms are redundant.

    Seriously, everything gets processed two times in parallel, and at various stages the results are compared etc. to make sure that everything is perfect. I once read a figure that the mean time between any logical bit inside the whole system being set incorrectly is something like 400 years.

    Further, the Z Series has an IO bus wider than the highest speed PC memory bus (you know, dual or quad channel DDR or RDRAM running at insane Mhz etc. ... all have less throughput than the IO bus of the high-end Z Series.)

    Anyway, it does the transaction logging to some very reliable part of the LPAR, if I understand correctly. Billy Newport explained it to me recently .. but I forget the names/acronyms. Anyway, combine the ability to avoid disk forcing with the IO bandwidth of a 1000-lane highway and you get some impressive results.

    Guglielmo: And I assume Coherence accomplishes it by keeping replicas?

    Yes, using some configurable level of redundancy (not typically a full replica across the whole cluster, as that would kill scalability.) The durability of the operations then becomes the durability of the cluster itself.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  40. What is the Bottleneck in XA?[ Go to top ]

    Further, the Z Series has an IO bus wider than the highest speed PC memory bus (you know, dual or quad channel DDR or RDRAM running at insane Mhz etc. ... all have less throughput than the IO bus of the high-end Z Series.)


    It's probably like infiniband. I was just looking at Voltaire's implementation: 160Gbps, and 200ns latency. It's about $2,000 per host. I wonder if someone has created a "CF" for linux.

    >> Guglielmo: And I assume Coherence accomplishes it by keeping replicas?
    > Yes, using some configurable level of redundancy (not typically a full replica across the whole cluster, as that would kill scalability.) The durability of the operations then becomes the durability of the cluster itself.

    Two things:

    1) Tell me a little more about your implementation, so I understand what's really going on when the scalability is being "killed" :)

    2) Even if scalability were not an issue, *availability* changes with the number of replicas. I have a book at home which gives a reasonable formula for it, based on the mean-time-between-failures and the mean-time-to-recovery, I think. Basically the deal is that beyond N machines, availability actually decreases.
  41. What is the Bottleneck in XA?[ Go to top ]

    Guglielmo: It's probably like infiniband. I was just looking at Voltaire's implementation: 160Gbps, and 200ns latency.

    I don't think so, to be honest. I mean, the numbers may sound similar, but the mainframe IO buses are extremely wide and highly parallelized. Then again, it's been 15 years since I've done any real work on mainframes, and back then the 390 didn't support more than 16MB of physical RAM (and I had 32MB in my 486 :))

    Guglielmo: I wonder if someone has created a "CF" for linux.

    It's all a matter of "the number of nines". Mainframes are incredibly expensive, and have maintenance budgets bigger than the entire IT budgets of many companies! However, the uptime (I'm thinking about systems at some of the telcos that I've worked with) is 100% (i.e. at least six or seven "nines".)

    I'm sure that with a properly configured linux cluster, geographic distribution (2 or 3 sites minimum,) redundant connections and power sources etc. you could predict at least five "nines" ... but there's a difference between modeling predictions and guaranteeing availability. IBM still screws up (remember the recent Danish bank catastrophe?), and mainframes do crash, and mainframe software occasionally has cascading failures even in a redundant configuration, but it's as safe of a bet as I know of.

    Guglielmo: (regarding Coherence) 1) Tell me a little more about your implementation, so I understand what's really going on when the scalability is being "killed" :)

    It's a logical problem. If you have to guarantee a full replica, the cost increases slightly per machine, throughput being held constant.

    Throughput is not constant, however. Each machine you add is added for a reason, i.e. there is more work to be done. So desired throughput is a linear scale, meaning (for example) you get 2000 tx/sec out of two servers, so you want 4000 tx/sec out of four, right? With a full replica, each server is trying to pump out 1000 tx/sec but now each server is handling three other servers' outbound volume as well, so instead of spending 10% of the time on "housekeeping" (keeping up with the other servers' outbound volume,) each server spends 30% of its time on it. (BTW - you see the same issues with total-ordering network protocols.)

    Furthermore, full replicas require data distribution to the entire graph of nodes, typically implying multicast (for efficiency reasons.) Whether or not one uses multicast or point-to-point protocols, the network will increasingly become a bottleneck. Ethernet, for example, will suffer from an exponentially increasing percentage of dropped packets as it becomes saturated. TCP/IP will throttle itself back severely under network load. Blindly multicasting can cause multicast floods in this situation. Dynamic flow control (TCP/IP builds it in, UDP requires more work, like what Bela was talking about on the other thread) is necessary to avoid packet floods. So once a certain percentage of network bandwidth is utilized, there are quickly diminishing returns on the ability to fully replicate data, due to the network being a bottleneck.

    In other words, there are logical and physical reasons that cause full replica models to scale sub-linearly, and to hit an eventual brick wall (i.e. zero or negative scale.) We've proven this in our own lab tests quite convincingly ;-).

    Guglielmo: 2) Even if scalability were not an issue, *availability* changes with the number of replicas. I have a book at home which gives a reasonable formula for it, based on the mean-time-between-failures and the mean-time-to-recovery, I think. Basically the deal is that beyond N machines, availability actually decreases.

    That depends. Yes, I understand the concept, but it does have some underlying assumptions. Sun's recent paper on SunONE actually talked about this as well. (I learned quite a bit from it.)

    Our architecture is explicitly designed to avoid the scenarios that would lead to an inverse relationship between the size of the cluster and availability. In other words, we've decoupled the cluster nodes as much as possible to avoid these types of problems. For example, we explicitly avoid total ordering, reliance on specific nodes, and "centralized" data structures, such as directory services. These are all things that can introduce SPOFs, and even if you account for the SPOFs, these things can introduce complexity in failover and distribution that can lead to theoretically untestable (due to chaos theory) edge conditions.

    Further, the availability issue is affected by the length of time it takes to boot the cluster (or re-boot it.) The idea is that in a large cluster, it takes time to bring the whole thing up. However, in a peer-to-peer cluster, the cluster is "available" as soon as one node is up, and it is resilient to software (e.g. JVM) failure as soon as any two nodes are up, and resilient to server failure as soon as two nodes that are located on any two different servers are up. So depending on the specific application software, that could mean that a cluster has no worse recovery time regardless of the cluster size. (However, I agree that there are exceptions to this in the real world, so I do believe you are right in some situations.)

    The key to cluster reliability is to design out the edge conditions. In other words, the architecture must be such that edge conditions cannot exist. I can seriously say that we have spent ten times as long in architecting out edge conditions (long before we started any coding) as we have spent in implementation, and we still had to spend ten times as long in testing as we spent in implementation, plus many tens of thousands of CPU-hours spent trying to disprove our architectural solution.

    I am fond of saying that you cannot actually prove that distributed computing works, but that you can only try to prove that it doesn't work, and then fail to prove that it doesn't work. In other words, one cannot prove that clustered services (e.g. Coherence) actually work under all conditions and scenarios. (Coherence itself is a peer-to-peer clustered SOA from the ground up.) It's impossible to prove because with a network and multiple separate computers (not to mention multiple processes each with multiple threads), there is a level of indeterminability that is impossible (given finite computing resources) to even model! Which packets will collide? When will GC occur? How long will it take? Will the packet buffers overflow during GC because it takes longer than a couple of milliseconds? Which packets will not be delivered? Which will be corrupted? Which will be out of order? If there are three nodes (A,B,C) and A sends something to both B and C, then B "responds" to A and cc:s C, how often will the response packet from B arrive at C before the request packet from A arrives at C?

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  42. What is the Bottleneck in XA?[ Go to top ]

    Guglielmo: It's probably like infiniband. I was just looking at Voltaire's implementation: 160Gbps, and 200ns latency.

    > I don't think so, to be honest. I mean, the numbers may sound similar, but the mainframe IO buses are extremely wide and highly parallelized.

    I don't know. What I know about Parallel Sysplex is from Gregory Pfister's book, and the architecture seems to be the same. The user-space program talks directly to the "nic", sends out the message, and waits for the response.

    The only big difference I think is that the mainframe processors actually have a special assembly instruction for doing this. But I don't know what difference this makes.
     
    > It's all a matter of "the number of nines". Mainframes are incredibly expensive, and have maintenance budgets bigger than the entire IT budgets of many companies! However, the uptime (I'm thinking about systems at some of the telcos that I've worked with) is 100% (i.e. at least six or seven "nines".)

    How many nines do you get with Coherence on commodity hardware?

    > I'm sure that with a properly configured linux cluster, geographic distribution (2 or 3 sites minimum,) redundant connections and power sources etc. you could predict at least five "nines" ... but there's a difference between modeling predictions and guaranteeing availability. IBM still screws up (remember the recent Danish bank catastrophe?), and mainframes do crash, and mainframe software occasionally has cascading failures even in a redundant configuration, but it's as safe of a bet as I know of.

    What do companies do about downtime due to software upgrades? Do telcos just have really stable software, or do they actually use evolvable software, e.g. n / n+1 compatibility?

    I often wonder about this because I think the biggest sources of downtime are software upgrades (or bugs) and power failures. I must have read that somewhere.

    > It's a logical problem. If you have to guarantee a full replica, the cost increases slightly per machine, throughput being held constant.

    But the specifics will vary a lot with the implementation.
     
    > Throughput is not constant, however. Each machine you add is added for a reason, i.e. there is more work to be done. So desired throughput is a linear scale, meaning (for example) you get 2000 tx/sec out of two servers, so you want 4000 tx/sec out of four, right?

    That's assuming a workload that has zero shared data. With shared data that's not what I expect.

    I think Pfister said that when Parallel Sysplex first came out they ran a test with a fully shared workload, and they had 13% overhead. So the scalability was linear, but the rate of increase was 13% smaller.

    Even though that's really good performance he says that the press concentrated on the overhead, not on the fact that you get linear scalability. In this regard he also mentions the Lou Gerstner Dog Joke, which was funny :)

    > With a full replica, each server is trying to pump out 1000 tx/sec but now each server is handling three other servers' outbound volume as well, so instead of spending 10% of the time on "housekeeping" (keeping up with the other servers' outbound volume,) each server spends 30% of its time on it.

    >(BTW - you see the same issues with total-ordering network protocols.)

    Absolutely. I think there are other advantages, but that's for another time.

    > Furthermore, full replicas require data distribution to the entire graph of nodes, typically implying multicast (for efficiency reasons.) Whether or not one uses multicast or point-to-point protocols, the network will increasingly become a bottleneck. Ethernet, for example, will suffer from an exponentially increasing percentage of dropped packets as it becomes saturated. TCP/IP will throttle itself back severely under network load. Blindly multicasting can cause multicast floods in this situation. Dynamic flow control (TCP/IP builds it in, UDP requires more work, like what Bela was talking about on the other thread) is necessary to avoid packet floods. So once a certain percentage of network bandwidth is utilized, there are quickly diminishing returns on the ability to fully replicate data, due to the network being a bottleneck.

    Sounds very familiar.

    > In other words, there are logical and physical reasons that cause full replica models to scale sub-linearly, and to hit an eventual brick wall (i.e. zero or negative scale.) We've proven this in our own lab tests quite convincingly ;-).

    Can you write down the results of your tests here?

    > Our architecture is explicitly designed to avoid the scenarios that would lead to an inverse relationship between the size of the cluster and availability. In other words, we've decoupled the cluster nodes as much as possible to avoid these types of problems.

    Do you have studies of the availability of a Coherence-based system? So I can follow the details?

    > Further, the availability issue is affected by the length of time it takes to boot the cluster (or re-boot it.) The idea is that in a large cluster, it takes time to bring the whole thing up. However, in a peer-to-peer cluster, the cluster is "available" as soon as one node is up, and it is resilient to software (e.g. JVM) failure as soon as any two nodes are up

    I think the mean-time-to-recovery could be considered the time it takes the system to reach peak performance again. I guess one could define availability as the probability that a tx will take less than X ms.
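
    For reference, the textbook formula I was thinking of is steady-state availability A = MTBF / (MTBF + MTTR). Here's a rough sketch of how that composes across replicas if you (unrealistically) assume failures are independent - which is exactly the assumption that breaks down in a real cluster, and why adding machines doesn't keep improving availability forever (the MTBF/MTTR numbers below are made up):

        // Rough availability sketch: single-node A = MTBF / (MTBF + MTTR),
        // then "at least one replica up" assuming independent failures.
        // Made-up numbers; real clusters have correlated failures, shared
        // networks and state-transfer costs, so this is only an upper bound.
        public class AvailabilityEstimate {
            public static void main(String[] args) {
                double mtbfHours = 1000.0;   // assumed mean time between failures per node
                double mttrHours = 1.0;      // assumed mean time to recovery per node

                double singleNode = mtbfHours / (mtbfHours + mttrHours);

                for (int replicas = 1; replicas <= 5; replicas++) {
                    double atLeastOneUp = 1.0 - Math.pow(1.0 - singleNode, replicas);
                    System.out.printf("%d replica(s): %.7f%n", replicas, atLeastOneUp);
                }
            }
        }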

    > The key to cluster reliability is to design out the edge conditions. In other words, the architecture must be such that edge conditions cannot exist.

    When you say "edge conditions" do you mean the "worst-case behavior"?
     
    > I am fond of saying that you cannot actually prove that distributed computing works, but that you can only try to prove that it doesn't work, and then fail to prove that it doesn't work. In other words, one cannot prove that clustered services (e.g. Coherence) actually work under all conditions and scenarios. (Coherence itself is a peer-to-peer clustered SOA from the ground up.) It's impossible to prove because with a network and multiple separate computers (not to mention multiple processes each with multiple threads), there is a level of indeterminability that is impossible (given finite computing resources) to even model! Which packets will collide? When will GC occur? How long will it take? Will the packet buffers overflow during GC because it takes longer than a couple of milliseconds? Which packets will not be delivered? Which will be corrupted? Which will be out of order? If there are three nodes (A,B,C) and A sends something to both B and C, then B "responds" to A and cc:s C, how often will the response packet from B arrive at C before the request packet from A arrives at C?

    I do sympathize with you :) But if you allow me this is why I like totem. It has predictable performance. The thruput you can pretty much estimate from a simple test (even a single-node ring) and the latency follows a linear distribution between L and N * L, where L is the time it takes to send a multicast.

    Incidentally, yesterday I took out the slow machine from my three-ring test, and I got a big increase in thruput.
  43. What is the Bottleneck in XA?[ Go to top ]

    Guglielmo: How many nines do you get with Coherence on commodity hardware?

    There are quite a few 24x7 applications running on Coherence now, and we have had very few reported cases of production downtime (few enough to count on one hand with fingers left over.) However, the product has only been "out in the wild" for two years now, so I don't think we really have enough data to give a true estimate. For example, our customers all have different startup times, so if a cluster does go down (e.g. accidentally for whatever reason, or purposefully for maintenance or something) then the amount of time that it takes to come back up will affect the actual number of "nines". Several of our customers load their entire data sets into partitioned caches for example ... that can take tens of minutes to load, even with a parallel grid loader.

    In that recent Sun paper, they had all sorts of risk modeling tools and what-not that we have never used (and I've never even seen,) so any number I give is really just a guess. I'd say in a good datacenter (redundant pipes and power + good monitoring) and with good planning (meaning elimination of other SPOFs and methods for minimizing or eliminating downtime due to periodic maintenance and upgrades), you would be able to obtain five nines. We do have government and financial services customers that have been running on Coherence for a long time without a single second of downtime.

    I'll add that the number of "nines" is about predictability. A lot of products brag about five nines (or more,) even if they haven't sold a copy yet or been in production more than 30 days, but talk is cheap (including mine,) and actually achieving a predictable five nines is extremely challenging. Five nines means that you get only about five minutes in an entire year that the app is not available. That's it!

    As for the servers being commodity hardware, that actually doesn't affect us much, because we don't mind if the servers fail. For us, that is the point of clustering -- all of our clustered services are designed to survive server failure without data loss.

    Guglielmo: What do companies do about downtime due to software upgrades? Do telcos just have really stable software, or do they actually use evolvable software, e.g. n / n+1 compatibility? I often wonder about this because I think the biggest sources of downtime are software upgrades (or bugs) and power failures. I must have read that somewhere.

    A lot of core telco software is ancient, so it's pretty stable. In environments like that, hardware and software are designed and tested for high availability and redundancy, and they've had decades to work out the kinks. They pay through the nose for it, often 100x what you would pay for similar capability. They don't have power outages, because they have redundant power suppliers plus batteries plus generators.

    I think 99% of downtime is caused by software bugs. (According to my Pentium, 100.000000001% is caused by software, but I think it's lying.)

    Guglielmo: When you say "edge conditions" do you mean the "worst-case behavior"?

    No. I am referring to any scenario that is "off the beaten path". You know, OutOfMemoryError from any new() or new[], or InterruptedException from any wait, or basically any exception that is actually exceptional. Those are just examples. Edge conditions traditionally refer to inputs that are close to the boundaries, such as number inputs that are close to 2^31 or -2^31 or -1 or 0 or 1 (i.e. numbers that are close to some "edge".) That's because 99.99999% of the time, everything is boring and OK ... it's the other .000001% of the time that the exception occurs and goes into the code that is commented as "this cannot ever get here" or "this will not happen" or "TODO" or "printStackTrace()".

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  44. What is the Bottleneck in XA?[ Go to top ]

    No. I am referring to any scenario that is "off the beaten path". You know, OutOfMemoryError from any new() or new[], or InterruptedException from any wait, or basically any exception that is actually exceptional. Those are just examples. Edge conditions traditionally refer to inputs that are close to the boundaries, such as number inputs that are close to 2^31 or -2^31 or -1 or 0 or 1 (i.e. numbers that are close to some "edge".) That's because 99.99999% of the time, everything is boring and OK ... it's the other .000001% of the time that the exception occurs and goes into the code that is commented as "this cannot ever get here" or "this will not happen" or "TODO" or "printStackTrace()".


    I see. Very good point. I guess one way to keep bugs out would be to keep the critical code as small as possible. In OSS they use peer review, I guess, by those who bother to read the code.
  45. What is the Bottleneck in XA?[ Go to top ]

    Several of our customers load their entire data sets into partitioned caches for example ... that can take tens of minutes to load, even with a parallel grid loader.


    If I may - do you transfer the state from one node to the other (during startup, I mean) or do you load it from the database?

    I think one of the thorniest problems is doing state transfer to a new node that is joining the cluster, because it has to be done without hurting performance too much, and yet it has to go fast enough that the new node does catch up eventually.
  46. Guglielmo: I see. Very good point. I guess one way to keep bugs out would be to keep the critical code as small as possible. In OSS they use peer review, I guess, by those who bother to read the code.

    Yes, peer review is an essential part of the quality process, and that includes peer review of: architecture, design, coding, unit tests, fixes, documentation, etc. OSS projects with multiple overlapping developers often get peer review de facto because the developers have no choice but to look at each other's code. The more critical (=important) and popular modules can end up with many eyes on them. Coherence is not open source, but we do use (=require) peer review extensively, both internally, and also with some of our customers that review the specific parts of the product that they are most interested in.

    As far as keeping the critical code small, that is pretty much what one does to remove edge conditions. I would describe it more as a "funnelling" process in the architecture (carried several steps further in the actual design) where things that could become edge conditions are forced into pre-defined routes, so that the potential states of the machine are all forcibly pre-defined. Designing such a state machine "by hand" is incredibly complex, but it pays off in spades, because (if you do it right) you will end up with only known states, and thus "answers for every question."
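
    Just to illustrate the "funnelling" idea (a toy sketch I'm making up on the spot, not Coherence code): every state is pre-defined, every transition goes through one choke point, and anything unexpected is routed into the one failure path that the rest of the system knows how to handle, instead of into a "this cannot ever get here" comment:

        enum NodeState { JOINING, RUNNING, LEAVING, FAILED }

        final class NodeStateMachine {
            private NodeState state = NodeState.JOINING;

            synchronized void transition(NodeState next) {
                boolean legal =
                    (state == NodeState.JOINING && next == NodeState.RUNNING) ||
                    (state == NodeState.RUNNING && next == NodeState.LEAVING) ||
                    (next == NodeState.FAILED);   // failure is reachable from anywhere
                if (!legal) {
                    NodeState from = state;
                    // unknown transitions are funnelled into the one state
                    // that the rest of the system knows how to deal with
                    state = NodeState.FAILED;
                    throw new IllegalStateException("illegal transition " + from + " -> " + next);
                }
                state = next;
            }

            synchronized NodeState current() { return state; }
        }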

    Guglielmo: If I may - do you transfer the state from one node to the other (during startup, I mean) or do you load it from the database?

    Remember, it's an SOA. The first service that a node joins is the clustering service itself, which only has "data" about the other nodes and services. When a node joins a cache service, if it is a replicated service, then the data is transferred as part of the join, so once the service is "joined" the data is there (has already been replicated.) If it joins a partitioned service, then the data is not there at join time, but the dynamic repartitioning will have been initiated by the join, if the configuration warrants it; in other words, the data is not re-partitioned yet, but the wheels of that process have started moving.

    Guglielmo: I think one of the thorniest problems is doing state transfer to a new node that is joining the cluster, because it has to be done without hurting performance too much, and yet it has to go fast enough that the new node does catch up eventually.

    Yes, that's all true, and more. Fortunately, we have ways of dealing with precisely these issues, by prioritizing work that the services are doing. For example, if a node running a partitioned service goes down, a risk evaluation is performed to determine if any data is in danger of being lost, such as would be the case if there is only one level of backup for the data, and if it is, then a high priority backup task is begun, and clients would slow down (not stop) if they conflicted with that task. Similarly, if the service determines that the partitions are not balanced, it will start a balancing task, but it will have a lower priority than the backup would, because no data is actually in danger.

    Prioritization is essential to avoid out of memory conditions and network floods. In a way, it is similar to flow control on a network. If too much is being attempted at one time, then nothing gets done. Prioritization allows you to control the incoming flow of work, for example the work being requested through the various service APIs, without dropping the ball on the critical work that is going on behind the scenes that ensures the cluster and data integrity.

    I'll make up an example. Pretend that there's some asynchronous transactional data cache, and client threads are constantly stuffing data into it, lots of threads stuffing lots of data, but the prepare/commit processing is synchronous, meaning it has to get all that data "ensured by the cluster" and the commit processing has to meet the ACID requirements. If something goes wrong (e.g. a server dies) and the data has to be moved around the network to ensure its safety (backups, etc.), that re-org is going to impact the ability of transactions to commit, because the base against which they are committing is in flux. So all these threads are pumping data in, but it's not getting committed, so the pending commits are piling up, and they are in-memory, and obviously in some form queued for network delivery as well. This could lead to an out-of-memory condition, and if it did, that condition could cascade because it could kill one JVM which would cause another re-org to occur, etc. In other words, if you can't prioritize work, it becomes very easy for something relatively simple and accounted for (failover) to become a cascading failure that will kill an entire cluster.
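
    In code terms, the core of that prioritization can be as simple as a priority queue in front of each service's worker thread (a toy sketch of the idea, not our implementation - and in reality you'd also need to bound what clients can enqueue, which is the flow-control half of the problem):

        import java.util.concurrent.PriorityBlockingQueue;

        // Toy sketch: integrity work always jumps ahead of routine client work.
        class ClusterTask implements Comparable<ClusterTask> {
            static final int BACKUP_ENDANGERED_DATA = 0;   // highest priority
            static final int REBALANCE_PARTITIONS   = 1;
            static final int CLIENT_REQUEST         = 2;

            final int priority;
            final Runnable work;

            ClusterTask(int priority, Runnable work) {
                this.priority = priority;
                this.work = work;
            }

            public int compareTo(ClusterTask other) {
                return Integer.compare(this.priority, other.priority);
            }
        }

        class ServiceWorker {
            private final PriorityBlockingQueue<ClusterTask> queue =
                    new PriorityBlockingQueue<ClusterTask>();

            void submit(ClusterTask task) {
                queue.put(task);
            }

            void runLoop() throws InterruptedException {
                while (true) {
                    // a backup of endangered data always runs before a
                    // rebalancing task or an ordinary client request
                    queue.take().work.run();
                }
            }
        }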

    Anyway, under normal conditions, it doesn't really seem to matter, because well over 99% of the time, nothing goes wrong with any piece of software. In fact, most companies don't have any idea that something is wrong with their application availability strategy until a server crashes in production, and the failover fails and takes down all the other servers or loses important data, for example. Then they're completely screwed, and it's too late, and their claimed "five nines" availability is all of a sudden "four nines" or "three nines," and they realize that they have no way to actually predict availability.

    I won't name any names, but in one test we performed with another Java-based clustering / data replication package (and with the help of one of the developers of that package,) handling the replication of only 10MB of data caused exactly this type of cascading failure (OutOfMemoryError), even when we gave it 256MB of heap space (-Xmx256m just to handle 10MB of data.) Further, it ended up flooding the network, causing other things to go wrong (connections lost to database servers, WAN links down, etc.) so not only did the entire cluster die, but it took down other processes and network devices with it! (For comparison purposes, Coherence handled the same scenario without interruption with only 11MB of total heap space (-Xmx11m), even with multiple client threads doing other operations at the same exact time.)

    Anyway, I don't mean to imply that there's any perfect software. We do everything we can to ensure the quality of our software, and we (and unfortunately, sometimes our customers) still find things that are just plain wrong. Our goal though is to provide software that enables companies to achieve 100% uptime as part of their availability strategy, and we're steadily working to achieve that goal.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  47. What is the Bottleneck in XA?[ Go to top ]

    99.99999% of the time, everything is boring and OK ... it's the other .000001% of the time that the exception occurs and goes into the code that is commented as "this cannot ever get here" or "this will not happen" or "TODO" or "printStackTrace()".


    One way to do it is to use the fail-stop model. Personally, I never swallow exceptions, and I throw RuntimeExceptions if the problem is so bad that it's better for the code to die than to generate a secondary bug.
  48. <snip>
    It's a logical problem. If you have to guarantee a full replica, the cost increases slightly per machine, throughput being held constant.
    </snip>

    Replicas serve two purposes- data durability (when a node crashes) and faster data access (when accessed from any node, the replicated data that is available locally is served).

    In homogeneous clusters, where there is an expectation of data availability (accessed) on all the nodes of the cluster (non-partitioned), wouldn't it be faster to replicate to a single failsafe known point (say, a master DB, or in a pure-cache solution a known replication point for all data) rather than to other nodes in the cluster? On each of the nodes there would then just be a cache invalidation mechanism for when the same data (that was modified elsewhere and updated in the failsafe "known point") is in the local cache.

    Cameron, How does Coherence handle such accesses?

    Cheers,
    Ramesh
  49. Ramesh: Replicas serve two purposes- data durability (when a node crashes) and faster data access (when accessed from any node, the replicated data that is available locally is served).

    Yes, very true. The failover is very simple (except for "stitching" in-flight data back together) and the access latency is zero milliseconds (in-memory speed.)

    Ramesh: In homogeneous clusters, where there is an expectation of data availability (accessed) on all the nodes of the cluster (non-partitioned), wouldn't it be faster to replicate to a single failsafe known point (say, a master DB, or in a pure-cache solution a known replication point for all data) rather than to other nodes in the cluster?

    Maybe. You have that option with Coherence. That's also very similar to how SpiritCache works, since it's based on a hub+spoke architecture (SpiritSoft was one of the early JMS vendors.)

    BTW - when I mention "partitioned," I am talking about automatic / transparent / dynamic partitioning of data. To the application, it still looks like all the data is in memory on each node, but it's actually chopped up (partitioned) and spread out over the whole cluster. So it's like what you described above, but instead of having "a single failsafe known point," we use the entire cluster to compose "a single failsafe known point" ;-). Further, you can add Near Caching to it in order to cut latency by keeping commonly-accessed data locally on multiple nodes.

    Ramesh: On each of the nodes there would then just be a cache invalidation mechanism for when the same data (that was modified elsewhere and updated in the failsafe "known point") is in the local cache.

    That's more like our Near Cache technology, which supports different invalidation options: (1) data expiration (no network traffic, just timing out the data in the local cache), (2) distributed sync or async key-based invalidation (starting with our 2.3 release, applicable to clusters of even several hundred servers) and (3) version-based invalidation (generally used with very large chunks of data in relatively small clusters of up to maybe 12 servers.)

    BTW - you specifically asked wouldn't it be faster to replicate to a single failsafe known point ... rather than to other nodes in the cluster, and the answer is "It depends." For example, if the data is needed on basically all nodes, and it isn't changing much, then what you described will end up looking like a fully replicated cache, with a latency hit on first access after modification, which means that it will be slower than a fully replicated cache. OTOH, if it's a lot of data (cache entries,) and it's changing often, and you have good optimizations for invalidations (to avoid global invalidations) and the read:write ratio is lower and the read spread is unpredictable, then you're probably going to see much better performance with what you've described.

    I'll add that there really is no such thing as a "failsafe known point." Even EMC arrays and Oracle databases go down. The key is to make sure that each tier in the application infrastructure is as independently resilient as it can be, and thus can contribute to the overall resiliency of the aggregate. That's why, for the database tier, it's hard to beat Oracle (or DB2 if you're a blue shop, or maybe Oracle+Veritas for certain infrastructure "challenges,") and for the storage tier, it's hard to beat EMC, because these products provide really nice options for achieving HA in the real world. Similarly, IBM MQ Series for enterprise messaging (only slightly faster than the US Postal Service, but highly dependable.) However, we're talking about multi-million-dollar infrastructures now, which are a bit out of the reach of most of us.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  50. Yawn.

    Mr. Purdy, would you be so kind as to not turn every discussion thread into a marketing thread for your cache product? It's really boring, and we've seen quite enough of your context-sensitive but stupid signature on every thread.
  51. \Andreas Mueller\
    Mr. Purdy, would you be so kind as to not turn every discussion thread into a marketing thread for your cache product? It's really boring, and we've seen quite enough of your context-sensitive but stupid signature on every thread.
    \Andreas Mueller\

    I disagree in this instance. The information is very on topic for this thread, provides a very fascinating insight into some very advanced distributed caching techniques, and somebody asked him how it works. Mebbe Cameron can be slightly overzealous where Coherence is concerned, but not in this thread.

        -Mike
  52. I disagree in this instance. The information is very on topic for this thread

    This is not on topic, Mr. Spille. The topic is XA.

    somebody asked him how it works.

    Guerilla Marketing is popular these days...
  53. I disagree in this instance. The information is very on topic for this thread

    >
    > This is not on topic, Mr. Spille. The topic is XA.
    >
    > somebody asked him how it works.
    >
    > Guerilla Marketing is popular these days...

    I am with Mr. Spille on this. It doesn't matter to me if the topic is shifted to something else as long as the new discussion is educational and its quality is high. I personally learned a lot...

    thanks

    -talip
  54. <snip>
    > I disagree in this instance. The information is very on topic for this thread
    > This is not on topic, Mr. Spille. The topic is XA.
    > somebody asked him how it works.
    > Guerilla Marketing is popular these days...
    </snip>
    To be fair to Cameron, he was ONLY responding to my specific pointed questions. Though the questions were on Coherence, the subject matter is generic and applicable to any distributed caching solution (including ours- in Pramati Server).

    Cheers,
    Ramesh
  55. off-topic[ Go to top ]

    Andreas: Mr. Purdy, would you be so kind to not turn every discussion thread into a marketing thread for you cache product?

    First, please don't call me Mr. Purdy. My dad is Mr. Purdy. I'm just Cameron, or maybe you could get away with just "Cam" if you're really cute. Somehow I doubt it though.

    Second, yeah, I probably did get a bit too far out on a marketing limb, but I wasn't trying to, it just happened a little bit at a time, as the thread progressed. You know, like when you wake up with a headache in the morning and you're still wasted drunk and you don't know where you are? Not that I've ever done that, but I saw it happen once in a movie. Anyway, you're right, that a lot of what I was describing wasn't related to XA. I'll go back and mark them as noisy for you.

    Mike: The information is very on topic for this thread, provides a very fascinating insight into some very advanced distributed caching techniques, and somebody asked him how it works. Mebbe Cameron can be slightly overzealous where Coherence is concerned, but not in this thread.

    Yes, overzealous. I do seem to get all excited talking about it. But if I tell you why, it would be marketing. And probably off-topic ;-)

    Peace,

    Cameron Purdy
    Some Company, Inc.
    Some Product (tm): Tag-line here!
  56. /Cameron Purdy/
    I probably did get a bit too far out on a marketing limb,
    /Cameron Purdy/

    I doubt anybody other than Andreas minds though. Personally, I don't care if you gain a bit from it since you've already earned it by way of enlightenment. Just reading this thread and the group comms. thread on the JBoss roadmap news topic, I've learnt about a whole new breed of applications and what kinds of tradeoffs are involved. This is great enlightenment for ignoramuses like me. (and I suspect good information for experts too.)
  57. <snip>
    I'll add that there really is no such thing as a "failsafe known point." Even EMC arrays and Oracle databases go down.
    </snip>

    By known-failsafe-point, I was not referring to a single physical point. This would be a single logical point (required if this is to be absolutely failsafe). Kind of like the solution you describe, where the known point happens to be the current set of replicas for the partitioned data. Only that partitioning has the costs of locating the data and additionally managing the local caches (if any) - even if the data is partitioned, it may also be cached in the local VM when accessed.

    A simple solution may be to not have local caches: always go to the current partition (any of the replicas) to fetch data when needed. But that gives suboptimal performance, and performance considerations will usually demand local caches (in each VM/node). With local caches, managing partitioned masters has its own costs. Having a non-partitioned master (even if it is replicated for failsafety), with local caches and effective cache invalidation, is probably a better option.

    Again, as you said, optimal solutions do largely depend on the specific access dynamics.

    Cheers,
    Ramesh
  58. What is the Bottleneck in XA?[ Go to top ]


    > How does the Z accomplish it?

    Logging is done through MVS system logger to coupling facility list structures. The list structures that logger manages are essentially a cache to hold short-lived log records. Considering the CFs are typically duplexed, have battery backups, and are connected to multiple systems, the data held within the CFs is "durable" across failures and has very quick access times. If the list structure fills up because records live longer than expected, the oldest records are then written to disk.

    Considering transaction DBT/CMT records are very short lived, on a correctly configured system the records never leave the cache (and so never get forced to disk). This allows for a very high tx throughput rate.

    In recovery scenarios, if a system (or logger subsystem) connected to a CF fails, another instance of logger will take responsibility for the log records and write them to disk. As an added benefit, since the log is accessible from anywhere in the sysplex (cluster), log replay can occur on any system in the plex.

    This is the technology that the WebSphere for z/OS JTA implementation exploits. It's done indirectly through a transaction manager integrated into z/OS (RRS) and exploited by DB2, MQ, CICS, WebSphere, and IMS. As a transaction manager, RRS has a bunch of nifty features that are in many ways superior to XA.
  59. Disk forces, when well optimized, can be brought down below 10ms. So a single disk force is probably more expensive than 1000 context switches.

    >
    > That's what I wanted to hear :) So, let's see, how would you achieve durability without using a disk ...

    Anyone tried this on a solid-state Flash RAM disk? They're expensive but VERY fast, and you don't need that much space for transaction logs, right?
  60. What is the Bottleneck in XA?[ Go to top ]

    Context switches can be as low as hundreds or low thousands of CPU cycles. On a 1GHz CPU that is a small fraction of a millisecond (perhaps even less than one microsecond.)


    Incidentally, I found this web page that shows the distribution of the seek time for various hard disks:

    http://thibs.menloschool.org/~djwong/programs/bogoseek/graphs/img/

    I don't know enough about how drives work, but I think that they should basically have a few platters on a single motor, and a head for each platter.

    Assume that when you force a (tiny) record to the log, the time it takes is basically rotational latency. If you are just appending consecutively, then the head doesn't need to move radially, it just has to wait for the motor, and the average rotational wait should be half of a full revolution. So for 15,000 rpm = 250 revolutions per sec, a revolution takes 4 ms, which gives roughly 2 ms average rotational wait per append.

    And since the motor can only spin to one place at any given time, you get at most about one forced append per revolution, so the thruput should be about 1 / 4ms = 250/s. Does that make sense? That does sound like the kind of numbers you have been talking about. And the latency is a few ms, say.
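
    Just to write the back-of-the-envelope numbers down (my own rough model - it ignores seek time, controller caches and command queueing):

        public class ForcedAppendEstimate {
            public static void main(String[] args) {
                double rpm = 15000.0;
                double revsPerSec = rpm / 60.0;           // 250 revolutions per second
                double revMillis = 1000.0 / revsPerSec;   // 4 ms per full revolution

                System.out.printf("Full revolution:     %.1f ms%n", revMillis);
                System.out.printf("Avg rotational wait: %.1f ms%n", revMillis / 2);
                // roughly one forced append per revolution if every force has to
                // wait for the platter to come back around to the end of the log
                System.out.printf("Max forced appends:  ~%.0f per second%n", revsPerSec);
            }
        }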

    Whereas with a replicated in-memory database, you'll get about the same latency but a thruput of thousands (or tens of thousands) of messages per second. So if you wait to write to the disk, then you can seek very rarely, and then blast one or more sectors at once - is that limited only by the HD buffer ? - which should be very fast (I have to look up the write speed.) I guess that's what Mike is talking about when he talks about batching disk forces.

    Mike, I apologize in advance if you already said this somewhere: how are you ensuring durability between disk forces again?
  61. What is the Bottleneck in XA?[ Go to top ]

    \Guglielmo Lichtner\
    Assume that when you force a (tiny) record to the log, the time it takes is basically rotational latency. If you are just appending consecutively, then the head doesn't need to move radially, it just has to wait for the motor, and the average rotational wait should be half of a full revolution. So for 15,000 rpm = 250 revolutions per sec, a revolution takes 4 ms, which gives roughly 2 ms average rotational wait per append.
    \Guglielmo Lichtner\

    What I know about disk drives you could fit on the head of an amoeba's hat pin.

    I do know your average everyday disks don't come close to 4 milliseconds for average force duration.

    You're right that optimized tran logs generally just do tiny appends, so there shouldn't be a lot of head banging going on. However, the following does have to happen:

       - The JVM has to (maybe) get the data to the kernel, if you're not using NIO
       - Context switch to kernel
       - Go out to device driver
       - Go out to controller
       - Get data onto disk
       - Wind back up to kernel
       - Do whatever kernel housekeeping there may be in flushing a VM page
       - Come back to the VM

    Plain old disks seem to take 20-30 milliseconds to do this. RAID arrays do better.

    \Guglielmo Lichtner\
    I guess that's what Mike is talking about when he talks about batching disk forces.
    \Guglielmo Lichtner\

    This is simple at the Java level. If you happen to be using NIO, then you're going to do a:

       channel.force (true); // mebbe false ....

    Plus, your access to your Tranlog has to be serialized, so you don't munge up different transaction data coming through it. What you want to avoid, conceptually, is something like 5 transactions happening more or less concurrently and doing this:

        Tran 1: write
        Tran 1: force
        Tran 2: write
        Tran 2: force
        Tran 3: write
        Tran 3: force
        Tran 4: write
        Tran 4: force
        Tran 5: write
        Tran 5: force
        
    The writes don't matter much, but the forces do. "Batching" of the forces does something like this:

        Tran 1: write
        Tran 2: write
        Tran 3: write
        Tran 4: write
        Tran 5: write
        Tran [1-5]: force

    Basically, ya do a bunch of writes which will average out as very fast indeed, and then disk force the bunch of them. Some of the writes may actually end up going all the way to disk anyway of their own accord, maybe even due to a past batched-up force, but the point is that if you batch things up this way you are getting the force guarantee in batches of transactions.

    This means that one force can cover many transactions.

    Overall, after the "force" point you know definitively what has gone to disk. Prior to force, you have some I/O that may or may not have gone out.
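
    In NIO terms the difference is just where the force() lands - here's a bare-bones sketch of the batched shape (file name and record contents are made up, and a real log would obviously track offsets, headers, etc.):

        import java.io.RandomAccessFile;
        import java.nio.ByteBuffer;
        import java.nio.channels.FileChannel;

        public class BatchedForceSketch {
            public static void main(String[] args) throws Exception {
                FileChannel log = new RandomAccessFile("tranlog.dat", "rw").getChannel();

                // five transactions' records get written (buffered by the OS)...
                for (int tran = 1; tran <= 5; tran++) {
                    ByteBuffer record = ByteBuffer.wrap(("prepare tran " + tran + "\n").getBytes());
                    log.write(record);
                }

                // ...and a single force covers all five, instead of five separate
                // forces - the writes are cheap, the force is the expensive part
                log.force(false);   // false: force the data, skip file metadata

                log.close();
            }
        }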

         -Mike
  62. What is the Bottleneck in XA?[ Go to top ]

    Plain old disks seem to take 20-30 milliseconds to do this. RAID arrays do better.


    Thanks for the rule of thumb.

    > Basically, ya do a bunch of writes which will average out as very fast indeed, and then disk force the bunch of them.

    Forgive me for dwelling on this, but what kind of speed-up are you getting by batching in your system?

    Also, to clear this up, the resource manager threads have to wait to return until the consolidated force is done, so there is no durability problem, right?

    What I was hinting at above was somehow achieving temporary durability (say by keeping replicas,) and then forcing, say, one megabyte at a time. Do you think that would make any difference? The nice thing about it is that the resource manager can return right away - lower latency. I think this might be called NVRAM - there are lots of papers on it.
  63. What is the Bottleneck in XA?[ Go to top ]

    \Guglielmo Lichtner\
    Forgive me for dwelling on this, but what kind of speed-up are you getting by batching in your system?
    \Guglielmo Lichtner\

    Well, first of all latency _can_ go up a bit per individual thread. This extra latency can vary widely depending exactly on how you're batching, how fast your underlying disk forces are, and blind luck of how many batch "hits" you get vs. misses. If you've got a throttler built into your batching, which I generally do, this latency _can_ go up to 10-20 milliseconds in worst cases. But there's another side to this, which I'll go into in a minute.

    On the plus side, many transactions are sharing in this, so your overall throughput goes way up.

    In my own RM implementation, I need two disk forces per transaction - 1 for prepare(), one for commit(). My initial first cut implementation peaked out on a Plain Old Disk on an HP-UX L-class system around 29.<mumble> transactions per second - yes, just shy of 30 trans per second, meaning 60 disk forces per second, meaning a disk force took around 17 milliseconds. This by itself is rather sucky.

    But - overall latency could get pretty bad. One guy takes, let's say, 20 milliseconds to do a prepare. OK. What if 5 guys come in more or less simultaneously to disk force a prepare? In that instance, you see this:

        prep 1 - 20 millis
        prep 2 - 40 millis
        prep 3 - 60 millis
        prep 4 - 80 millis
        prep 5 - 100 millis

    Yikes - response time starts going through the roof, because you're serialized on a slow resource.

    Look at the batching side now, and let's say during peak periods you've also got 5 guys coming in more or less simultaneously:

        prep 1 - 30 millis
        prep 2 - 30 millis
        prep 3 - 30 millis
        prep 4 - 30 millis
        prep 5 - 30 millis

    Here, I'm taking some liberty and saying that a 10 milli overhead is happening from the throttler. In this case, all 5 guys share one disk force, and all 5 guys pop out at once 30 milliseconds later.

    The key here is that disk write() calls don't count, 'cuz for the most part they're buffered in memory somewhere and a write() call will rarely actually get anything to disk on average. For the 5 calls above, maybe I wrote out 2.5K - in fact, it may be much less, cause in reality some of those batched in will be commit() calls, which have really tiny overhead for me (something like 200 bytes, most of which is taken up by the damnably huge Xid).

    In addition - forcing isn't very proportional to data size when you're talking tiny sizes. Forcing 5 bytes will take about the same amount of time as forcing 1000 bytes.

    In my own tests, throughput w/ batching will vary based on how much throttling I put in, and how fast the disk is, and some other factors, but I can get throughput up to 150 trans/second - which is 300 force attempts. I'm only _really_ forcing around 40 times per second, but those 40 real forces cover 300 logical force requests.

    \Guglielmo Lichtner\
    Also, to clear this up, the resource manager threads have to wait to return until the consolidated force is done, so there is no durability problem, right?
    \Guglielmo Lichtner\

    Yep. Basically, the "consolidated force" forces for several guys all at once. In fact, if you have a high-resolution timer and fast logging happening somewhere in the background, you can actually see 7 or 8 prepare/commit calls (mixed in of course) completing within the same millisecond. The durability for all 7 or 8 is taken care of in 1 disk force call, instead of trying 7 or 8 individual calls.

     
    \Guglielmo Lichtner\
    What I was hinting at above was somehow achieving temporary durability (say by keeping replicas,) and then forcing, say, one megabyte at a time. Do you think that would make any difference? The nice thing about is that the resource manager can return right away - lower latency. I think this might be called NVRAM - there are lots of papers on it.
    \Guglielmo Lichtner\

    This is in effect what EMC arrays do. They have a multi-gigabyte RAM cache, and a "disk force" only (usually) forces to cache. The fact that they're RAID also means that data gets spread out over multiple drives, which reduces the actual rare flush-to-disk occurrences.

    I've heard people talk about all RAM disks made up exclusively of non-volatile RAM. Sounds cool, but I haven't seen any of these things. Size isn't a problem when you're talking about XA transaction logs (I have some installations where the total log size is <5MB). But I wonder about write speeds and stability under heavy loads - if you're doing, say, 50 writes per second all day long, how well do one of these suckers hold up?

         -Mike
  64. What is the Bottleneck in XA?[ Go to top ]

    I've heard people talk about all RAM disks made up exclusively of non-volatile RAM. Sounds cool, but I haven't seen any of these things. Size isn't a problem when you're talking about XA transaction logs (I have some installations where the total log size is <5MB). But I wonder about write speeds and stability under heavy loads - if you're doing, say, 50 writes per second all day long, how well do one of these suckers hold up?


    I have never used a solid state disk either, I only know someone who did. But I found this link:

    http://www.superssd.com/products/ramsan-320/index.htm

    It's disk + SDRAM + battery. It says it does 250,000 I/Os per second.

    It would be interesting if you could borrow one and see what it does to your system ;)
  65. What is the Bottleneck in XA?[ Go to top ]

    Just out of curiosity, why would you have to pause between forces to wait for more threads to log? I'm just thinking that under light load, you want to force immediately, and under heavy load, you will have enough parallel threads that there will always be multiple items to force (because multiple threads are always trying to prepare or commit.)

    In other words, it almost seems that the optimal API is something like:

    void log(byte[] data);   // blocks until the record is written and forced

    And the implementation is:

    synchronized (log) {
        // post the data to the queue
        log.add(data);

        // wait for the flush (wait() releases the log monitor while blocked;
        // a real impl would loop on a flush-generation check to guard
        // against spurious wakeups)
        log.wait();
    }

    Then a forcing thread could always be taking the enqueued data, streaming it out, and forcing, and doing a log.notifyAll() (on the same monitor the requestors are waiting on) when the force completes.

    Assuming that the log monitor is held from the point before the write begins to the point after the notifyAll completes, thread synch is taken care of.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  66. What is the Bottleneck in XA?[ Go to top ]

    \Cameron Purdy\
    Just out of curiosity, why would you have to pause between forces to wait for more threads to log? I'm just thinking that under light load, you want to force immediately, and under heavy load, you will have enough parallel threads that there will always be multiple items to force (because multiple threads are always trying to prepare or commit.)
    \Cameron Purdy\

    I think I explained it rather poorly. You don't _have_ to pause between forces. You can set it up so that disk forcing happens in a background thread, and this thread's activity is synchronized with "requestors" - either I/O or just disk force - which block on a wait waiting for the force to happen. When you're about to force in this background thread, you pull all the waiting requestors off to the side and associate them with the force attempt. Disk forcing then begins, and meanwhile other requestors can be coming in as well. These get queued up as the next round of forcers.

    You can code your background forcer thread to either wait for any requestors being queued up and immediately force when it sees one - which will mean constant disk activity if there's always at least 1 guy waiting - or throttle it a bit (more on that below). Sort of like this, in kinda-sorta pseudo code (this is off the cuff and from memory!):

        while (active) {
           synchronized (lock) {
              // Wait until at least one requestor has queued up for a force
              while (requestorQueue.isEmpty()) {
                 lock.wait();
              }
              // Turn requestor queue into "force now queue", make new requestor queue
              forceNowQueue = requestorQueue;
              requestorQueue = new ForceQueue();
           }
           // The requestors already did their write()s; one force covers them all
           log.force();
           synchronized (forceNowQueue) {
               forceNowQueue.setDone();
               forceNowQueue.notifyAll();
           }
        }

    The above assumes two queues (well, not really queues, more like sets) one for requestors, one for guys whose data is being forced now. All activity is synchronized both by "lock" above and either requestorQueue or forceNowQueue.

    On the requestor side, they'd do something like:

       force ()
       {
          ForceQueue myQueue = null;
          synchronized (lock) {
             // Join whichever batch is currently collecting, and wake the forcer
             myQueue = requestorQueue;
             myQueue.add (this);
             lock.notify();
          }
          synchronized (myQueue) {
             // The done flag guards against the force finishing before we get here
             while (!myQueue.done()) {
                myQueue.wait();
             }
          }
       }

    As you indicated, all this thread stuff can be hidden behind a blocking API from a "user" point of view.

    In addition, the force disk thread can be configured to throttle a bit. My own implementation supports this, to avoid banging on the disk too much. Since I have a cluster, all of whom are banging on a shared EMC along with the app server's transaction logs, it's desirable not to have one process hog the EMC. So you insert a configurable wait time in the force disk thread. Even if you only wait 10 milliseconds, this only slightly elongates the user-visible force times, but it gives the disk a breather so you don't saturate it.

    This approach is admittedly a bit simplistic, but it has the virtues of being easy to code and understand (relatively speaking), and gets a lot of bang for the buck. A more aggressive and complex approach would be to measure disk force times and incoming request rates, and optionally throttle based on those (plus some configurable bias factor).

    As for separating I/O from disk forcing, I've found it doesn't make much difference for my purposes if I/O is done w/ the forcing or done separately with the right synchronization. This is probably because of the fact that:

      - I'm using NIO
      - The small size of the records being written
      - A lot of I/O is non-forcing, and is initiated by a fire-and-forget call from the client to the server.

    Combining non-forcing I/O coming out of fire and forget calls with later prepares() and commits() that do need forcing, and doing the forcing as "batched" separately from the I/O, simplifies my code, and I benefit from doing a bit more background I/O work in the fire and forget calls. In fact I record the entire XA protocol in my transaction logs, and only pay a few percent penalty for recording this extra data. The payoff is it's a lot easier to debug protocol problems, and to find timing problems 'cuz I can measure the length of various parts of the transaction.

         -Mike
  67. What is the Bottleneck in XA?[ Go to top ]

    Just curious:

    >   - I'm using NIO

    Then I guess you have a fixed length tx log, otherwise it wouldn't make much sense. How do you remember the actual length of the log (the last byte of the last written log rec so that you can continue [after a restart] at that point)? Do you have fixed size log records? What if the last record is corrupt because of a crash? Further, XA log records usually live longer (in-doubt tx). How do you maintain those?

    >   - The small size of the records being written

    Don't you write a before image for consumed (deleted) messages and an after image for produced messages?
  68. What is the Bottleneck in XA?[ Go to top ]

    \Andreas Mueller\
    Then I guess you have a fixed length tx log, otherwise it wouldn't make much sense.
    \Andreas Mueller\

    Yep. Actually, I have dual fixed length logs, one "active", the other "standby". When one side gets full I switch over to the other. They're pre-initialized at startup if no Tranlogs are there.

    \Andreas Mueller\
    How do you remember the actual length of the log (the last byte of the last written log rec so that you can continue [after a restart] at that point)?
    \Andreas Mueller\

    I don't remember it. Every record is written out w/ an "END_LOG" token at the end, and each subsequent record is written by backing up over those "END_LOG" bytes (and hence overwriting them).

    The log is read on startup, if present, and is sort of "pretend replayed". Any info found without a prepare at the end of the read is presumed aborted, and thereby is actively rolled back. Any prepare record found without a commit is presumed to be a dangling in-doubt, and I hold onto that in-memory, waiting anxiously for the TM to call recover() on me....
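
    For anyone who wants to picture it, here's a stripped-down sketch of the append-with-sentinel trick (the token, names and layout here are invented for illustration - my real format has fixed headers, dual logs, replay, etc.):

        import java.io.RandomAccessFile;

        // Each append backs up over the previous END_LOG token and writes a new
        // one after itself, so on restart the reader knows where valid data ends.
        public class SentinelLog {
            private static final byte[] END_LOG = "ENDLOG__".getBytes();

            private final RandomAccessFile file;
            private long appendAt;   // offset of the current END_LOG token

            SentinelLog(String path) throws Exception {
                file = new RandomAccessFile(path, "rw");
                appendAt = 0;
                file.write(END_LOG);   // a brand new, empty log is just the token
            }

            synchronized void append(byte[] record, boolean force) throws Exception {
                file.seek(appendAt);           // back up over the old END_LOG
                file.write(record);
                file.write(END_LOG);
                appendAt += record.length;     // next append starts at the new token
                if (force) {                   // only prepare/commit records need forcing
                    file.getChannel().force(false);
                }
            }
        }

    Replay then just walks records from the start until it hits the token (or corruption), treating anything without a prepare as presumed-aborted and any prepare without a commit as in-doubt.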

    \Andreas Mueller\
    Do you have fixed size log records?
    \Andreas Mueller\

    There's a fixed size header and [optional] variable sized data area.

    \Andreas Mueller\
    What is if the last record is corrupt because of a crash?
    \Andreas Mueller\

    If I hit corruption, I back up to the last-known-good record. Corruption generally signals that the server crashed in the middle of writing. Thanks to the ordering of the XA protocol and presumed abort, I generally don't have to worry about such corruption. If it was a prepare that got corrupted, the tran will get rolled back. If it was a commit that got corrupted, I'll have the prepare hanging around, and will return that from my recover() method.

    \Andreas Mueller\
    Further, XA log records live usually longer (indoubt tx). How do you maintain that?
    \Andreas Mueller\

    If I hit the end of the active log, any in-flights (and only in-flights) are copied over to the secondary log. This copying is done carefully so that if I crash anywhere in the middle of it, data is still not lost. The log size can be configured so you have a trade off between how much disk space is being used vs. how often elongated response time spikes happen due to a log switchover.

    \Andreas Mueller\
    Don't you write a before image for consumed (deleted) messages and an after image for produced messages?
    \Andreas Mueller\

    I do writes to the log for non-XA JMS activity that happens within an XA transaction. But these writes do not have to be forced.

    Exactly what I'm writing and how this particular data is stored is highly specific to my own JMS implementation, and doesn't apply to generalized XA stuff - this is a fancy way of saying that my employer considers this non-public information and will whip me with leather tongs if I go around telling people about it :-).

         -Mike
  69. What is the Bottleneck in XA?[ Go to top ]

    Exactly what I'm writing and how this particular data is stored is highly specific to my own JMS implementation, and doesn't apply to generalized XA stuff - this is a fancy way of saying that my employer considers this non-public information and will whip me with leather tongs if I go around telling people about it :-).

    >
    > -Mike

    What is it you build, and for whom? If Cameron is guilty of over promotion ( :-) I liked reading it Cameron ) then you're guilty of under promotion...
  70. What is the Bottleneck in XA?[ Go to top ]

    I'm a full time employee for a "major financial services firm" which is a jointly owned subsidiary of two other "major financial services firms" in the "New York Metropolitan area" doing "vague stuff with computers".

    By my firm's standards, the above blurb would consist of blatant over promotion :-/

    Slightly closer to reality would be to say that I specialize in writing highly distributed, high availability, high throughput middleware, mostly in Java but with a lot of C/C++ thrown in. This stuff is proprietary to the company, but gets used by many applications which in turn are used by many traders of various sorts of financial instruments. One piece of this is a JMS implementation which is a bastard child of OpenJMS, an in-house messaging system based on C/C++, a commercial messaging system, and lots of my own Java code thrown in for good measure. The company itself is, ah, not your typical financial services/brokerage/bank, but is a household name (well, our parent is).

         -Mike
  71. What is the Bottleneck in XA?[ Go to top ]

    [Mike Spille's Log Impl...]

    Thanks. For those who are interested, a deeper description and discussion of this kind of write-ahead log impl can be found in Principles of Transaction Processing. They use the same terms, so I guess Mike knows it as well... ;-)

    Concerning the use of NIO for the tx log:

    A RandomAccessFile.write(byte[]) isn't much slower than using memory-mapped files in NIO (as far as I remember from my tests, NIO was actually slower, but I would have to test that again). As you've already confirmed, the time is consumed in the sync and not in the write. A drawback with NIO is that you have to use 1.4 or later (which might be ok in a corporate-only usage like yours) and that you have to use a fixed-size file, so you have to maintain the virtual length yourself, e.g. write an EOF token (which in turn has the danger that the token is contained in the [variable data] of your log records). An advantage, maybe, is that you can use contiguous space.
  72. What is the Bottleneck in XA?[ Go to top ]

    \Andreas Mueller\
    Thanks. For those who are interested, a deeper description and discussion of this kind of write-ahead log impl can be found in Principles of Transaction Processing. They use the same terms, so I guess Mike knows it as well... ;-)
    \Andreas Mueller\

    I'm somewhat familiar with it. Actually, most of my transaction processing knowledge and implementation techniques come from hunting down information via Google. There's a wealth of research information out there from universities, deep pocket research organizations like IBM, etc. In particular, Ramesh Kumar Gupta's master's thesis "Commit Processing in Distributed On-Line and Real-Time Transaction Processing Systems" from '97 is an excellent overview of techniques like 2PC and Presumed Abort (I haven't been able to find this recently via Google...). Plus IBM's various R* research papers, where 2PC and its variants really began.

    \Andreas Mueller\
    Concerning the use of NIO for the tx log:

    A RandomAccessFile.write(byte[]) isn't much slower than using memory-mapped files in NIO (as far as I remember from my tests, NIO was actually slower, but I would have to test that again).
    \Andreas Mueller\

    Oh, yuck, I don't use memory-mapped files for my I/O - sorry if I somehow implied that. They're definitely slow, and they eat memory like crazy w/ no way to forcibly get rid of it. No, I use FileChannel and direct buffers for my transaction log I/O - this is quick, and the NIO buffer concept fits very well w/ how I organized my transaction log I/O logic.
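
    For anyone who hasn't played with that corner of NIO, the write path looks roughly like this - a generic 1.4-era sketch, not my actual code:

        import java.io.IOException;
        import java.io.RandomAccessFile;
        import java.nio.ByteBuffer;
        import java.nio.channels.FileChannel;

        final class ForcedAppend {
            private final FileChannel channel;
            // One reusable direct buffer; direct buffers avoid an extra JVM-to-native copy.
            private final ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);

            ForcedAppend(String path) throws IOException {
                channel = new RandomAccessFile(path, "rw").getChannel();
            }

            // Append one record and force it; returns only once the bytes are on disk.
            synchronized void appendAndForce(byte[] record) throws IOException {
                buf.clear();
                buf.put(record);        // assumes record.length <= buffer capacity
                buf.flip();
                while (buf.hasRemaining()) {
                    channel.write(buf);
                }
                channel.force(false);   // the disk force - this is where the time goes
            }
        }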

    \Andreas Mueller\
    A drawback with NIO is that you have to use 1.4 or later (which might be ok in a corporate-only usage like yours) and that you have to use a fixed-size file, so you have to maintain the virtual length yourself, e.g. write an EOF token (which in turn has the danger that the token is contained in the [variable data] of your log records). An advantage, maybe, is that you can use contiguous space.
    \Andreas Mueller\

    Yes, since I'm corporate I was able to mandate Java 1.4 on the server side, which has saved me enormous amounts of grief. Sadly, I have to support Java 1.1 on the client side (1 user running HP-UX 10 somewhere, and I'm sunk).

    I haven't found writing an EOF token to be dangerous. All you have to do is design your log format so that corruption is easily detected, the idea being that corruption implies server failure, which gives you an implicit EOF w/in the transaction log. You pay a startup price in having to read through the logs, but this is heavily outweighed by the high speed achieved by a write-forward-only strategy. I'm willing to pay a small startup cost and a bit more code complexity to get the benefits of write-forward w/out seeking. In any case, _any_ transaction log system has to deal with possible corruption intelligently.
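
    One common way to get that "corruption equals implicit EOF" behavior is to frame every record with a length and a checksum - purely illustrative, this isn't my actual format:

        import java.nio.ByteBuffer;
        import java.util.zip.CRC32;

        final class RecordFrame {
            // Layout: [int length][long crc32][payload]. A torn write fails the CRC
            // check and the reader simply treats that point as end-of-log.
            static ByteBuffer frame(byte[] payload) {
                CRC32 crc = new CRC32();
                crc.update(payload);
                ByteBuffer out = ByteBuffer.allocate(4 + 8 + payload.length);
                out.putInt(payload.length);
                out.putLong(crc.getValue());
                out.put(payload);
                out.flip();
                return out;
            }

            // Returns the payload, or null if this record is torn/corrupt (treat as EOF).
            static byte[] read(ByteBuffer log) {
                if (log.remaining() < 12) return null;
                int len = log.getInt();
                long storedCrc = log.getLong();
                if (len < 0 || len > log.remaining()) return null;
                byte[] payload = new byte[len];
                log.get(payload);
                CRC32 crc = new CRC32();
                crc.update(payload);
                return crc.getValue() == storedCrc ? payload : null;
            }
        }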

         -Mike
  73. Gupta Paper....[ Go to top ]

    Ah, here we go:

        http://citeseer.nj.nec.com/gupta97commit.html

    Click on "PDF" under "View or download" in the upper right hand corner of the page. Of particular interest to this discussion are pp 22-40, which cover distributed transactions, 2PC, etc. It's a very nice abstracted view of 2PC without getting bogged down into the APIs of X/Open or J2EE.

        -Mike
  74. What is the Bottleneck in XA?[ Go to top ]

    \Mike Spille\
    Oh, yuck, I don't use memory-mapped files for my I/O - sorry if I somehow implied that. They're definitely slow, and they eat memory like crazy w/ no way to forcibly get rid of it. No, I use FileChannel and direct buffers for my transaction log I/O - this is quick, and the NIO buffer concept fits very well w/ how I organized my transaction log I/O logic.
    \Mike Spille\

    Ok. But then you don't need fixed length files either.

    > ...You pay a startup price in having to read through the logs, but this is heavily outweighed by the high speed achieved by a write-forward-only strategy. I'm willing to pay a small startup cost and a bit more code complexity to get the benefits of write-forward w/out seeking. In any case, _any_ transaction log system has to deal with possible corruption intelligently.
    >

    Sure. You could write a checkpoint (sync the tx log with your actual data) on a regular shutdown and then, on startup, you only need to go through the recovery procedure if the size of the tx log is > 0.

    Anyway...
  75. Transaction Log Implementations[ Go to top ]

    \Andreas Mueller\
    Ok. But then you don't need fixed length files either.
    \Andreas Mueller\

    Whether you go fixed length or not really depends on the nature of what you're storing in the logs. For my own usage, and for something like a Transaction Manager, fixed length is probably the best overall, because you can constrain the maximum record size (variable length does not necessarily mean unbounded). For something like, say, an RDBMS transaction log a variable length log w/ checkpointing may make more sense.

    \Andreas Mueller\
    Sure. You could write a checkpoint (sync the tx log with your actual data) on a regular shutdown and then, on startup, you only need to go through the recovery procedure if the size of the tx log is > 0
    \Andreas Mueller\

    The regular shutdown case is uninteresting, at least to me. What's interesting to me is:

       - Keeping data streaming to the logs at the best rate I can (possibly with throttling)
       - Upholding the XA protocol guarantees - a positive return from prepare() and commit() should provide absolute durability
       - Correctly dealing with an abnormal server process shutdown, potentially right when I'm in the middle of a write
       - Minimizing spikes in response times.

    When you want absolute correctness, the best possible speed, and to avoid certain threads paying big response-time hits from spikes (from something like a checkpoint), _and_ to be able to deal with the server getting blown out of the air at any possible time, then your solutions get somewhat constrained. The one advantage I have is the log record size constraint, which I share with J2EE TMs (at least conceptually - a TM won't necessarily write only the minimum amount of data it needs in its XA transaction logs :-/).

    In particular, checkpoints don't work well for me, because they often have unacceptable spikes while they're checkpointing. At the same time, checkpoints tend to violate XA guarantees, plus I gotta worry about what happens if I fail in the middle of checkpointing (argh!).

    Admittedly the problem is much more difficult if you're dealing with an RDBMS-style transaction log, and you may be forced to adopt a checkpoint system which can lose some data which was supposedly "durable" - which is why I'm thankful I don't have to write one as part of my job.

        -Mike
  76. Transaction Log Implementations[ Go to top ]

    \Mike Spille\
    In particular, checkpoints don't work well for me, because they often have unacceptable spikes while they're checkpointing.
    \Mike Spille\

    So which spikes? Do you mean this single disk sync?

    > checkpoints tend to violate XA guarantees,

    Huh?

    > plus I gotta worry about what happens if I fail in the middle of checkpointing (argh!).
    >

    At least you shouldn't worry anymore when you have finished your impl...
  77. java.io vs. NIO[ Go to top ]

    Andreas: A RandomAccessFile.write(byte[]) isn't much slower than using memory-mapped files in NIO (as far as I remember from my tests, NIO was actually slower, but I would have to test that again).

    I've seen the same thing (NIO has a slight overhead compared to the original java.io methods.)

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  78. What is the Bottleneck in XA?[ Go to top ]

    [..] the engine sucks a lot of gas and responds very slowly (if it is recoverable) because it has to ensure that certain steps in the protocol are logged in a recoverable manner (i.e. forced to disk.)

    >
    >

    Just to underscore Mark's point: if the engine isn't recoverable, it's not doing transactions.
  79. <Cameron>
    I'm not an expert on XA, but I did author an entire OTS/XA+ TM with single- and 2PC commit options (including full nested transaction support as provided for by OTS) and optimizations for resource ordering particularly for expected read-only resources etc.
    </Cameron>

    :-) I think that qualifies you as an expert...
  80. When there is only one resource, that resource can shoulder the transactional requirements, and the application server needs only indicate boundaries, and may even do so using single-PC processing (i.e. begin followed by either commit or rollback.)

    >

    Yes, but that's just typical one-phase commit optimisation. You'll find that outside of XA too. So were you comparing 2PC versus 1PC? If so, that's not really a fair comparison.

    > In other words, the application server doesn't need an "engine" if it's not using XA -- all it needs is a gas pedal (commit) and a brake (rollback). Sorry for the crappy analogy ;-)
    >
    > When more than one resource is employed within a transaction, and the resources need to have their states coordinated, then XA can be used. Now the application server does need its own "engine" to use your terms, and the engine sucks a lot of gas and responds very slowly (if it is recoverable) because it has to ensure that certain steps in the protocol are logged in a recoverable manner (i.e. forced to disk.)
    >
    > If the prepares and commits are both forced, then the overhead of 2PC XA to single-PC non-XA is at least 4:1, and in some implementations is 6:1 (intent, resource log, result, command, resource log, result.)

    Again, it seems like you're comparing one-phase commit optimisation with a multi-phase (2 in this case) protocol. Fairly obviously the latter is going to perform slower than the former, even if no disk I/O is needed - the message cost will add an overhead anyway. Why didn't you use the one-phase commit optimisation in XA? I'm just interested in seeing where your overhead was really coming from - and crappy implementations of XA are definitely one possibility, but I'd like to separate implementation from protocol.

    > In both cases, I'm assuming that the resource will always force its log, just to be fair (b/c if it doesn't force its log, then all other recoverability bets are off.) And those ratios are for the efficient implementations that batch and parallelize their resource operations -- it could be much worse for 2PC XA if resources are not invoked in parallel and if logs are not segment-batched. Add the additional network traffic and associated latencies, and it's quite easy to get close to a 10:1 performance difference. However, I think that a good chunk of the latency is just due to poorly written resource drivers and even more poorly tuned XA implementations.
    >

    Quite possibly true. I've come across some dog-poor implementations from some of the "heavy hitters" in the industry. My questions, however, are more protocol related at this point.

    > I'm not an expert on XA, but I did author an entire OTS/XA+ TM with single- and 2PC commit options (including full nested transaction support as provided for by OTS) and optimizations for resource ordering particularly for expected read-only resources etc.
    >

    Haven't we all?

    All the best,

    Mark.

    > Peace,
    >
    > Cameron Purdy
    > Tangosol, Inc.
    > Coherence: Clustered JCache for Grid Computing!
  81. I've never seen a case where it didn't adversely impact performance in a big way.


    So be it. Is there any better alternative implementation of 2PC than XA (or, for that matter, an alternative concept to 2PC)?

    Regards,
    Horia
  82. I've never seen a case where it didn't adversely impact performance in a big way.

    >
    > So be it. Is there any better alternative implementation of 2PC than XA (or, for that matter, an alternative concept to 2PC)?

    Microsoft has a proprietary implementation of 2PC called OLE Transactions (OLE TX) that is supported by SQL Server. I am not sure how many other RMs support that, but in case an RM wants to enlist in a transaction initiated by an OLE TX RM, there is an OLE TX=>XA mapper built into the Distributed Transaction Coordinator (DTC).
  83. The Insidious Evil of Heuristics[ Go to top ]

    \Horia Muntean\
    Ok, but this can raise another issue: what to do in "cannot commit / cannot rollback / heuristic rollback" scenarios? Stop the system? Initiate an automatic online/offline recovery? Recovery with compensation tx?
    \Horia Muntean\

    The short answer: you're screwed.

    The medium answer: do everything you can to avoid this. If you can't, you're screwed.

    The long answer: If one or more resources can't commit or rollback, this _should_ be due to a catastrophic error in that resource. The best you can do in this case is throw the Tx into a background thread, and keep retrying it on a regular basis. Alternatively, throw it up on a monitoring screen, and let an operator hit a retry button on a regular basis :-) What a TM should do in these scenarios is make the decision correctly (rollback or commit), and propagate this decision out to the XAResources. Even if one (or more - saints preserve us!) resource is reporting failure, you complete the remaining resources as your initial decision indicated, and retry on the failed one(s) until they work.
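
    In code, the background retry is about as dumb as it sounds - a sketch, assuming the commit decision has already been disk-forced, and glossing over heuristic outcomes:

        import javax.transaction.xa.XAException;
        import javax.transaction.xa.XAResource;
        import javax.transaction.xa.Xid;

        final class CommitRetrier implements Runnable {
            private final XAResource failedResource;
            private final Xid xid;

            CommitRetrier(XAResource failedResource, Xid xid) {
                this.failedResource = failedResource;
                this.xid = xid;
            }

            // The decision was made and logged; all we can do is keep pushing it at the
            // broken resource until it sticks (or until an operator steps in).
            public void run() {
                while (true) {
                    try {
                        failedResource.commit(xid, false);
                        return;                                   // finally went through
                    } catch (XAException e) {
                        // XAER_RMFAIL and friends: resource still down, try again later.
                        // Heuristic codes would need the red lights and sirens instead.
                        try {
                            Thread.sleep(30000);
                        } catch (InterruptedException ie) {
                            return;
                        }
                    }
                }
            }
        }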

    For heuristic decisions: the TM is often royally screwed. If you're lucky, the heuristic decision may match what the TM wants to do. In that case, no problem. If the heuristic decision does _not_ match the TM, you are screwed. The TM may have already committed a Resource, and some joker in the deck has done a heuristic commit. There really isn't anything the TM can do here beyond flashing big red lights and wailing a siren to alert the system operators that the system has gone into an inconsistent state.

    As an aside, I'd like to track down the individual who coined the phrase "heuristic decision". It's an awful techno-elite comp-sci snobbery phrase that confuses non-2PC experts to no end, and it's terrible that the standards keep propagating this terminology.

    For those not in the know, a "heuristic decision" means some joker did something to an XAResource outside of the TM. Usually, someone did a commit() or rollback() directly on the resource w/out the TM being involved, and this is called a Heuristic Commit or a Heuristic Rollback. "Heuristic" in this context most often means that someone tried a manual recovery, and "Heuristic" really means "I guessed". You'd never know it from the standards documents, but a heuristic commit usually means some joker logged into the RDBMS (or whatever), and typed "commit XYZ". A heuristic rollback usually means some joker logged into the RDBMS (or whatever), and typed "rollback XYZ". The standards docs like the X/Open spec make heuristic decisions sound like some neural net processor analyzed some complex transaction flow and made some sort of super-genius optimization and pre-empted the TM - but in the cases I've seen, heuristic-X means a human directed the resource to rollback or commit directly.

    I don't know if there are any systems out there where heuristic decisions are made unilaterally by the resource itself automatically. I certainly hope there aren't, but one never knows...

         -Mike
  84. The Insidious Evil of Heuristics[ Go to top ]

    Heuristic intervention is a pragmatic mechanism that allows systems to free resources. As Mark pointed out, all of these termination protocols that aim to help provide ACID state changes are blocking. Some systems have to be available, e.g. medical databases, stock exchanges, etc. This is one of those religious subjects that can quickly polarize a room of transaction architects; I'll just point out that it's not a bad thing to have a worst-case scenario mechanism to bail on the transaction. But you have to bear in mind that eventually things need to be made right. That can be the tricky part.

    BTW, you may have noticed that the TX and/or XA specs do talk about resources autonomously making heuristic decisions in response to a timeout condition. It's a solution that may be required for some kinds of systems. Your typical case, though, involves manual intervention and off-line techniques -- basically what you described.

    Transactions are one of the things that J2EE (especially EJB) helps with quite a bit, but it seems to me that the ease with which they are useable obscures the need for developers and architects to really understand what the heck is going on under the covers. At least that's been my experience.

    Greg
  85. The Insidious Evil of Heuristics[ Go to top ]

    \Greg Pavlik\
    Heuristic intervention is a pragmatic mechanism that allows systems to free resources. As Mark pointed out, all of these termination protocols that aim to help provide ACID state changes are blocking. Some systems have to be available, e.g. medical databases, stock exchanges, etc. This is one of those religious subjects that can quickly polarize a room of transaction architects; I'll just point out that it's not a bad thing to have a worst-case scenario mechanism to bail on the transaction. But you have to bear in mind that eventually things need to be made right. That can be the tricky part.
    \Greg Pavlik\

    Yes, very true, and thanks for injecting some real scenarios into the discussion to clarify. I understand there are reasons for heuristics, but that doesn't mean I have to like them :-/ Also, discussion of them is rather vague and mysterious in the X/Open standards, and J2EE _greatly_ worsens this situation by hardly mentioning them at all - you are often forced to go back to X/Open and interpret what the X/Open spec means in a JTA context, which can sometimes be a difficult exercise. Perhaps if heuristics were explained a bit more clearly in the specs they could be (cautiously!) used in a more correct manner. On issues like this, I often yearn for more specs to have the equivalent of old ANSI C's rationale - whether it was official or not, the rationale gave some desperately needed insight into the "why's" of the spec instead of just giving people the "how" and letting them burn neurons trying to infer the reasoning behind them.

    \Greg Pavlik\
    BTW, you may have noticed that the TX and/or XA specs do talk about resources autonomously making heuristic decisions in response to a timeout condition. It's a solution that may be required for some kinds of systems. Your typical case, though, involves manual intervention and off-line techniques -- basically what you described.
    \Greg Pavlik\

    I have no problem with a heuristic rollback due to a timeout at prepare() time. This is entirely proper to make sure that a slow or stuck TM doesn't unnecessarily keep lots of locked resources in something like an RDBMS. I have trouble envisioning it happening automatically after prepare() has been called though - any heuristic popping out of a commit() call is invariably going to give the TM fits.

    \Greg Pavlik\
    Transactions are one of the things that J2EE (especially EJB) helps with quite a bit, but it seems to me that the ease with which they are useable obscures the need for developers and architects to really understand what the heck is going on under the covers. At least that's been my experience.
    \Greg Pavlik\

    I agree completely on both points. That's one of the reasons for me writing the article, to try to make understanding of XA more approachable to application developers. The underlying specs J2EE uses are just too dense and impenetrable for many developers, and many explanations I've seen on the net seem to gloss over too many important details.

        -Mike
  86. The Insidious Evil of Heuristics[ Go to top ]

    Well, bear in mind that JTA was written a long time ago when J2EE was just an idea and the JCP didn't exist. It's a mapping of the XA and TX APIs into Java, and it really can't be understood without those specs.

    Greg
  87. The Insidious Evil of Heuristics[ Go to top ]

    Yes, very true, and thanks for injecting some real scenarios into the discussion to clarify. I understand there are reasons for heuristics, but that doesn't mean I have to like them :-/ Also, discussion of them is rather vague and mysterious in the X/Open standards, and J2EE _greatly_ worsens this situation by hardly mentioning them at all - you are often forced to go back to X/Open and interpret what the X/Open spec means in a JTA context, which can sometimes be a difficult exercise. Perhaps if heuristics were explained a bit more clearly in the specs they could be (cautiously!) used in a more correct manner. On issues like this, I often yearn for more specs to have the equivalent of old ANSI C's rationale - whether it was official or not, the rationale gave some desperately needed insight into the "why's" of the spec instead of just giving people the "how" and letting them burn neurons trying to infer the reasoning behind them.

    >

    I agree that the JTA spec. is a mess. It doesn't say enough about important things and assumes too much knowledge on the part of the reader. A tip though: if in doubt and you don't have the money to shell out for the X/Open specs., get the OTS specification (from the OMG) and check through that. The OTS isn't free of errors, but it is a lot better written and assumes a lot less of the reader. It's also free! And because JTA is supposed to be able to sit on top of OTS, it helps to resolve some of the ambiguous text.

    Mark.
  88. \Mark Little\
    It depends on the size of the log, which depends on lots of factors such as what's saved in the log (before/after images?, do/undo operations? participant references? etc.) and the size of the stable-storage block.

    There's also the effect of whether presumed-abort/presumed-nothing/presumed-commit is used by the transaction service to consider.
    \Mark Little\

    Whoa there, you're mixing in a lot of what-ifs that aren't applicable to the specific case of XA in J2EE. First, "size, before/after images, do/undo operations, participant references...size of stable-storage" are all RDBMS checkpoint-like things, which have nothing to do with XA - you pay those costs one way or another.

    For XA, the added extra is durably recording the Xid info at various stages - this is a wholly separate problem from RDBMS logs. Here are some examples:

      - At the TM side, all the TM needs to durably record is the global tran ID in a "Committing..." record. That's it. You can do this with two fixed-size logs, with say 500 bytes per "Committing..." record. You fill up one log, then "flip" to the other when your current one is full. Here, log size and everything else doesn't matter - all that matters really is how fast your disk force is, and whether or not you batch disk force requests. (There's a small sketch of this right after the list below.)

      - JMS is equally easy. Fundamentally, you need to store published messages somewhere, then disk force prepare() and commit() records. You can use the 2 fixed sized log approach w/ flipping here just as with the TM case.

      - RDBMS' are more complicated :-)
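
    As promised above, here's a bare-bones sketch of that TM-side "Committing..." force - invented names, no flip logic shown, and obviously not any particular vendor's implementation:

        import java.io.IOException;
        import java.io.RandomAccessFile;
        import java.nio.ByteBuffer;
        import java.nio.channels.FileChannel;

        final class CommitLog {
            private static final int RECORD_SIZE = 500;        // fixed-size "Committing..." slot
            private final FileChannel channel;
            private long position = 0;

            CommitLog(String path) throws IOException {
                channel = new RandomAccessFile(path, "rw").getChannel();
            }

            // Force a "Committing <gtrid>" record; only after this returns may the TM
            // start calling commit() on the enlisted XAResources.
            synchronized void logCommitting(byte[] globalTranId) throws IOException {
                ByteBuffer rec = ByteBuffer.allocate(RECORD_SIZE);
                rec.putInt(globalTranId.length);                // assumes the gtrid fits in the slot
                rec.put(globalTranId);                          // rest of the slot stays zeroed
                rec.rewind();
                channel.write(rec, position);
                channel.force(false);                           // the single TM disk force per 2PC tran
                position += RECORD_SIZE;
                // when position reaches the end of this file, "flip" to the other log
            }
        }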

    On the presumed this, that, or the other, this is already decided for you: J2EE is presumed abort.

    \Mark Little\
    If the TM doesn't do logging then it can't be said to be conforming to the ACID properties. The Atomicity part of that means that you get all or nothing semantics irrespective of failures (ignoring catastrophic disk failures, but you could always replicate to improve failure resiliency anyway). [...]
    \Mark Little\

    Exactly right.

        -Mike
  89. \Mark Little\

    > It depends on the size of the log, which depends on lots of factors such as what's saved in the log (before/after images?, do/undo operations? participant references? etc.) and the size of the stable-storage block.
    >
    > There's also the effect of whether presumed-abort/presumed-nothing/presumed-commit is used by the transaction service to consider.
    > \Mark Little\
    >
    > Whoa there, you're mixing in alot of what-if's that aren't applicable to the specific case of XA J2EE. First, "size, before/after images, do/undo operations, participant references...size of stable-storage" are all RDBMS checkpoint-like things, which have nothing to do with XA - you pay those costs one way or another.
    >

    Not true. XA does not define how the coordinator is implemented - it's a protocol specification after all. All of these things are valid implementations of two-phase commit transactions which date back decades. Most of the pedigree transaction system implementations pre-date XA by many years.

    It is true that *some* of these things are RDBMS specific (let's call them participants to get away from XA specifics here). But some of them aren't.

    > For XA, the added-extra is durably recording the Xid info at various stages - this is a wholly seperate problem from RDBMS logs. Here are some examples:
    >
    >   - At the TM side, all the TM needs to durably record is the global tran ID in a "Committing..." record. That's it. You can do this with 2 fixed sized logs, with say 500 bytes per "Committing..." record. You fill up one log, then "flip" to the other when your current one is full. Here, log size and everything else doesn't matter - all that matters really is how fast your disk force is, and whether or not you batch disk force requests.
    >

    This won't help you in a distributed XA setting and only works if you only support bottom-up (participant-to-coordinator) recovery. It's true that the minimum information you need to save at the coordinator is the tranId and the outcome (which is null if you're aborting anyway), but if you want distributed participants to be able to locate the remote coordinator then they need to record more information than they typically get via the XA calls into their logs (which some implementations do) *or* you support top-down (coordinator-to-participant) recovery, where the coordinator maintains participant information in order to enact recovery. All of this is in line with what I originally said though - it does depend on what you want to save in the log and that depends on what your deployment scenarios are. Limit them and you obviously limit what needs to be saved.

    >   - JMS is equally easy. Fundamentally, you need to store published messages somewhere, then disk force prepare() and commit() records. You can use the 2 fixed sized log approach w/ flipping here just as with the TM case.
    >
    >   - RDBMS' are more complicated :-)

    Yes, especially in a distributed environment where you support interposition!

    >
    > On the presumed this, that, or the other, this is already decided for you: J2EE is presumed abort.
    >

    Agreed. That was a general TP comment.

    > \Mark Little\
    > If the TM doesn't do logging then it can't be said to be conforming to the ACID properties. The Atomicity part of that means that you get all or nothing semantics irrespective of failures (ignoring catastrophic disk failures, but you could always replicate to improve failure resiliency anyway). [...]
    > \Mark Little\
    >
    > Exactly right.

    Mark.
  90. \Horia Muntean\
    "Disk forcing" occurs at resource managers that are tx aware anyway. Are you sure that one more "disk forcing" at TM is such a performance killer?
    \Horia Muntean\

    Look at a plain old RDBMS transaction (PORT?) vs. an XA transaction with two resources.

    In the PORT, we can get away with 1 disk force (or maybe a fancy checkpointing system if that's your thing).

    In the XA example scenario we have:

       - 2 disk forces at prepare(), that's 1 per XAResource
       - 1 disk force by TM to signal "Committing..."
       - 2 disk forces at commit(), that's 1 per XAResource

    That's 5 disk forces vs. 1 disk force, and disk forces tend to be expensive. It may turn out that you're doing big SQL work somewhere that takes 500 milliseconds or something, in which case the disk forcing will be lost as noise by the actual transactional work (who cares about 10 or 20 milliseconds for forcing when you've got a 500 milli query?).

    But if you're doing small queries, the disk forces can stand out as the _majority_ of your transaction time. I've seen many systems doing XA which do a quick select, an update or two, and a few JMS publishes. Overall time for the "work" here might be only 50 milliseconds, whereas the disk forcing might eat 100 milliseconds.

    \Horia Muntean\
    One could use a TM that doesn't do tx log or automatic recovery (JBOSS or JOTM). Manually resolving in doubt tx over resource managers could be done with no hassle considering real scenario crashes.
    \Horia Muntean\

    In systems with low loads, where there's almost never a transaction in flight (relatively speaking), you're right. You might at most have to manually recover a transaction or two.

    If you're talking about a high-volume system, manual recovery can be a nightmare. Getting the recovery right, so all resources stay in sync (either all rolled back or all committed, per transaction in flight), can be devilishly difficult, and it's easy to screw it up and force your system into an inconsistent state. Plus manually figuring it all out might be very time-consuming - and while you're manually figuring it out, one of the in-doubts in your RDBMS may just be holding a lock and blocking other transactions from happening in the system.

    Some people say this is a judgement call - how much safety do you want vs. how much of a performance hit. Personally, I take a rather hard line here. I say: if you go XA, go for full safety. If you don't want full safety, don't use XA. Partial XA solutions, like using a TM with no log or recovery facilities, are IMHO timebombs waiting to happen. All you need is for an operator to accidentally kill your JBoss instance (oh shit, he said "kill -9 12331", not "kill -9 12113"!), and your system state can be screwed up massively with no automated way to fix it.

    The way I see it, if you think you don't need a transaction log, then just abandon XA altogether. Better to not have the security, and _know_ that, than kid yourself that you're kinda safe. Because kinda-XA systems really are time bombs. Kill off that JBoss process, and you can end up blocking up a big chunk of your RDBMS with just one transaction stuck in prepare(). If you're not doing XA, this can never happen, 'cuz there is no such thing as an in-doubt transaction if you're not doing XA.

        -Mike
  91. <Mike Spille>
    Look at a plain old RDBMS transaction (PORT?) vs. an XA transaction with two resources.

    In the PORT, we can get away with 1 disk force (or maybe a fancy checkpointing system if that's your thing).

    In the XA example scenario we have:
     
        - 2 disk forces at prepare(), that's 1 per XAResource
        - 1 disk force by TM to signal "Committing..."
        - 2 disk forces at commit(), that's 1 per XAResource
     
    That's 5 disk forces vs. 1 disk force, and disk forces tend to be expensive.
    </Mike Spille>

    You are not fair here. :)

    Why not consider XA with 2 resources and non-XA with 2 resources also. That yields a 5:2 disk force ratio.

    Or (non sense here, but still) 1 XA resource with 1 non-XA resource. This makes 3:1.

    And as a side note, XA has nothing to do with J2EE. I mean, the pitfalls of XA were not introduced by J2EE. I am saying this 'cause I have a vague impression that you kind of blame J2EE for those "disk forces".

    Regards,
    Horia
  92. <Mike Spille>

    > Look at a plain old RDBMS transaction (PORT?) vs. an XA transaction with two resources.
    >
    > In the PORT, we can get away with 1 disk force (or maybe a fancy checkpointing system if that's your thing).
    >
    > In the XA example scenario we have:
    >
    > - 2 disk forces at prepare(), that's 1 per XAResource
    > - 1 disk force by TM to signal "Committing..."
    > - 2 disk forces at commit(), that's 1 per XAResource
    >
    > That's 5 disk forces vs. 1 disk force, and disk forces tend to be expensive.
    > </Mike Spille>
    >
    > You are not fair here. :)
    >
    > Why not consider XA with 2 resources and non-XA with 2 resources also. That yields a 5:2 disk force ratio.
    >
    > Or (non sense here, but still) 1 XA resource with 1 non-XA resource. This makes 3:1.
    >
    > And as a side note, XA has nothing to do with J2EE. I mean, the pitfalls of XA were not introduced by J2EE. I am saying this 'cause I have a vague impression that you kind of blame J2EE for those "disk forces".
    >
    > Regards,
    > Horia

    And more on the ratio calculus: (2PC-08/2PC-09) and (2PC-11/2PC-12) could be done almost simultaneously using async calls (I am not aware of any specific implementation, it's just an opinion), so we can have a 3:2 ratio.

    Regards,
    Horia
  93. \Horia Muntean\
    You are not fair here. :)

    Why not consider XA with 2 resources and non-XA with 2 resources also. That yields a 5:2 disk force ratio.

    Or (non sense here, but still) 1 XA resource with 1 non-XA resource. This makes 3:1.
    \Horia Muntean\

    Trying to be fair seems to always get me into trouble in the end. A colleague of mine Charlie is fond of saying "No good deed goes unpunished", and sadly that seems to be true.

    Back to XA...yes, strictly speaking you'll see something like this for XA vs. non:

       1 Resource XA: 1 disk force (the 1PC optimization)
       1 Resource Non: 1 disk force
       2 Resources XA: 5 disk forces
       2 Resources Non: 2 disk forces
       3 Resources XA: 7 disk forces
       3 Resources Non: 3 disk forces

    etc. etc.

    Incidentally, "1 XA Resource" isn't nonsense. Here's some scenarios:

       - The TM may not know up front how many resources are going to do work, and I don't believe there's any way to "upgrade" a transaction from local to XA. So the TM may start out as XA, and then use the 1PC optimization if it turns out only 1 resource does any work.

       - A given resource may end up not actually changing anything in a transaction. For example, a stateless session bean may do a JMS publish all the time, and then either do a select and update, or just a select, depending on the incoming data. So we've got two XAResources, one for the JMS side and one for the RDBMS. The JMS guy may always publish, but the RDBMS may change the RDBMS state sometimes and not change it other times. In the former case, we have normal 2PC. In the latter case, the RDBMS XAResource can be clever and notice that his transactional work didn't actually make any changes. In this situation the XAResource can respond "READ_ONLY" to a prepare call. If a resource does this, the TM can drop that resource out of the 2PC chain altogether (e.g. not call commit on it). In this scenario, if the RDBMS responds READ_ONLY and is hit first in the 2PC chain, the TM could legally jump to the JMS XAResource commit with the 1PC optimization.

    On the 1PC optimization: if it turns out one way or another that you have only 1 XAResource after all the normal transaction work is done, the TM can take advantage of the 1PC optimization. What happens here is that the TM skips the prepare() call altogether, and calls commit() directly on the 1 XAResource. If you look at the signature for XAResource.commit(), it's:

        public void commit(Xid xid, boolean onePhase)

    That onePhase boolean is there for the 1PC optimization. If there's only 1 resource, the TM skips prepare, and goes right to commit() with onePhase set to true. In this instance, there's only 1 disk force, because there's no prepare() calls, and the TM doesn't need to disk force anything itself either. You can think of this as the TM down-grading the transaction to a local tran.
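
    Putting the READ_ONLY vote and the 1PC optimization together, the happy path of a coordinator might look roughly like this - a simplified sketch (one Xid shared across branches, no rollback votes, no heuristics, no logging details):

        import java.util.ArrayList;
        import java.util.List;
        import javax.transaction.xa.XAException;
        import javax.transaction.xa.XAResource;
        import javax.transaction.xa.Xid;

        final class SimpleCoordinator {
            // resources: one XAResource per enlisted branch
            void commit(List<XAResource> resources, Xid xid) throws XAException {
                if (resources.size() == 1) {
                    // 1PC optimization: no prepare, no TM log record, one disk force total
                    resources.get(0).commit(xid, true);
                    return;
                }
                List<XAResource> toCommit = new ArrayList<XAResource>();
                for (XAResource r : resources) {
                    int vote = r.prepare(xid);                 // each prepare is a disk force at the RM
                    if (vote != XAResource.XA_RDONLY) {
                        toCommit.add(r);                       // RDONLY voters drop out of phase 2
                    }
                }
                forceCommittingRecord(xid);                    // the TM's own disk force
                for (XAResource r : toCommit) {
                    r.commit(xid, false);                      // phase 2: another force per RM
                }
            }

            private void forceCommittingRecord(Xid xid) { /* write + force the TM log, as above */ }
        }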

    \Horia Muntean\
    And as a side note, XA has nothing to do with J2EE. I mean, the pifalls of XA were not intoduced by J2EE. I am saying this 'cause a have a vague impression that you kind of blame J2EE for those "disk forces".
    \Horia Muntean\

    I completely agree, and I'm sorry if I gave the impression that I was targeting just J2EE for this behavior. It is a structural requirement of any 2PC system, J2EE or not.

    I do blame J2EE for some other wrongs it did in the 2PC landscape, though. In particular, J2EE dropped explicit async XAResources (argh!), and dropped the ability for RM's to query into TM's.

    The former decision makes it harder for the TM to try to amortize the cost of 2PC disk forces by hitting XAResources in parallel - you can do it, but you have to spin up extra threads to do so. In the original X/Open spec you could hit XAResources in parallel directly from one thread, and reap the responses from that same thread.

    The latter decision may make some sense, since structurally J2EE and Java are quite different from X/Open's C specification. But still, by not giving a way for RMs to query into TMs, XA is a bit more fragile in that RMs never know what's going on beyond what the TM tells them. (Oh, "RM" here basically means an XAResource).

       -Mike
  94. <Mike Spille>
    Trying to be fair seems to always get me into trouble in the end. A colleague of mine Charlie is fond of saying "No good deed goes unpunished", and sadly that seems to be true.
    </Mike Spille>

    I was not debating the content of your article. It's a good one: accurate and useful. I was trying to get some numbers about the real performance drain XA can cause.

    <Mike Spille>
    I do blame J2EE for some other wrongs it did in the 2PC landscape, though. In particular, J2EE dropped explicit async XAResources (argh!), and dropped the ability for RM's to query into TM's.
     
    The former decision makes it harder for the TM to try to amortize the cost of 2PC disk forces by hitting XAResources in sync - you can do it, but you have to spin up extra threads to do so. In the original X/Open spec you could hit XAResource in parallel directly from one thread, and reap the responses from that same thread.
     
    The latter decision may make some sense, since structurally J2EE and Java are quite different from X/Open's C specification. But still, by not giving a way for RM's to query into TM's, XA is a bit more fragile in that RM's never know what's going on that the TM doesn't tell it. (Oh, "RM" here basically means an XAResource).
    </Mike Spille>

    Really? Can you tell me exactly which spec this is?

    Regards,
    Horia
  95. X/Open specification[ Go to top ]

    http://www.opengroup.org/products/publications/catalog/c193.htm

    " Distributed TP: The XA Specification

    In recognition of the growing requirement for Distributed Transaction Processing (DTP), The Open Group defined a model for DTP. This model envisions three software components in a DTP system and this specification defines the interface between two of them, the transaction manager and local resource manager. An XA+ Specification, Version 2 (S423) is a superset of XA which includes the communications resource manager. We recommend that both documents be examined and compared."

    It's a really fun, light read for when you find yourself with some free time :-)

        -Mike
  96. X/Open specification[ Go to top ]

    <Mike Spille>
    I do blame J2EE for some other wrongs it did in the 2PC landscape, though. In particular, J2EE dropped explicit async XAResources (argh!), and dropped the ability for RM's to query into TM's.
    </Mike Spille>

    This is the paragraph I was referring to. Where in J2EE specs does it state that a TM can't communicate async with RMs?

    Regards,
    Horia
  97. X/Open specification[ Go to top ]

    \Horia Muntean\
    Where in J2EE specs does it state that a TM can't communicate async with RMs?
    \Horia Muntean\

    It's not that J2EE explicitly bans async communication, it's more that J2EE doesn't facilitate it.

    In the X/Open spec, all the major functions have a flag, TMASYNC, which indicates that the resource should perform this operation asynchronously. The TM then uses xa_complete() to reap the results.

    The JTA spec has the following to say on the omission of explicit async support:

    "Asynchronous operations are not supported. Java supports multi-threaded
    processing and most databases do not support asynchronous operations"

    The above doesn't make a lot of sense to me, but the end result is that in practice all a TM can do is spin off individual threads per XAResource and block on them, and use a coordinating thread (probably the transaction-originating thread) to wait on these blocked threads. Someone from BEA claims WebLogic does this, but I haven't seen confirmation of that, and the approach can eat a lot of threads. No other J2EE TM I know of claims to even attempt doing async invocations.
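
    The thread-per-resource workaround looks something like this in practice - a sketch using java.util.concurrent for brevity (a 1.4-era TM would have to roll its own thread pool):

        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.Callable;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;
        import javax.transaction.xa.XAResource;
        import javax.transaction.xa.Xid;

        final class ParallelPrepare {
            // Fire prepare() at every resource in parallel and block until all votes are in.
            // This amortizes the per-resource disk forces across wall-clock time, at the
            // price of burning one thread per enlisted XAResource (assumes at least one).
            static List<Integer> prepareAll(List<XAResource> resources, final Xid xid)
                    throws Exception {
                ExecutorService pool = Executors.newFixedThreadPool(resources.size());
                try {
                    List<Future<Integer>> votes = new ArrayList<Future<Integer>>();
                    for (final XAResource r : resources) {
                        votes.add(pool.submit(new Callable<Integer>() {
                            public Integer call() throws Exception {
                                return Integer.valueOf(r.prepare(xid));
                            }
                        }));
                    }
                    List<Integer> results = new ArrayList<Integer>();
                    for (Future<Integer> f : votes) {
                        results.add(f.get());      // any XAException surfaces here, wrapped
                    }
                    return results;
                } finally {
                    pool.shutdown();
                }
            }
        }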

        -Mike
  98. relaxing ACID?[ Go to top ]

    Very interesting article and discussion.
    Oracle's Larry Jacobs advised in his techtalk using transactional messaging instead of 2PC when remote resource managers are involved, precisely because of the log-flush/disk-contention mentioned by Mike.

    Almost all uses of messaging that I encountered were in a 'wait for the response' fashion. Usually I implemented this with a timeout that resulted in a 'what you asked will be done, but not now' message and a callback-like application flow when the result becomes available, usually involving an email message.
    I've got the feeling that if you want to join the remote message consumer in the transaction, you are trying to shoehorn the semantics of call-and-return onto fire-and-forget. In other words, the confusion arises out of muddling these semantics, wanting the best of complementary worlds. But that's just a hunch.

    A related topic that I haven't investigated is an approach where each quality of ACID (Atomicity, Consistency etc) can be relaxed individually on a transaction implementation. I guess this would involve some inferencing to make sure what qualities can be guaranteed on the level of the encompassing, multi-resource manager transaction. Maybe this is a topic that is still in the shadows of the academic world. Has anybody more information on this?

    By the way, since we seem to be in the benign theme of finding in Mike's article subliminal easter eggs disguised as spelling mistakes, I'd like to point one out: although I'm not a native English speaker, I'm quite convinced that 'that adheres to ACID principals' smacks more of cynical headmasters than of transactions. Shouldn't that be 'adheres to ACID principles'? :-)

    Great article, looking forward to sequels.
    Groeten uit Nederland,
    Joost
  99. relaxing ACID?[ Go to top ]

    Perhaps JAXTX (JSR 156) or BTP from Oasis? This is more for web services or txns across the internet but the idea seems to be the same : relaxation of some ACID properties to enable transactions in a loosely coupled environment.
  100. relaxing ACID?[ Go to top ]

    oops..make that JSR 95 based on "OMG's Activity Service specification". BEA seems to have voted against 156 due to existence of 95.
  101. relaxing ACID?[ Go to top ]

    oops..make that JSR 95 based on "OMG's Activity Service specification". BEA seems to have voted against 156 due to existence of 95.


    Just a quick correction: JSR 156 (JAXTX) is in existence despite BEA's no vote (the only no vote).

    Mark.
  102. relaxing ACID?[ Go to top ]

    oops..make that JSR 95 based on "OMG's Activity Service specification". BEA seems to have voted against 156 due to existence of 95.

    >
    > Just a quick correction: JSR 156 (JAXTX) is in existence despite BEA's no vote (the only no vote).
    >
    > Mark.

    Oh, and they didn't vote no because of JSR 95 (which they also didn't agree with and didn't work on). They voted no because at the time they believed that it wouldn't be possible to standardise on an API for Web Services transactions.

    Mark.
  103. relaxing ACID?[ Go to top ]

    \joost de vries\
    Oracle's Larry Jacobs advised in his techtalk using transactional messaging instead of 2PC when remote resource managers are involved, precisely because of the log-flush/disk-contention mentioned by Mike.
    \joost de vries\

    Well, a lot of that interview dealt with Oracle being able to do both messaging and RDBMS stuff all within Oracle. Since both sides are using Oracle's underlying transaction manager, it can optimize everything as a local transaction instead of having to use 2PC.

    This is a nice optimization, but it relies on 2 resources being intimately tied to one another. If both sides are a good choice for you, then it's a win. But in the larger scheme of things, you most often have resources which have no knowledge of each other, and this optimization no longer applies. Then you come back to old arguments: do XA with the overhead, or don't do XA and arrange for consistency and recovery yourself in the application.

    \joost de vries\
    Almost all uses of messaging that I encountered were in a 'wait for the response' fashion. Usually I implemented this with a timeout that resulted in a 'what you asked will be done, but not now' message and a callback-like application flow when the result becomes available, usually involving an email message.
    \joost de vries\

    It's sometimes tough to talk about this stuff in a JMS context because JMS embodies two wildly different messaging concepts into one package.

    On one hand, you have a message queue system, and in this area your observations hold - you may be decoupled from the receiver end, but you still want to interact with it in a back and forth manner.

    But on the other hand, you have a pub/sub system. The common example for this is a market data symbol ticker feed - price changes on equity instruments like IBM, Sun, etc. coming in asynchronously in real time. This second model doesn't fit your notion of "wait for a response" at all, because you really have a situation of a few senders effectively "broadcasting" to many receivers, and the broadcasters literally don't care at all who's listening (or even if anyone is listening at all). Peer-like communications systems, which are often used at the application level via JMS, also follow this model. Various components are broadcasting their state, and it's up to other components to use that information or not as they choose, but the broadcasters really need no knowledge of who's listening or why.

    Durable subscribers in the pub/sub realm only serve to muddy the waters even further, bringing some queue-like semantics into a broadcasty realm.

    Personally, I'd argue that the pub/sub use of JMS is far more common than the queue side - for the simple reason that single listeners are so limiting. Even when a single listener seems apropos, it seems inevitable that a second or third listener would be a "natural" architectural expansion at some later date, and hacks start to commence to try to shoehorn multiple listeners (sort of) into a queue system, most often by queue explosion to manage workflows.

    Anyway, I'm getting a bit off track here. If you accept that the queueing domain may be the "lesser" one in JMS (and I realize for some people this is an unbearable stretch!), then you really don't want to build up too much J2EE architecture around a JMS 'wait for response' pattern, because it's not going to be applicable to many real application needs.

    \joost de vries\
    A related topic that I haven't investigated is an approach where each quality of ACID (Atomicity, Consistency etc) can be relaxed individually on a transaction implementation. I guess this would involve some inferencing to make sure what qualities can be guaranteed on the level of the encompassing, multi-resource manager transaction. Maybe this is a topic that is still in the shadows of the academic world. Has anybody more information on this?
    \joost de vries\

    Well, applying all of ACID to JMS in particular is a bit of a stretch if you scrutinize it closely. Because of the way sessions work in JMS, there are fundamentally no locking or visibility issues, and likewise there really can't be anything like a "dirty read" concept or the like in JMS, again because you're not talking about a global resource like RDBMS tables, but instead sessions that are already isolated by design. All of ACID you really see is atomicity and durability - the consistency and isolation sides don't really apply. I think killing atomicity would kind of defeat any transactional system, so you're down to durability. On that front, you can play some useful games. For example, what if you've got a JMS provider in a global transaction, and all the messages therein happen to be NON_PERSISTENT? I don't recall the spec really saying much about message guarantees and XA interactions, but it may be possible to take the stance that a transaction containing all NON_PERSISTENT messages on the JMS side could skip all disk forcing in prepare() and commit(), because by definition the underlying messages aren't supposed to be reliable anyway.
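
    For reference, the NON_PERSISTENT part is just a producer setting (shown below with a plain transacted session); whether a provider then skips log forcing for such messages inside an XA transaction is entirely up to the provider:

        import javax.jms.Connection;
        import javax.jms.ConnectionFactory;
        import javax.jms.DeliveryMode;
        import javax.jms.JMSException;
        import javax.jms.MessageProducer;
        import javax.jms.Session;
        import javax.jms.Topic;

        final class NonPersistentPublish {
            static void publish(ConnectionFactory cf, Topic topic, String text) throws JMSException {
                Connection con = cf.createConnection();
                try {
                    Session session = con.createSession(true, Session.SESSION_TRANSACTED);
                    MessageProducer producer = session.createProducer(topic);
                    producer.setDeliveryMode(DeliveryMode.NON_PERSISTENT);  // "unreliable by contract"
                    producer.send(session.createTextMessage(text));
                    session.commit();   // under a global transaction an XASession driven by the TM would do this instead
                } finally {
                    con.close();
                }
            }
        }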

    The RDBMS side, and similar systems, are much tougher to play similar games with. By their natures, you're working with shared tables and shared rows, and locking and visibility issues rear their ugly heads. Try to relax ACID too far, most particularly the Durability piece, and you can easily end up with a "database" that can be trivially put into a highly inconsistent state, or trivially ends up locking up resources with no way to automatically release these locks in a safe and consistent manner.

    The best approach I've seen which doesn't use XA, and best here is a relative term, is to use all/most local transactions and compensating transactions to do rollbacks. It's "best" in that you can make it work, but it's nowhere near as transparent as XA is to the end-user. You have to do _a lot_ of work to write those compensating transactions, and even in straightforward processing you can end up with inconsistent state being visible to users for longer periods than application developers are normally used to. The app developer has to live with a system where, if a failure happens somewhere, various resources can be wildly inconsistent with each other for many seconds until everything is brought back up and "compensated" properly. The XA side has smaller windows of inconsistency, resolution w/out application code, and makes some trade-offs that end up with you having to deal more with long-lived locks than anything else. This is not to say that XA systems can't be inconsistent, they can be, but usually inconsistency issues are more easily understood by developers, and resolution in the face of problems comes free in terms of developer coding time.

    \joost de vries\
    By the way, since we seem to be in the benign theme of finding in Mike's article subliminal easter eggs disguised as spelling mistakes, I'd like to point one out: although I'm not a native English speaker, I'm quite convinced that 'that adheres to ACID principals' smacks more of cynical headmasters than of transactions. Shouldn't that be 'adheres to ACID principles'? :-)
    \joost de vries\

    Great Caesar's Ghost! I must have had some serious childhood trauma or something judging by all these slips. I might have to trundle off to a shrink to figure out why all these monkies, apartment dwellers, and student authority figures are tripping out on acid in my head....

         -Mike
  104. relaxing ACID?[ Go to top ]

    Yes Ghanshyam, you're right, it was in the context of long-running workflow transactions over web services that the topic of relaxing ACID arises. Although with Mike's detailed comments I'm not sure what exactly that would entail or offer. But the JSR and BTP specs are a good starting point to find out. Thanks!

    Mike, you're right of course. I had forgotten the crucial point of Larry Jacobs' story: that Oracle implements both in the database, with the specific advantages you mention. Thanks for the rest of your elucidations.

    Joost
  105. Question for Mike Spille please[ Go to top ]

    Hi Mike.

    So from your article, I now understand that the "prepare" phase is where the main overhead occurs compared to a one-phase commit.

    But say my system is operating in "batch" mode, whereby every hour I transfer messages from an MQ Series queue into an Oracle database from within my TP monitor. Let's assume that on average 20,000 messages are transferred, and that they are transferred as a single transaction (rather than 20,000 separate transactions). Am I correct in saying that the extra overhead of 2PC in this case is likely to be negligible compared to the overall effort of inserting 20,000 records into Oracle whilst deleting 20,000 records from the MQ Series queue?

    Thanks

    Paddy
  106. Question for Mike Spille please[ Go to top ]

    \Paddy Murphy\
    So from your article, I now understand that the "prepare" phase is where the main overhead occurs compared to a one-phase commit.

    But say my system is operating in "batch" mode, whereby every hour I transfer messages from an MQ Series queue into an Oracle database from within my TP monitor. Let's assume that on average 20,000 messages are transferred, and that they are transferred as a single transaction (rather than 20,000 separate transactions). Am I correct in saying that the extra overhead of 2PC in this case is likely to be negligible compared to the overall effort of inserting 20,000 records into Oracle whilst deleting 20,000 records from the MQ Series queue?
    \Paddy Murphy\

    In general terms, the more work you perform within a single transaction, the less impact you're going to see from XA overheads. But at the same time, I've focused on the _minimum_ added overheads you're going to see from XA, not maximums.

    Your specific example is an interesting one. Without a doubt, if you organized this into 20,000 individual transactions, with each transaction consuming one message from MQ and inserting one record into Oracle, then XA is going to be a very significant chunk of the overall processing time. But re-organizing it into one giant transaction of 20,000 consumptions and 20,000 inserts? In that case, strict XA overhead would probably be minimal, but at the same time you'd be stressing the general transaction systems on both sides. On the MQ side, it would have to track all 20,000 messages being consumed until a commit or rollback happened, and this could be a bit much for it to handle. Likewise, if you did a straightforward reorg of 20,000 inserts in one transaction, Oracle would have to track all that information in its tran log. It's possible in such a situation that you could blow out limits on either side - for example, you could fill up the tran logs completely on either side.

    In any case, you'd have one very big and presumably fairly long transaction, which would almost certainly greatly minimize XA overhead, but could have negative repercussions elsewhere.

    When you start talking about doing a whole bunch of work transactionally, that's where you have to start making judgement calls and making choices carefully. Some people would "chunk" the work into sizes more palatable to the transaction system, maybe 1,000 messages at a time. This would require higher-level application logic to effectively do a "compensating transaction" if a failure happened somewhere along the line. Or you could possibly take a bulk-copy approach to blast the data into Oracle - which has its own trade-offs.
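
    For what it's worth, the chunked variant might look roughly like this under JTA - a sketch only: error handling/rollback, compensation bookkeeping, and the hypothetical "inbox" table are all stand-ins:

        import java.sql.Connection;
        import java.sql.PreparedStatement;
        import javax.jms.Message;
        import javax.jms.MessageConsumer;
        import javax.jms.TextMessage;
        import javax.sql.DataSource;
        import javax.transaction.UserTransaction;

        final class ChunkedTransfer {
            // Consume up to chunkSize messages and insert them into Oracle inside one XA
            // transaction; repeat until the queue runs dry. Each chunk is 2PC'd on its own.
            static void drain(UserTransaction ut, MessageConsumer consumer, DataSource ds,
                              int chunkSize) throws Exception {
                while (true) {
                    ut.begin();
                    int n = 0;
                    Connection db = ds.getConnection();   // an XA-enlisted connection from the app server
                    PreparedStatement ps = db.prepareStatement("INSERT INTO inbox(payload) VALUES (?)");
                    try {
                        Message m;
                        while (n < chunkSize && (m = consumer.receive(1000)) != null) {
                            ps.setString(1, ((TextMessage) m).getText());
                            ps.addBatch();
                            n++;
                        }
                        ps.executeBatch();
                    } finally {
                        ps.close();
                        db.close();
                    }
                    ut.commit();                          // one prepare/commit round per chunk
                    if (n < chunkSize) break;             // queue drained
                }
            }
        }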

         -Mike
  107. Mike, where can we get a copy of your article? I used to see it at www.jroller.com. But it seems the article is not showing any more. Thanks, Sid
  108. My bad. It is there.