Discussions

News: XA Exposed, Part II

  1. XA Exposed, Part II (38 messages)

    Mike Spille started quite a discussion with his first article on distributed transactions. His second part goes deeper into the world of XA and the issues that exist there. He goes into detail on failure categories, recovery, and the nooks and crannies of XA.

    "Now, with Part II, we're going to go into alot more depth on this subject. Part I really covered only the easy parts when nothing goes wrong, with just a few references to failure states and recovery. Here in Part II we're going to dive deep into the murky depths of rollbacks, heuristic decisions, optimizations like Presumed Abort, the 1PC optimization, and similar more advanced concepts. In programming, they say the last 10% takes 90% of the effort, and the same is true when trying to explain the XA protocol. When everything works, XA is pretty easy to explain. But when you have to deal with correctness, and durability, and you wanna do it all as fast as possible, then you're talking a whole 'nuther kettle of fish. All the corner cases and exceptional conditions take alot more effort to deal with and explain then the straight line stuff. So gird your loins, cinch in your belt, and let's get it on...."

    Read Mike Spille on XA Exposed, Part II

    Read the TSS discussion of the first part

    Threaded Messages (38)

  2. XA Exposed, Part II

    Man,
    That's the funniest thing I've read in ages, pretty cool.

    Billy (now checking for Satan in the toilet...)
  3. XA Exposed, Part II

    Good that this has caught your attention. Can you tell us what WebSphere uses as its TM?
  4. XA Exposed, Part II

    WebSphere has its own TM which is owned by a team in Hursley, UK.

    Billy
  5. XA Exposed, Part II

    Is it Component Broker?
  6. XA Exposed, Part II

    No,
    WAS never used the Component Broker TM. CB was a C++ app server and was basically all C++ code. The WAS TM ancestry goes back to Encina rather than CB. The Java TM that WAS uses was originally written by a bunch of ex-Encina guys from Pittsburgh, AFAIK. It was later moved to IBM Hursley and has seen significant modifications, especially for 5.1.

    Billy
  7. WAS TM in 5.x

    Billy, are there any significant changes to the transaction log side of the house in WAS 5? I have a pretty good idea how things operate in 4.x, but none at all in 5, and I'm curious if anything's been changed/overhauled/optimized.

    I'd also be curious to know the WebLogic tranlog POV, and what approach Geronimo might be taking in this area as well.

        -Mike
  8. Geronimo uses JOTM

    http://jotm.objectweb.org/

    Maybe the TM will be pluggable in Geronimo?
  9. Geronimo uses JOTM

    Maybe the TM will be pluggable in Geronimo?


    The TM will be pluggable. Currently we use our own TM, which does not do logging, but we plan on using JOTM in the future.
  10. Geronimo uses JOTM

    Does JOTM have its own transaction log implementation at this time? I wasn't able to find one after a brief perusal of the source....

        -Mike
  11. Does JOTM have its own transaction log implementation at this time?


    JOTM doesn't have a log implementation yet. A new ObjectWeb project (HOWL [1]) has been created to cover that need. This new project is BSD licensed so that Apache and ObjectWeb can collaborate on a common transaction log implementation.

    jeff

    [1] http://forge.objectweb.org/projects/howl
  12. HOWL

    Thanks for the reference, Jeff.

    I checked out HOWL, but was a bit disappointed. It sounds like a neat project, but at the same time it looks like overengineering for someone like me who just wants to see a transaction log implementation.

    Let me explain my point of view. Especially for a transaction manager, transaction logging is actually a highly specialized activity, and at the same time surprisingly simple. At its core, you're disk forcing once per transaction (assuming you get past the prepare() phase), and have tiny records containing little more than an identifier, an Xid. A tiny "DONE" record is additionally required, but unforced.

    Looking at HOWL, the goal seems much grander. From what I've seen, there are two separate facilities being considered. On one hand, there's a tracing facility of some sort, very vaguely related to logging, which doesn't concern me. On the other hand, there's apparently a very ambitious general-purpose journaling system in the early stages of design. This would be used not just for transaction logging, but for many other areas where binary persistent data is needed with a recovery mechanism.

    That's all well and good, but I'm rather disappointed that this approach is being taken. A really good TM log, specific to TM transaction logging usage, could be written in a man-month or less, and even a relatively straightforward implementation could put you in the 200+ TPS range on average hardware - which is competitive with commercial offerings.

    My motivation here is purely for personal (well, corporate-personal) gain: I'd hate to see real XA transactional integrity for Geronimo be delayed a significant amount of time waiting for a grand journalling system to come on line. The Geronimo group, and the ObjectWeb folks as well, may have other agendas and requirements to fulfill, but waiting for a full journalling file package just doesn't seem right. Perhaps (?) fully safe and compliant XA transactions aren't a priority for Geronimo, but it would be a huge advantage if they were - it would be a major differentiator from competitors. I imagine ObjectWeb would also directly benefit :-)

    For what it's worth, I put together a bunch of notes for my next XA Exposed installment on implementation stuff, including a TM transaction perspective. Perhaps someone will find it useful, since I'm not sure when I'll have the time to finish that article...

         -Mike



    If a TM forces only "Committing..." records, and it crashes with transactions in-flight, then this means transactions within the XAResources can be in one of these states:

       1) XAResource has transaction(s) in flight, pre prepare() call.
       2) XAResource has transaction(s) in flight, voted yes to prepare()
       3) XAResource had commit called
       4) No in flight trans, no problem ;-)
      
    State #1 we ignore - if we haven't called prepare() on a resource and crash, by definition we assume rollback. Now we just have to make sure every resource eventually rolls back. For state #1 itself, if a given resource hasn't been prepare()'d, we let it die within the resource. This happens within the XAResource either by detecting that its connection from the TM was dropped, in which case rollback happens, or by noting that no activity has happened to a transaction for X seconds, and thereby timing out the transaction (timeouts are _very_ important in XA if you don't want resources hanging during failure scenarios).

    For state #2, this could be a global transaction which we should commit, or it could be one we should rollback (for instance, if another resource is in state #1 above). See below on how to resolve this.

    For state #3, the resource won't report this back to the TM, so we won't see it in the TM at recovery time :-) See below for this.

    State #4 is of course the ideal state if a failure occurs, no transactions were in flight and there's nothing to do.

    The above states per XAResource are interpreted by checking "Committing..." records in the TM vs. what the XAResources report, and figuring out the correct resolution of the transaction. In practice, when a TM goes down "abby-normal", it scans its logs for transactions marked "Committing..." but not "Done". It then reconstitutes its XAResources from info in the logs, and calls recover() on each. Each recover call returns an array of Xids (Xid[]). Each entry in this array, if any, is by definition in the prepared state (prepare() has been called, but not commit()). This is called an in-doubt transaction. The TM gathers up these in-doubt Xids, tries to match them against the global transaction IDs in the log in "Committing..." state, and uses this algorithm:

      1) If there's a "Committing..." record in the TM logs w/ no "Done", we _must_ ensure this transaction commits in all resources. We check all resources recover() responses, and if there are any Xid's in there that match these committing records, we call commit() on them. Then we write "Done" in the TM log for each of those transactions.

      2) If there are no Xids from recover() responses which match a "Committing..." record, that's OK - that just means that everything was committed but we didn't get a chance to write a "Done" record. So we write "Done" in the TM logs for these.

      3) If we find Xids left over from the above, where some resources responded to recover() with Xid's that don't match a dangling "Committing..." record, we roll 'em back. It doesn't matter if all the resources voted "yes" to prepare and are just aching to commit, and the TM just didn't get a chance to write "Committing". We just roll these suckers back.

    The key is that a "Committing" record lets the TM distinguish between in-doubt transactions in the XAResources which should be committed vs. those that should be rolled back.

    As an aside, "Done" records don't need to be disk forced. The above algorithm makes it so you don't have to.
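
    To make that concrete, here's a minimal Java sketch of the resolution pass just described. The TranLog interface and the format ID constant are invented for illustration; note the format-ID filter, which keeps you from touching branches owned by some other TM (more on that in the follow-up below):

      import java.nio.ByteBuffer;
      import java.util.List;
      import java.util.Set;
      import javax.transaction.xa.XAException;
      import javax.transaction.xa.XAResource;
      import javax.transaction.xa.Xid;

      // Hypothetical log interface - just the two things this algorithm needs.
      interface TranLog {
          Set<ByteBuffer> danglingCommittingGtrids(); // "Committing..." with no "Done"
          void writeDone(byte[] gtrid);               // unforced, lazy write
      }

      public class RecoverySketch {
          static final int OUR_FORMAT_ID = 0x4D494B45; // invented format ID for "our" TM

          public void resolve(TranLog log, List<XAResource> resources) throws XAException {
              Set<ByteBuffer> committing = log.danglingCommittingGtrids();
              for (XAResource res : resources) {
                  Xid[] inDoubt = res.recover(XAResource.TMSTARTRSCAN | XAResource.TMENDRSCAN);
                  if (inDoubt == null) continue;
                  for (Xid xid : inDoubt) {
                      if (xid.getFormatId() != OUR_FORMAT_ID) continue; // not ours, hands off
                      ByteBuffer gtrid = ByteBuffer.wrap(xid.getGlobalTransactionId());
                      if (committing.contains(gtrid)) {
                          res.commit(xid, false); // rule #1: drive the commit home
                      } else {
                          res.rollback(xid);      // rule #3: no Committing record, roll it back
                      }
                  }
              }
              // Rule #2 (and the tail of #1): every dangling Committing record is now
              // fully resolved, so write its "Done" record - no disk force required.
              for (ByteBuffer gtrid : committing) {
                  log.writeDone(gtrid.array());
              }
          }
      }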

    =========================================================

    With the above info in hand, some "fast" implementations fall out kind of naturally.

    First, record XAResource reconstitution information only once when it's first used in the container along with a handy global ID. This can be done in either the tran log directly, or elsewhere.

    The records in the tran log can be very small - all you need is an ID for Committing..., the global tran ID (the global part of the Xid), and the list of XAResource IDs involved (see previous point). As a matter of fact you really don't need the XAResource list, since you can infer this info from the global transaction ID and the results of recover() calls.

    Since records are small, try to avoid serialization at all costs - hand-coding I/O code is a pain, but there's not much to code, and you'll go _much_ faster and eat almost no CPU at write time.

    "Done" records just need a "Done" ID and the global part of the Xid. No forcing needed.

    The critical bottleneck is going to be disk forcing - records are small so their I/O really doesn't count. Because of this, batch up disk forces, not I/O writing.

    On disk batching - basically, dedicate a thread to do actual disk forcing. Have a queue in front of this thread containing "disk force" requests. Since disk forces can take anywhere from 5-50 milliseconds, at high transaction volumes you can actually batch many disk force requests from various transactions into one.
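
    A sketch of that dedicated force thread, assuming each transaction thread has already written its record bytes to the channel before queueing its force request and blocking on the returned future:

      import java.io.IOException;
      import java.nio.channels.FileChannel;
      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.CompletableFuture;
      import java.util.concurrent.LinkedBlockingQueue;

      // Group commit: many transactions share one physical disk force.
      public class ForceThread extends Thread {
          private final FileChannel log;
          private final LinkedBlockingQueue<CompletableFuture<Void>> pending =
                  new LinkedBlockingQueue<>();

          public ForceThread(FileChannel log) { this.log = log; }

          // Called by a transaction thread after its record is written.
          public CompletableFuture<Void> requestForce() {
              CompletableFuture<Void> f = new CompletableFuture<>();
              pending.add(f);
              return f; // caller does f.join() and blocks until the batch is forced
          }

          @Override public void run() {
              List<CompletableFuture<Void>> batch = new ArrayList<>();
              try {
                  while (true) {
                      batch.add(pending.take());   // wait for at least one request
                      pending.drainTo(batch);      // grab everything queued meanwhile
                      log.force(false);            // ONE force covers the whole batch
                      for (CompletableFuture<Void> f : batch) f.complete(null);
                      batch.clear();
                  }
              } catch (InterruptedException | IOException e) {
                  for (CompletableFuture<Void> f : batch) f.completeExceptionally(e);
              }
          }
      }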

    Also, since records are small and short-lived, use twin fixed-size, pre-initialized logs. You pre-initialize two logs at startup with an indicator somewhere as to which file is "active". You pre-init to ensure all disk structures for the files are also pre-allocated. Then, as transactions occur, you just keep writing forward until you approach the end of the active log. At this point, you need a pause. You take all pending/dangling Committing... records at that moment in time and copy them to the top of the inactive log, you flip an indicator to indicate that the inactive log is now the active log, and again start writing from the top (or from the end of the copied records).
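
    The flip itself might look something like this sketch; the high-water mark and the one-byte indicator file are illustrative choices, not the only way to do it:

      import java.io.IOException;
      import java.nio.ByteBuffer;
      import java.nio.channels.FileChannel;
      import java.util.List;

      // Twin flip-flop logs: two pre-allocated, pre-initialized files, one active.
      public class TwinLog {
          private static final long HIGH_WATER = 8 * 1024 * 1024; // flip near 8MB, say
          private final FileChannel logA, logB, indicator; // indicator: 1 durable byte
          private FileChannel active;

          public TwinLog(FileChannel a, FileChannel b, FileChannel ind) {
              logA = a; logB = b; indicator = ind; active = a;
          }

          // Call with the still-dangling Committing... records when near the end.
          public synchronized void maybeFlip(List<ByteBuffer> dangling) throws IOException {
              if (active.position() < HIGH_WATER) return;
              FileChannel next = (active == logA) ? logB : logA;
              next.position(0);
              for (ByteBuffer rec : dangling) next.write(rec.duplicate()); // carry forward
              next.force(false); // carried-forward records must hit disk first
              indicator.write(ByteBuffer.wrap(new byte[] { (byte) (next == logA ? 0 : 1) }), 0);
              indicator.force(false); // the flip itself must be durable
              active = next;
          }
      }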

    The log reading side at startup must be resilient to log corruption - this can happen if someone pulls the plug under heavy volume. Thanks to Presumed Abort, this is OK - if you hit corruption in the active log at startup, stop at the last known "good" record. Any "Committing..." records in this known-good zone of the logs may need commits issued on XAResources. Anything else the XAResources report via their recover() methods, you rollback.

    Have a durable indicator somewhere indicating good shut down. You can use this indicator to avoid unnecessary querying of the XAResources on startup if a good shut down was effected (e.g. only try recovery if you need to).

    Take Xid generation seriously, especially if you ever plan to cluster servers. Basically, each server needs to be able to recognize its own Xids, and to ignore other Xids (either from your own code, or from someone else's!). To do that, you should pick a format number that identifies your TM's transactions, and then encode a unique per-server ID into the branch qualifier so that a given server instance can tell its own transactions from another server's on the cluster.
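
    For illustration, a bare-bones Xid along those lines - the format ID value and the server-ID-plus-counter layout of the branch qualifier are arbitrary choices, and both arrays have to stay within Xid.MAXGTRIDSIZE/MAXBQUALSIZE (64 bytes each):

      import javax.transaction.xa.Xid;

      // A TM-owned Xid: private format ID, plus a per-server tag in the branch
      // qualifier so a clustered server can recognize its own transactions.
      public class ServerXid implements Xid {
          static final int FORMAT_ID = 0x4D494B45; // invented; pick one, keep it forever

          private final byte[] gtrid;
          private final byte[] bqual; // leading bytes: server ID, trailing 4: counter

          public ServerXid(byte[] gtrid, byte[] serverId, int counter) {
              this.gtrid = gtrid.clone();
              this.bqual = new byte[serverId.length + 4];
              System.arraycopy(serverId, 0, bqual, 0, serverId.length);
              // hand-coded int - same "no serialization" rule as the log records
              bqual[serverId.length]     = (byte) (counter >>> 24);
              bqual[serverId.length + 1] = (byte) (counter >>> 16);
              bqual[serverId.length + 2] = (byte) (counter >>> 8);
              bqual[serverId.length + 3] = (byte) counter;
          }

          public int getFormatId()               { return FORMAT_ID; }
          public byte[] getGlobalTransactionId() { return gtrid.clone(); }
          public byte[] getBranchQualifier()     { return bqual.clone(); }
      }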

    The keys overall to such an implementation: only disk force "Committing" records, keep Committing records small, don't use serialization, and batch disk forces. A possible optimization would be to use multiple log files concurrently, as several people have mentioned on TSS.
  13. HOWL

    You've overlooked an important piece of info in your algorithm:

    If we find Xids left over from the above, where some resources responded to recover() with Xid's that don't match a dangling "Committing..." record,... We just roll these suckers back.

    You only roll back the branches represented by Xids generated by your TM instance. This is typically done by first filtering the Xids returned from the RM prior to comparing them with what you've logged. If the filtering doesn't occur, you're likely to roll back transactions owned by a different TM or an instance of your TM running in a different server. Basically,

    The TM gathers up these in-doubt Xids, tries to match them against the global transaction IDs in the log in "Committing..." state

    should state that you only deal with the subset of Xid's your TM owns.
  14. HOWL

    Thanks for the corrections, Matthew. I got it right in the article, got it wrong in the post here :-( In the article it's marked as step #4 in Straight Line Recovery.

        -Mike
  15. XA Exposed, Part II

    I just want to point this out to people: a price list for solid state disks. Mike mentions EMC RAID arrays with gigs of memory, with the latency still being 5ms, whereas a solid state disk should do much better:

    http://www.storagesearch.com/ssd-buyers-guide.html

    One of the disks included claims 250,000/s throughput and a latency of around 10 us (i.e. micro-s, not milli-s). Note: I don't sell solid state drives :-) but I think they are relevant here. There's a white paper on this particular drive here:

    http://www.texmemsys.com/files/f000165.pdf

    The main drawback to using this drive is the price. The buyers' guide above claims that it costs $35,000. But for a lot of organizations it might be very doable.

    Perhaps Mike's EMC array only achieves 5ms latency because it's shared with a lot of other users, but I actually think it works differently. The drive described above saves data to memory, and backs it up to disk in the background (with battery backup). I am not sure if EMC arrays attempt to do this. If anyone knows a lot about disk drives, I would love to get their input here.
  16. XA Exposed, Part II

    I did a bit of research to answer my own question (why Mike's EMC array is exhibiting 5ms latency).

    This is a nice EMC whitepaper talking about the uses of NVRAM (battery-backed RAM) to take the disk out of the critical path (see pp. 5-7):

    http://www.emc.com/products/product_pdfs/performance.pdf

    A quote:

    "[EMC's] Symmetrix cache is not directly accessible [..] and is accessed indirectly using SCSI commands, which take on the order of several hundreds of microseconds [..]"

    So whereas a solid state disk by itself might achieve 10us latency, going over SCSI increases the latency by a factor of ten already.

    Then I found this benchmark, which is quoted in the above paper:

    http://www.spec.org/sfs97r1/results/sfs97r1.html

    This is an NFS benchmark, showing throughput and latency, and it shows latencies on the order of 2-3ms. This one is about 1ms - it's an IBM system:

    http://www.spec.org/sfs97r1/results/res2003q4/sfs97r1-20031009-00163.asc

    Interestingly, the EMC white paper makes the point that for availability you need to use shared-disk, which implies that the NVRAM has to be inside the disk, and that this is why the latency goes up. It also talks about systems which put the NVRAM *outside* the disk, and achieve minute latencies, but with less availability.

    In any case, even if the latency is a few milliseconds, the throughput is huge for any of these disk arrays.

    Mike: can you quickly remind me why the high latency slows down the entire system? Where is the contention? Do all your transactions update the same data items? You mentioned somebody's suggestion that multiple logs should be used. Would those contain records from non-conflicting transactions?
  17. XA Exposed, Part II

    The way most SAN servers that harden to NV and then do async writes to DASD work is like this:

    The disks are shared between two SAN servers typically on redundant fiber loops. One SAN server is typically the primary for a subset of the disks and the other for the rest of the disks.

    SCSI/iSCSI/FCP requests go to the primary. The primary writes to the NV and also writes to the standby SAN server's NV. Effectively, we have mirrored NV. Then the SCSI/iSCSI/FCP command completes. If the primary fails then the standby takes over and it all keeps chugging along.

    This mirroring adds to latency as it is usually done over a fiber ring possibly shared with the DASD loops. So, there is one hop to the SAN, max of two for the mirror operation (req/resp), and the normal response so it's 2x the latency (4 hops instead of 2) because of the availability aspect which isn't optional for most customers.

    When the blocks are copied to the DASD, these copies result in items being removed from the NV and hence also require the standby to be kept up to date which adds traffic to the loops.

    Exactly how this is all done depends on the vendor but I know EMC, IBM (FASTT/ESS) and NetApp all have this kind of mechanism in the products.

    An alternative to SAN servers is to use SSDs, which are sold by vendors like Imperial Technology and are basically battery-backed DRAM with a notebook drive inside for permanent storage in case the battery goes.

    I remember before this mechanism reading in a manual, "On a failure, remove the DIMM module with the battery and plug it in to the new system..."

    Billy
  18. Disky Business

    \Guglielmo Lichtner\
    Interestingly, the EMC white paper makes the point that for availability you need to use shared-disk, which implies that the NVRAM has to be inside the disk, and that this is why the latency goes up. It also talks about systems which put the NVRAM *outside* the disk, and achieve minute latencies, but with less availability.
    \Guglielmo Lichtner\

    If you're interested in High Availability, the above really is key. You basically want a "disk system" of some sort which is linked to more than one system. You may typically only have the device actually mounted on one system, but if that system fails you chop off access from the failed side and then mount on the "secondary" machine.

    In that sort of scenario, you really don't want caching at the controller level, but at the true shared component part e.g. inside of the shared "disk system".

    Looking at the links you posted, some of the solid state drives listed would be inappropriate in this role since they can be physically attached to only one machine at a time. If you're not going for HA, then of course you can get reasonable durability, so long as you don't mind losing everything if the machine goes down.

    I haven't researched enough into your links to see how these systems deal with multi box access (if they do at all). When I get a moment I'll check it out....

    \Guglielmo Lichtner\
    Mike: can you quickly remind me why the high latency slows down the entire system? Where is the contention? Do all your transactions update the same data items? You mentioned somebody's suggestion that multiple logs should be used. Would those contain records from non-conflicting transactions?
    \Guglielmo Lichtner\

    Generally people do transaction logs as write-forward only. You never seek around, and you never read from the log (unless you're just coming up). In addition, your log writes obviously need to be atomic in nature - you want whole records to go out intact as one unit. This atomic writing implies that you have to serialize on the log access.

    The idea of using multiple logs may still result in contention on the disk itself, if you're putting the logs all on the same disk, but removes some of this serialized contention - you could distribute transactions over "N" logs, probably in a sticky fashion with a round-robin distribution when new transactions come in. This will remove any Java-level contention, and possibly contention within the OS as well. If the logs are on different physical devices, you could enjoy a great deal more parallelism. All of this is in theory, at least as far as I'm concerned, since I haven't actually tried this myself yet. But I think I will be trying it sometime in the near future.
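
    The sticky assignment itself is trivial - something like the sketch below, where L is whatever per-file log handle you're using. Each transaction calls assign() once when it starts and then writes all of its records (Committing..., Done) to that one log:

      import java.util.concurrent.atomic.AtomicInteger;

      // Sticky round-robin over N logs, so recovery can scan the files independently.
      public class LogRouter<L> {
          private final L[] logs;
          private final AtomicInteger next = new AtomicInteger();

          public LogRouter(L[] logs) { this.logs = logs; }

          public L assign() {
              // floorMod keeps the index positive if the counter ever wraps
              return logs[Math.floorMod(next.getAndIncrement(), logs.length)];
          }
      }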

         -Mike
  19. Disky Business

    I've been playing with this idea also, but the problem with round-robining over multiple logs is that recovery needs to run over the multiple files rather than just a single file. As far as IO being better, I'm skeptical: if you're using a SAN for the log files then it's kind of moot anyway, as the SAN is writing to NV and not physical disks, so multiple LUN volumes aren't really helping IO performance.

    The main benefit potentially with multiple log files would not be IO but more concurrent paths through the TM to avoid it bottlenecking around the typical group commit type code that can become a real bottleneck with very high transaction rates especially when you're response time sensitive. But again, the multiple physical logs would be more complicated to do recovery on when restarting.

    Billy
  20. Disky Business

    The main benefit potentially with multiple log files would not be IO but more concurrent paths through the TM to avoid it bottlenecking around the typical group commit type code that can become a real bottleneck with very high transaction rates especially when you're response time sensitive. But again, the multiple physical logs would be more complicated to do recovery on when restarting.


    I was trying to figure out if Oracle supports multiple redo logs. The regular server doesn't (it supports copies, for availability). But Parallel Server does support parallel logs:

    http://download-east.oracle.com/docs/cd/A87860_01/doc/index.htm

    Quote:

    "In Oracle Parallel Server, each instance writes to its own set of online redo log files. The redo written by a single instance is called a "thread of redo". Each online redo log file is associated with a particular thread number. When an online redo log is archived, Oracle records its thread number to identify it during recovery."

    I wonder if it is as simple as reading the SCN (system change number) for each transaction and applying the data blocks in order?
  21. Disky Business

    \Billy Newport\
    I've been playing with this idea also, but the problem with round-robining over multiple logs is that recovery needs to run over the multiple files rather than just a single file.
    \Billy Newport\

    The added complexity doesn't seem all that high to me. If you can recover from 1 file, why not from "N"? Of course I'd constrain things so a transaction would be sticky to 1 file (or logical pair of files if a flip-flop dual set of files is used).

    \Billy Newport\
    The main benefit potentially with multiple log files would not be IO but more concurrent paths through the TM to avoid it bottlenecking around the typical group commit type code that can become a real bottleneck with very high transaction rates especially when you're response time sensitive. But again, the multiple physical logs would be more complicated to do recovery on when restarting.
    \Billy Newport\

    Well, it's really a question of blocking not just in the TM, but blocking from the TM all the way down to the disk, with everything in between. Using multiple files _may_ open multiple areas for concurrency, not necessarily just in the TM but through the OS (and maybe JVM) as well.

    You may be right that it might not be a huge win for a SAN, but perhaps it could be. However, at the same time not everybody has a SAN. Using multiple files could be a huge boon for people using plain old disks, or low-end RAID arrays, and it gives them the possible option of splitting files over multiple devices. This can increase your risk a bit (more disks mean more possible failures), but at the same time you can theoretically almost double your throughput if you have 2 active tran logs on 2 separate devices.

         -Mike
  22. Disky Business

    > The added complexity doesn't seem all that high to me.
    > If you can recover from 1 file, why not from "N"?

    I agree. You could just add a pass in front of your recovery procedure, collect all log files into a single one, and then use your old recovery stuff.

    The problem I see is keeping the logs in a time sequence. You certainly can't use a simple timestamp here, but must use some singleton to generate sequence numbers, which adds synchronization again.

    But since redo logs aren't a new technique and have been used for decades, there must be a bunch of books and research available on how to implement them.
  23. Disky Business

    \Andreas Mueller\
    The problem I see is keeping the logs in a time sequence. You certainly can't use a simple timestamp here, but must use some singleton to generate sequence numbers, which adds synchronization again.
    \Andreas Mueller\

    A good point which I hadn't considered. Post log reading, you'd want to sort the transactions properly. In my own work I wanted a key that was shorter and easier to match on than an Xid, so I used a combination of System.currentTimeMillis() (a long) along with a monotonically increasing integer sequence number. The downside of this approach is that, for whatever the resolution of the time call is, I can only allow a few billion unique transactions per tick. Fortunately I'm nowhere near a few billion trans a second, let alone a few billion trans on a sub-second scale, so I'm safe for now.

    Taking this approach avoids having to use an external sequence number generator, and I don't have to worry about collisions because the ID is used only within individual processes and their logs, not used inter-process.
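
    In code, such a key is just a pair; the class name and layout here are made up, but the scheme is exactly as described:

      import java.util.concurrent.atomic.AtomicInteger;

      // Per-process key: wall-clock millis plus a monotonically increasing sequence.
      // Only unique within one process and its own logs - never used inter-process.
      public final class LocalTranId {
          private static final AtomicInteger SEQ = new AtomicInteger();

          public final long millis;
          public final int seq;

          private LocalTranId(long millis, int seq) { this.millis = millis; this.seq = seq; }

          public static LocalTranId next() {
              return new LocalTranId(System.currentTimeMillis(), SEQ.getAndIncrement());
          }
      }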

         -Mike
  24. 2PC + Oracle + WAS 4.x disagreement

    Maybe all you gurus can get us out of a bind, or at least help us lay blame on the guilty. We have a setup where WAS transactions have two XA DataSources that ultimately point to different users of the same Oracle 8 database instance. WAS issues the following sequence of commands to Oracle:

    1. res1.end()
    2. res1.prepare()
    3. res2.end()
    4. res2.prepare()

    Oracle gets its panties all up in a bunch (throws an exception) because there are prepares occurring before the ends have happened.

    My question is, whose product has the bug? Does either JTA or XA dictate that the WebSphere TM should initiate all its end()s before its prepare()s, or is Oracle being oversensitive since end and prepare come in order on each resource?

    Not to rain on the XA parade, but at least one of these two major vendors has a bug that keeps us from being able to use XA at all.

    Of course any hidden flags that would work around the problem (e.g. get WebSphere to reorder the calls) would lead to eternal gratitude.
  25. 2PC + Oracle + WAS 4.x disagreement

    Hmm, it could be caused by this:

    http://www-1.ibm.com/support/docview.wss?rs=860&context=SW600&q1=Oracle&uid=swg21109954&loc=en_US&cs=utf-8&lang=en+en

    This apparently involves coupling of transaction branches, which sounds a lot like your problem. According to the report:

    \IBM\
    Oracle must be at least the 8.1.7.4 or 9.2.0.2 level and patch 2511780 (obtained from Oracle Corporation) must be applied. If a higher version of Oracle is used, please verify with Oracle if patch 2511780 is included.
    \IBM\

    This goes against 4.0x and looks like it was posted in May of 2003.

    In general, for this sort of thing you should search the IBM and Oracle support sites.

    \Charles Bear\
    Not to rain on the XA parade, but at least one of these two major vendors has a bug that keeps us from being able to use XA at all.
    \Charles Bear\

    Oh, I understand where you're coming from, I most definitely do. In Part II of my article, I closed by saying:

    \XA Exposed, Part II\
    But what application programmers, architects, and managers need to be aware of is that not everyone plays nice together, and failures in an XA environment can be a tangled skein to unwind. It's fairly well known in the industry that some RDBMS' XA drivers are full of problems (*cough* Oracle *cough*). What's worse, many major JMS implementations switch off disk forcing altogether by default. This makes them look really, really good on benchmarks, and makes you look really, really bad if Little Johnny pulls the plug on your JMS server at an inopportune moment. And the lack of clarity in the JTA specs has led to TMs that are a mass of spaghetti code, trying to deal with many esoteric XA errors, each of which seems to pop up in at least one of the XAResource implementations out there.
    \XA Exposed, Part II\

    Your particular problem falls squarely on the super-fuzzy JTA specification. In particular, each resource in an XA system is supposed to have a concept of "resource manager identity", where the transaction manager can call "isSameRM()" and possibly join transaction branches (this topic is way beyond the scope of any of my articles - it would have to be XA Exposed, Part MIV). The problem is that the JTA spec, and also X/Open, don't really define clearly what isSameRM() is used for, what it means, or what the rules for "resource manager equality" really are.

    At the same time, the X/Open spec fuzzily states some stuff about what an RM is allowed to do.

    In your case, most likely Oracle is counting res1 and res2 as the same XAResource for some reason. This could be due to the patch on IBM's website. Or it could be a subtle misconfiguration of the resources. Or settings on the Oracle side peculiar to XA. Or an outright still as yet unfixed bug in either the Oracle XA drivers, the RDBMS itself, or Websphere.

         -Mike
  26. 2PC + Oracle + WAS 4.x disagreement

    > My question is, whose product has the bug? Does either JTA or XA
    > dictate that the WebSphere TM should initiate all its end()s
    > before its prepare()s, or is Oracle being oversensitive since
    > end and prepare come in order on each resource?

    No, the bug is at WAS. Usually a 1PC optimization would be used here. So either isSameRM() isn't called from WAS, or Oracle doesn't identify the XAResource passed to isSameRM() as its own and hence 2PC takes place. A reason why it doesn't identify the passed XAResource as its own might be that some wrapper is passed here. I have seen this [wrong] behavior from 2 different TMs so far (they pass an XAResourceWrapper).
  27. 2PC + Oracle + WAS 4.x disagreement

    \Andreas Mueller\
    No, the bug is at WAS. Usually a 1PC optimization would be used here. So either isSameRM() isn't called from WAS, or Oracle doesn't identify the XAResource passed to isSameRM() as its own and hence 2PC takes place. A reason why it doesn't identify the passed XAResource as its own might be that some wrapper is passed here. I have seen this [wrong] behavior from 2 different TMs so far (they pass an XAResourceWrapper).
    \Andreas Mueller\

    I don't necessarily agree with your analysis. As the original poster stated, he was accessing with two connections with two different users. That usually results in a 2PC transaction, even though it's the same database. The TM and RM _could_ potentially collaborate, but that's a very murky area of JTA.

    One thing they do say on this subject, in a footnote:

    \JTA Spec\
    The RM may receive additional work on behalf of the same transaction, from different branches. The different branches are related in that they must be completed atomically. Each transaction branch identifier (or XID) that the TM gives the RM identifies both a global transaction and a specific branch. The RM may use this information to optimise its use of shared resources and locks.
    \JTA Spec\

    So if the TM tells you there are two branches, there are two branches.

    It could be that the problem of the original poster was caused by multiple problems. It could be that WAS was misusing isSameRM(), it could be that Oracle misapplies isSameRM() (whose semantics are effectively undefined from what I've seen - as I've said elsewhere, how do you define RM equality?), it could be that Oracle is joining work together without WAS' knowledge.

         -Mike
  28. 2PC + Oracle + WAS 4.x disagreement

    > I don't necessarily agree with your analysis. As the original
    > poster stated, he was accessing with two connections with two
    > different users. That usually results in a 2PC transaction,
    > even though it's the same database.

    Not necessarily ;-). But I missed the different users so it could be, of course, that isSameRM returns a "proper" false. On the other hand, why should Oracle throw an exception if 2 different branches (2 different Xids) are used?

    > So if the TM tells you there are two branches, there are two branches.

    Right, but then it has to end every associated thread before calling the first prepare.

    > isSameRM() (whose semantics are effectively undefined
    > from what I've seen - as I've said elsewhere, how do you
    > define RM equality

    I would say the semantic is defined by the RM itself. Because isSameRM is mostly used to decide towards 1PC optimization, a return of true would require the RM to handle that.
  29. 2PC + Oracle + WAS 4.x disagreement

    \Andreas Mueller\
    Not necessarily ;-). But I missed the different users so it could be, of course, that isSameRM returns a "proper" false. On the other hand, why should Oracle throw an exception if 2 different branches (2 different Xids) are used?
    \Andreas Mueller\

    Well, I'd say it was because someone had a bug :-) Given that the IBM fix posting mentioned a patch to Oracle and WebSphere, it could be that Oracle was doing something a bit funky, and WebSphere is now invoking a private API on the Oracle JDBC driver to force more expected behavior. From what I've seen of open source TMs, there are lots of special checks for specific RMs and special code to deal with their, ah, quirks, so this wouldn't surprise me.

    \Andreas Mueller\
    Right, but then it has to end every associated thread before calling the first prepare.
    \Andreas Mueller\

    I don't think this is necessarily the case. I'd have to study the specs in some more detail, but I don't think there's any requirement that end() has to be called on all branches before prepare() can be called.

    \Andreas Mueller\
    I would say the semantic is defined by the RM itself. Because isSameRM is mostly used to decide towards 1PC optimization, a return of true would require the RM to handle that.
    \Andreas Mueller\

    Hmmm, true. From what I've seen, this check is used at initial enlistment. The TM iterates through all existing RM's on the transaction, calling isSameRM() on each with the new guy, and if it returns true it gives the Xid from the "other" XAResource to the new guy.

          -Mike
  30. 2PC + Oracle + WAS 4.x disagreement

    > Maybe all you gurus can get us out of a bind, or at least help us lay blame on the guilty. We have a setup where WAS transactions have two XA DataSources that ultimately point to different users of the same Oracle 8 database instance. WAS issues the following sequence of commands to Oracle:
    >
    > 1. res1.end()
    > 2. res1.prepare()
    > 3. res2.end()
    > 4. res2.prepare()
    >
    > Oracle gets its panties all up in a bunch (throws an exception) because there are prepares occurring before the ends have happened.

    I don't know, but giving an actual error number would help here.

    In case this happens to me later, I tried to look this up. I found this in the Oracle docs:

    http://download-west.oracle.com/docs/cd/A91202_01/901_doc/java.901/a90211/xadistra.htm#1061004

    It explicitly states that if you use N Oracle connections to the same database then it will return XA_OK from 1 and XA_RDONLY from the other N-1, so the TM can do the one-phase commit optimization.

    This guy actually posted a message on dejanews complaining about this behavior:

    http://groups.google.com/groups?q=isSameRM&hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=3E9D4D52.4080802%40gmx.at&rnum=4

    So if this is actually happening, then it doesn't matter if isSameRM works or not, because Oracle is faking read-only transactions to fool the TM into doing a one-phase commit. So isSameRM doesn't have to return true.

    However, this is all with JDBC driver version 9.0.1. I found a page that states that you can use this driver with older versions of Oracle, so if you are using the 8.0 driver and the functionality is in the driver (?) then you can try upgrading the driver.

    I'd be curious to hear how it works out.
  31. 2PC + Oracle + WAS 4.x disagreement

    Very interesting, and very weird. Thanks for shedding additional light on it.

    From where I sit, the mildest adjective for this sort of behavior would be "evil". Resource Managers really aren't supposed to be doing this sort of thing. At the same time, I wonder how exactly they do it (I don't have a current login to the Oracle site, and so can't peruse the reference you mention).

    For this to work, Oracle would somehow have to detect that multiple connections were in fact being used within the same global XA transaction. Perhaps they're peeking somehow into JTA, but I don't think so. More likely, they're looking at the global tran ID within the Xids, and comparing notes between XAResources (or connections, depending on their implementation), and if the global tran IDs match for multiple connections with active XA transactions going on, then they pull this little trick.

    But there's a problem with this. From the JTA spec:

    \JTA Spec, Page 13 Footnote\
    1. Transaction Branch is defined in the X/Open XA spec [1] as follows: "A global transaction has one or more transaction branches. A branch is a part of the work in support of a global transaction for which the TM and the RM engage in a separate but coordinated transaction commitment protocol. Each of the RM's internal units of work in support of a global transaction is part of exactly one branch. .. After the TM begins the transaction commitment protocol, the RM receives no additional work to do on that transaction branch. The RM may receive additional work on behalf of the same transaction, from different branches. The different branches are related in that they must be completed atomically. Each transaction branch identifier (or XID) that the TM gives the RM identifies both a global transaction and a specific branch. The RM may use this information to optimise its use of shared resources and locks."
    \JTA Spec\

    The footnote applies to this section on resource enlistment:

    \JTA Spec, Page 12\
    If the target transaction already has another XAResource object participating in the transaction, the Transaction Manager invokes the XAResource.isSameRM method to determine if the specified XAResource represents the same resource manager instance. This information allows the TM to group the resource managers who are performing work on behalf of the transaction.

      • If the XAResource object represents a resource manager instance who has seen the global transaction before, the TM groups the newly registered resource together with the previous XAResource object and ensures that the same RM only receives one set of prepare-commit calls for completing the target global transaction.

      • If the XAResource object represents a resource manager who has not previously seen the global transaction, the TM establishes a different transaction branch ID (see the footnote quoted above) and ensures that this new resource manager is informed about the transaction completion with proper prepare-commit calls.
    \JTA Spec\

    The spec goes on to show how isSameRM() is supposed to be used with a pseudo-code snippet (although it doesn't really say what "same instance" means :-():

    \JTA spec isSameRM Pseudo Code\
      public boolean enlistResource(XAResource xares)
      { ..
         // Assuming xid1 is the target transaction and
         // xid1 already has another resource object xaRes1
         // participating in the transaction
         boolean sameRM = xares.isSameRM(xaRes1);
         if (sameRM) {
           //
           // Same underlying resource manager instance,
           // group together with xaRes1 and join the transaction
           //
           xares.start(xid1, TMJOIN);
         } else {
           //
           // This is a different RM instance,
           // make a new transaction branch for xid1
           //
          xid1NewBranch = makeNewBranch(xid1);
          xares.start(xid1NewBranch, TMNOFLAGS);
        }
        ..
      }
    \JTA spec isSameRM Pseudo Code\

    So, when all's said and done, if you want to do an optimization to consolidate resources, you can return true from isSameRM(), and the TM will consolidate for you. But if isSameRM() doesn't return true, then you have separate branches, and the spec says that the RM has to honor those separate resources. It can consolidate locks and such behind the scenes, but it must behave normally from the TM's perspective.

    I'd say Oracle has got it wrong in this instance.

         -Mike
  32. 2PC + Oracle + WAS 4.x disagreement

    > So, when all's said and done, if you want to do an optimization to consolidate resources, you can return true from isSameRM(), and the TM will consolidate for you. But if isSameRM() doesn't return true, then you have separate branches, and the spec says that the RM has to honor those separate resources. It can consolidate locks and such behind the scenes, but it must behave normally from the TM's perspective.
    >
    > I'd say Oracle has got it wrong in this instance.

    I think it's fair to say that :-) They were probably being defensive. Or this is just how their own TM decides to do one-phase ...
  33. XAResource.isSameRM

    Hi,

    As a JTA vendor, we had an email conversation with Sun's JTA spec team to clarify this.

    According to what we got back:

    "isSameRM is true if and only if both resources return the same set of XIDs as the result of calling recover()"

    Many vendor implementations (mainly JMS) have a different behaviour, though.

    Best,
    Guy
    http://www.atomikos.com - mission-critical Java
  34. 2PC + Oracle + WAS 4.x disagreement

    The Oracle optimization only works if you explicitly end all branches with the same formatID and gtrid before preparing any of the branches.

    WebSphere used to perform the end and prepare on each resource in turn, but changed the behavior to drive end on all resources before prepare, to take advantage of this.

    As for the exception, it's an Oracle bug.
  35. 2PC + Oracle + WAS 4.x disagreement

    set "transactionBranchesLooselyCoupled" datasource custom property to true.

    and

    You need Oracle patch #2511780, which in the Oracle 8i release has to be applied on top of 8.1.7.4.
  36. XA Exposed, Part II

    It seems to me that the statement that XA requires 5 disk forces for a transaction that enlists 2 XA resources is not quite right. When an application calls commit() on an XA transaction, the TM needs to call prepare() on both of the XA resources. These calls can be made in parallel since one does not depend on the other, and if each resource manager uses different disks a high degree of parallelism should be achieved.

    After both resource managers finish preparing and vote yes in response to the prepare() call, the TM needs to record that the transaction committed in its log, and force its log to disk. After the TM has recorded the successful commit of the transaction, it can return from the commit() call issued by the application, and the application can continue processing. The call to commit() on each of the XA resources can be performed afterwards (hopefully soon afterwards, since the resource managers need to hold onto locks until the transaction is committed) by the TM. But the commits can be performed in parallel with other application work, and not keep the application waiting for 2 additional resource manager disk forces.

    This works because once the TM records that the transaction has committed, the transaction is going to commit. Heuristic failures are possible, but there is nothing that the application can really do about them. Any heuristic failures should be reported by the TM to responsible individuals who can try to sort out the mess, but there is no need to report them to the application.

    Using this algorithm, the commit cycle for 2PC only requires 1 additional application-blocking disk force, compared to a non-2PC alternative. It does not seem as though most Java app server based TMs implement 2PC this way, but it would make them a fair bit faster if they did.
  37. XA Exposed, Part II

    \Rowland Smith\
    It seems to me that the statement that XA requires 5 disk forces for a transaction that enlists 2 XA resources is not quite right. When an application calls commit() on an XA transaction, the TM needs to call prepare() on both of the XA resources. These calls can be made in parallel since one does not depend on the other, and if each resource manager uses different disks a high degree of parallelism should be achieved.
    \Rowland Smith\

    Well, 5 forces have to happen no matter what. The question then becomes, who pays the cost?

    A TM could indeed issue the prepare() and commit() calls in parallel to all resources. One person from BEA claims that Weblogic does this, but other than that one post I haven't seen that confirmed by anyone else or from any literature. I don't know of any other app server that does it. In practice, in the cases I've observed, right now you pay for 5 forces.

    On resources with different disks...this can increase your throughput, but won't make a big difference from the point of view of the invoker. The invoker is going to still see the same number of forces. Higher parallelism just means more people can force at the same time.

    \Rowland Smith\
    After both resource managers finish preparing and vote yes in response to the prepare() call, the TM needs to record that the transaction committed in its log, and force its log to disk. After the TM has recorded the successful commit of the transaction, it can return from the commit() call issued by the application, and the application can continue processing. The call to commit() on each of the XA resources can be performed afterwards (hopefully soon afterwards, since the resource managers need to hold onto locks until the transaction is committed) by the TM. But the commits can be performed in parallel with other application work, and not keep the application waiting for 2 additional resource manager disk forces.

    This works because once the TM records that the transaction has committed, the transaction is going to commit. Heuristic failures are possible, but there is nothing that the application can really do about them. Any heuristic failures should be reported by the TM to responsible individuals who can try to sort out the mess, but there is no need to report them to the application.
    \Rowland Smith\

    In my opinion this is an incredibly bad idea. Telling an application that a transaction has committed, when you really mean "it will commit eventually", can lead to all sorts of problems. It's trading correctness for speed. Many applications are coded to make certain assumptions about successful commits, most particularly in areas where the application can "know" that it owns the updating of particular tables exclusively (think of a cache). Beyond locking issues, any notion of ordering could be easily shot to hell.

    The first idea does have a lot of merit though, and in fact has been discussed on TSS several times. I just wish more app servers implemented it. If you parallelize prepare() and commit() calls (so that all prepares() are done in parallel, you reap the results, force a Committing... record to the TM log, then do all commits() in parallel and reap again) then you have a cost of 1 TM disk force, the longest resource prepare(), and the longest resource commit(). Best of all, adding in an extra resource doesn't add more disk forces.
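
    For what it's worth, here's a rough Java sketch of that parallelized commit cycle. The CommitLog interface is hypothetical, and real code would turn a prepare() failure into a rollback pass rather than just throwing:

      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.Future;
      import javax.transaction.xa.XAResource;
      import javax.transaction.xa.Xid;

      // Hypothetical log handle: one forced write, one lazy write.
      interface CommitLog {
          void forceCommitting(byte[] gtrid); // THE one blocking disk force in the TM
          void writeDone(byte[] gtrid);       // unforced
      }

      public class ParallelCommit {
          private final ExecutorService pool = Executors.newCachedThreadPool();

          public void commit(byte[] gtrid, List<XAResource> resources, List<Xid> xids,
                             CommitLog log) throws Exception {
              // Phase 1: fire all prepares at once, then reap the votes.
              List<Future<Integer>> votes = new ArrayList<>();
              for (int i = 0; i < resources.size(); i++) {
                  XAResource r = resources.get(i);
                  Xid x = xids.get(i);
                  votes.add(pool.submit(() -> r.prepare(x)));
              }
              for (Future<Integer> v : votes) v.get(); // failure here => roll back instead

              log.forceCommitting(gtrid); // the point of no return

              // Phase 2: fire all commits at once, reap again.
              List<Future<Object>> commits = new ArrayList<>();
              for (int i = 0; i < resources.size(); i++) {
                  XAResource r = resources.get(i);
                  Xid x = xids.get(i);
                  commits.add(pool.submit(() -> { r.commit(x, false); return null; }));
              }
              for (Future<Object> c : commits) c.get();

              log.writeDone(gtrid); // unforced "Done", lazily written
          }
      }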

         -Mike
  38. XA Exposed, Part II

    > A TM could indeed issue the prepare() and commit() calls in parallel to all resources. One person from BEA claims that Weblogic does this, but other than that one post I haven't seen that confirmed by anyone else or from any literature. I don't know of any other app server that does it.

    FYI, the Bluestone Total-e-Server/HP-Application Server did this.

    > The first idea does have a lot of merit though, and in fact has been discussed on TSS several times. I just wish more app servers implemented it. If you parallelize prepare() and commit() calls (so that all prepares() are done in parallel, you reap the results, force a Committing... record to the TM log, then do all commits() in parallel and reap again) then you have a cost of 1 TM disk force, the longest resource prepare(), and the longest resource commit(). Best of all, adding in an extra resource doesn't add more disk forces.

    Again, HP-AS did this for precisely the reasons you mention.

    Mark.
  39. XA Exposed, Part II

    BTW, nice article. Forgot to say that about Part I too, but ditto.

    Cheers,

    Mark.