Discussions

J2EE patterns: EJB Checksums instead of Version Number Pattern?

  1. I understand the current Version Number pattern to be the following:

    The objective of the Version Number pattern is to detect when data used to update the server is stale. Data read in a previous transaction may have been updated since the read completed. As an example, data may be read and passed back to the client in a data transfer object (DTO, otherwise known as a Value Object). A client may modify the data in the DTO and initiate an attempt to save the data. It is possible that other activity has changed the data that supplied the DTO. This is a problem, since the client's attempt to modify data is based upon a DTO containing stale data. The Version Number pattern addresses detecting and responding to situations like this, where the data is stale. The Version Number pattern uses optimistic concurrency, i.e., the client may assume that its updates are based upon the latest version of the data. If this is not the case, then the client is notified and has the option of retrying or abandoning the update attempt.

    This pattern is particularly well suited for situations in which a use case spans multiple transactions (i.e., there is no transaction isolation). An example would be gathering data in one transaction, the client updating the data, and the server data being updated in a second transaction. Without transaction isolation, there is no way of guaranteeing that the data is not changed (between the two transactions) by other transactions.

    The Version Number pattern may be implemented by adding an integer (version number) to entity beans as a member attribute. A version number also has to be included in tables accessed using the JDBC For Reading pattern. The version number tracks the state of the data at any point in time. The integer is carried around with the data that is to be updated (both to and from the server side), and is incremented when server side data is updated. Prior to an update, the DTO's version is compared with the entity bean's version; a mismatch means stale data has been detected. An example: a client receives a DTO with version number 4, other transactions update the server data underlying the DTO so that the server version number becomes 5, and the original client then modifies its DTO and sends it back to the server for an update transaction. Comparing the version numbers in the DTO and the entity bean will indicate stale data because there is a version number mismatch: the DTO's version number is 4 and the entity bean's current version number is 5. An update is only allowed when the server's current version matches the DTO version number. If there is a mismatch between the compared version numbers, then an exception is thrown indicating stale data. The remedy is for the client to reapply the updates after retrieving the latest data. If the client and server version numbers are equal, then the entity bean is updated and its version number is incremented.

    The implementation of the Version Number pattern varies depending upon whether remote or local interfaces are used. In the former case, the entity bean's setDTO method can compare the DTO's version number with its current version number and invoke the fine-grained setter methods if the DTO is not stale. In the latter case, a session façade must manage the version number comparison. The version number maintains consistency even under the rare circumstance that updates take place at the same time: the second transaction will either fail because of a detected version number mismatch or be rolled back by the application or the database.

    The implementation involves:
    - Adding a version number column to database tables
    - For remote interfaces - adding a small amount of code to the entity bean (setDTO method; see the sketch after this list)
    - For local interfaces - adding a small amount of code to session façades and entity beans.
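
    For the remote-interface case, here is a minimal sketch of the setDTO check (EmployeeDTO and StaleDataException are illustrative names, and getVersion/setVersion/setName stand in for the bean's accessors):

    // Hypothetical DTO carrying the data plus its version number.
    public class EmployeeDTO implements java.io.Serializable {
        public int version;
        public String name;
    }

    // Inside the entity bean: compare versions before applying updates.
    public void setDTO(EmployeeDTO dto) throws StaleDataException {
        if (dto.version != getVersion()) {
            // Stale data detected: the row changed since the DTO was read.
            throw new StaleDataException("DTO version " + dto.version
                + " does not match current version " + getVersion());
        }
        setName(dto.name);            // fine-grained setters
        setVersion(getVersion() + 1); // increment on successful update
    }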

    Here is where the possible checksum pattern starts:

    The scenario described above, where the "get" happens in a different transaction than the "update", is commonly known (where concurrent access is concerned) as a "ghost update" (I guess b/c the first update disappears like a ghost). We ran across this exact problem when we created APIs into our legacy app, but we handled it in a different way than adding a "version number" to all of the database tables.

    The "design" we used could be referred to as checksum calculation. A checksum can really be any datatype based upon the algorithm used to create it, but we used a quadword (64 bits). When a record is read from the database, the entire record (in memory represented from byte 1 to byte n, contiguous) is ran through an MD5 algorithm (java supports these type of algorithms in one of it's libraries) to create this checksum. The MD5 algorithms are designed in a such a way that if a single "bit" was changed in a database record field, then checksum would be different (kind of like how the checksum digit on our credit cards work ... to keep people from forging fake credit card numbers). This checksum is a field that we included in something we called a "handle" (this is a primary key for an entity bean in J2EE speak). This checksum stays with the handle (and therefore with the client). When the client then wants to update the data, the handle is passed back to the update code. The first thing the update code does is calculate the current checksum of the record getting ready to be updated, then it compares that checksum with the checksum in the handle. If they don't match, it means that the record was modified since the client last read it, the transaction is rolled back and then handled appropriately. The MD5 algorithms are super efficient and this process adds no noticeable performance issues b/c the transactions for the calculations have to be made anyways.

    It seems to me that some kind of checksum scheme in the PK of an entity bean (stored in the container itself) would make this stuff work great ... maybe EJB 2.1. Then again, maybe somebody already thought of this and I didn't search the site well enough.

    Threaded Messages (23)

  2. The only problem with this scheme (aside from the minimal added overhead of the MD5 algorithm) is the possibility that two different record values could have the same checksum, in which case the first update would get overwritten. It’s a very remote possibility, but a possibility nonetheless.
  3. Your checksum pattern is a good idea.

    We have actually used this pattern in our Object/Relational Mapping product "ServadoDB®". Based on our design strategy, I want to add some more points:

    1. This pattern is not only suitable for entity EJBs. If we use a normal JavaBean plus O/R mapping, we can implement this by storing the hash information in the JavaBean class.

    2. Sometimes we want to maximize the concurrency of different transactions by reducing the locking level to particular fields instead of whole records. For example, the first transaction loads an employee bean and just wants to update its salary field. The second transaction loads the same bean and updates the name field. After that, the first transaction submits its update. If we store the hash of the whole record, then the first transaction won't pass the checksum check and will be rejected. So our design strategy is to store a mask of the fields which need to be updated, in addition to the checksum of the selected fields' values. In this case, the checksum check only covers the selected fields instead of the whole record.
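
    A minimal sketch of this field-mask idea, reusing the kind of digest helper sketched earlier (field names and the mask layout are illustrative):

    // Checksum only the fields this transaction read and intends
    // to update, recorded alongside a bitmask naming those fields.
    public class FieldChecksum {
        public static final int NAME   = 1 << 0;
        public static final int SALARY = 1 << 1;

        public static long checksum(int fieldMask, String name, double salary)
                throws java.security.NoSuchAlgorithmException {
            StringBuffer buf = new StringBuffer();
            if ((fieldMask & NAME) != 0)   buf.append(name).append('|');
            if ((fieldMask & SALARY) != 0) buf.append(salary);
            return RecordChecksum.checksum(buf.toString().getBytes());
        }
    }

    // Transaction 1 stores checksum(SALARY, name, salary) plus the mask;
    // at update time only the salary field is re-checked, so a concurrent
    // change to the name field alone does not cause a rejection.
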
  4. Jianwu,
    In your second option, I'd think you'd want to flunk the transaction if anything had changed, not just the field that you were modifying. The client would not be aware of the state of the data after the first transaction is committed, and the two updates might not make sense in conjunction.
    Just trying to understand why this is not a problem.
    Thanks,
    Beth
  5. Nice pattern!

    How do you get it to work for entities that have more complex data types as fields, or relations? E.g., what if one of the fields is a large relationship collection of local interfaces to another EJB? How far down the tree do you calculate the digest?

  6. Keeping in mind my previous comment that there is a chance that two different rows can produce the same checksum, you could also achieve this effect by xor-ing all of the objects' hash codes. To get 64 bits out of it, create two 32-bit checksums by xor-ing two disjoint subsets of your fields' hashcodes. Shift the first value left 32 bits and then add the second. Presto, a near-evenly distributed 64-bit checksum.
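
    A sketch of that combination (packing with or-plus-mask rather than addition, to avoid sign-extension surprises):

    // XOR two disjoint subsets of the fields' hashCodes into two
    // 32-bit halves, then pack them into one 64-bit checksum.
    public static long checksum(Object[] firstHalf, Object[] secondHalf) {
        int high = 0;
        int low = 0;
        for (int i = 0; i < firstHalf.length; i++) {
            high ^= firstHalf[i].hashCode();
        }
        for (int i = 0; i < secondHalf.length; i++) {
            low ^= secondHalf[i].hashCode();
        }
        return ((long) high << 32) | (low & 0xFFFFFFFFL);
    }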

    The version pattern is still the only safe way to go out of the two.
  7. We are using a version number pattern on our system. One advantage of using a database field is that the check can be performed in the database, which should be much faster than doing an algorithmic check. For example, if we have an update, we call a procedure with the code:

    update my_table
    set
      [values]
    where
      my_table.id = @my_id
      and my_table.version = @version

    Then we check to see the rows affected. If no rows were affected, then there was no match on the join, so the version numbers were different (we actually return a parameter indicating that there was a version number conflict). I haven't done tests but I believe that using criteria like that in a join should be a lot faster than an algorithm in application code. This method also prevents you from having to do a lookup of the entire existing record to calculate a current checksum value, or from even having to look up just the current version number.

    Note that I don't think this would work in all cases; if you're not using BMP, I don't know how you could do something similar.
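
    A BMP-style sketch of the same rows-affected check in plain JDBC, rather than a stored procedure (table and column names are illustrative):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Returns false when no row matched the id + version pair,
    // i.e. another transaction has already changed the record.
    public boolean updateWithVersionCheck(Connection con, long id,
            String name, int expectedVersion) throws SQLException {
        PreparedStatement ps = con.prepareStatement(
            "UPDATE my_table SET name = ?, version = version + 1 "
            + "WHERE id = ? AND version = ?");
        try {
            ps.setString(1, name);
            ps.setLong(2, id);
            ps.setInt(3, expectedVersion);
            return ps.executeUpdate() == 1; // 0 rows = version conflict
        } finally {
            ps.close();
        }
    }
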
  8. In a CMP case (or even in a BMP case) we use the version numbering approach as follows:

    In case of updating a bean, instead of a findByPrimaryKey method, we use a findByPrimaryKeyForUpdate method, which takes two parameters: the primary key and the version. If the record with these given values cannot be found, it means that the record has changed! With this approach we can do more: in the finder method, we can lock the record for update in the DD (deployment descriptor), e.g.

    where ID = :arg1 and VERSION = :arg2 for update

    Hope this helps...
  9. REF: CMP update / "where ID = :arg1 and VERSION = :arg2 for update"

    Sorry,

    but how can you call setRollbackOnly() on the transaction
    when the UPDATE returns zero records updated?

    I suppose that you must hand-write the ejbStore method.

    TIA
    Roberto Colmegna
  10. The odds of creating the same 64-bit checksum from two different rows in a database table are vanishingly small ... I wouldn't let this deter you from using such a pattern.

    I'm not a DBA, but from past experience I know that database "joins" on two or more tables are pretty "expensive" and require significant disk I/Os. If the entire database record(s) were already in memory, then a checksum calculation should be faster.

    For the argument that such a pattern requires the whole row to be read into memory ... I recognize this as a potential downfall ... this is what we had to do ... there are definitely many ways to create checksums.

    I wanted to bring this topic to light because currently entity beans cannot inherently prevent "ghost updates" for a "use case" (I hate using that phrase) that spans more than one transaction without some kind of applied pattern. I think that such an option should be included in any new EJB spec (deployment descriptor option), and a checksum calculation is the way to go.
  11. Assuming the checksum algorithm is good (even distribution throughout the entire checksum space), I 'think' the probability that an update will have the same checksum as the previous one is 1/(2^64), which is an incredibly small number.

    The probability that the value will randomly change on its own on the hard drive (and not be detected by the much weaker HD checksum methods) is much much higher.

    All operations on the computer, over a network, etc. are subject to random errors (due to electrical noise, quantum 'bad luck', etc.), and checksums and error correction codes are used to reduce these to an 'acceptably' small level; however, the 'risks' are still there. We forget about these risks in day-to-day life as we unconsciously think of computers as infallible calculating devices, but the truth is there is always a possibility that a random error will just spontaneously show up ;-)
  12. I'm not a DBA either, but I believe that Drew is talking about adding the version number as a condition, which is a lot different than doing a join. And since, when you use the version number pattern, you need to check the value in the DB anyway, I think looking at the rows-affected value while updating is a smart way to go.
  13. Sco,
    The conditional update using AND is not a join (only one table is involved). The database looks up the row by the index on ID and then merely checks that the VERSION value matches ... it can't get any faster.
  14. Hi guys,

    FYI - people using Weblogic 7 have automatic checking for stale data.

    checkout..

    Ismail
  15. I think what WebLogic 7 offers is again preventing stale updates within a transaction. For example, say you start a transaction, ejbLoad is performed, and then you perform an update. If the data is changed before the transaction is committed, WebLogic would detect it.
    Here, I think, we are talking about the lost updates problem across different transactions:
    Person A fetches (in a different transaction)
    Person B fetches (in a different transaction)
    Person B updates before Person A does (in a different transaction)
    Person A, when it goes to update, should get an error (in a different transaction)

    Regards
    Mohit
  16. Mohit, as far as my understanding of EJB goes, the EJB container takes care of transaction management, and we can configure our EJBs based on our requirements. I do not understand the need for managing this ourselves.

  17. Alka,

    The EJB container does manage concurrent access within the transaction boundaries, but you do not want to let your transaction span user interaction. The user may never complete the interaction.

    Therefore, you end up with a situation where the user fetches data in one transaction and, later, updates it in another transaction. You want to verify that the data the user fetched in the first transaction is still valid prior to the update. If some other user has changed the data in between, you could be overwriting their changes without knowing it (changing the same field). Or, you could be making changes incompatible with their changes (changing a different field).
  18. The approach is fine, but the doubt I have is: if you're not storing the checksum value in the database, then if the same database is accessed by some other legacy application and that application modifies the data, how is consistency going to be maintained? And if this checksum value is going to be stored in the DB, then how is it different from the Version Number pattern?
    The same trick is at work in that one too.
  19. The value in the checksum is that it is not stored with the data. It is calculated when needed.

    With that said, it will be much faster to do a single update that either succeeds, or returns "no rows changed" if the version number does not match, than to re-read the current data from the DB, generate a checksum, compare, and then update if the checksum matches!

    As others have noted, there is no join involved and no additional read when you do the update like:

    update my_table
    set [values, version = @version + 1]
    where id = @my_id AND version = @version
  20. Seems like what you want is a 2-phase commit. All you're talking about is some convention for describing what you're committing.
  21. A two-phase commit (2PC) is not going to solve the underlying problem: the reason this pattern (or Version Number) becomes necessary is that you want to keep database transactions short, i.e., normally you wouldn't want them to include user think time. 2PC is all about coordinating multiple transactions.

    So instead of using database transactions (and therefore pessimistic concurrency control), you want to use some form of optimistic concurrency control that by design does not keep a database transaction open during the full business transaction.

    Details on this can be found in the Version Number pattern in Floyd's book or in the sections dealing with optimistic concurrency control at http://www.martinfowler.com/isa

    Oliver

  22. Context:
    - an EJB entity bean with data of N bytes

    Hypothesis:
    - MD5 checksum to solve ghost UPDATE problem

    Analysis:
    - the MD5 algorithm must execute at least N
      operations to compute the hash key (i.e., O(N)
      computational load)
    - the MD5 algorithm is implemented via a Java class
      (interpreted bytecode -> slow)
    - there is a (very small) risk of hash collision
    - clustered app servers create a concurrency
      risk due to multiple distributed EntityBean copies
      and the app server cache

    Counter Analysis:
    - DB servers have a lot of high-speed string
      comparison C/ASM routines

    Conclusion:
    - build an UPDATE which compares all fields as
      conditions, i.e.:
      UPDATE mytable SET ... WHERE id=10 AND info1=? AND
      info2=? AND ...

    Problem:
    - in CMP persistence, how can you call setRollbackOnly()
      when the UPDATE fails (zero records updated)?

    Solution:
    - a hand-written ejbStore method (a sketch follows).
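
    A sketch of such a hand-written ejbStore (BMP-style; connection, ctx, id, and the old/new field values are assumed to be held by the bean, and names are illustrative):

    public void ejbStore() {
        try {
            java.sql.PreparedStatement ps = connection.prepareStatement(
                "UPDATE mytable SET info1 = ? "
                + "WHERE id = ? AND info1 = ?"); // old value as condition
            ps.setString(1, newInfo1);
            ps.setLong(2, id);
            ps.setString(3, oldInfo1);
            int rows = ps.executeUpdate();
            ps.close();
            if (rows == 0) {
                // Ghost update detected: force the transaction to abort.
                ctx.setRollbackOnly(); // ctx is the javax.ejb.EntityContext
            }
        } catch (java.sql.SQLException e) {
            throw new javax.ejb.EJBException(e);
        }
    }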

    Thanks for every comment (... and excuse me
    for my bad english!!! ;) ).

    Bye
    Roberto Colmegna


  23. how about using timestamps??

    Hi everybody,

    IMHO this discussion is really important. There can be many issues and solutions to it.
    First up, it depends upon the granularity, i.e., which fields need to be checked when updating. The application need not check other fields which are not critical. In this case a check will have to be made for the particular column as to when it was updated, but this will require additional columns for the check.
    Alternatively, a single timestamp column can be maintained for the entire row. Just before an update the timestamp values can be compared, and accordingly further action can be taken.

    For the first scenario:
    we can have 2 additional columns to check for a column update:
    column_id - stores the id of the column to be checked
    version_no - the last update number for this column

    For the second scenario:
    just maintain a version number.

    Before updating, perform a find on the PK, fetch col_id and ver_no (case 2), and match them with the existing ones ... thus we can find out if the column was updated by another client.

    The ver_no can be numeric or a Timestamp.
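
    A plain JDBC sketch of the timestamp variant of that pre-update check (names are illustrative; the version_no variant is identical with an int comparison):

    // Before updating, re-read the row's timestamp and compare it with
    // the one fetched earlier; proceed only if they still match.
    public boolean isStale(java.sql.Connection con, long id,
            java.sql.Timestamp fetchedAt) throws java.sql.SQLException {
        java.sql.PreparedStatement ps = con.prepareStatement(
            "SELECT last_updated FROM my_table WHERE id = ?");
        try {
            ps.setLong(1, id);
            java.sql.ResultSet rs = ps.executeQuery();
            if (!rs.next()) {
                return true; // row deleted concurrently: also stale
            }
            return !rs.getTimestamp(1).equals(fetchedAt);
        } finally {
            ps.close();
        }
    }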

    WebLogic does not explicitly take care of such things. It just checks whether the entity bean's data was modified before commit. It will not take care of the issue when 2 people read the data and both modify it; it will persist the last update.

    I am awaiting your comments on this.

    regards
    Jani Prashant (janiprashant at yahoo dot com)
  24. how about using timestamps??

    Timestamps: great idea, with one flaw: timestamp formats differ across databases,
    so if you are building an application specific to one database
    there would be problems moving to another database


    i.e. in Oracle the format is '2002-07-14 11:07:00.123456'
    and in, say, DB2 it's '1994-01-01-10.02.30.006123'

    in DB2 the date in the format YYYY-MM-DD is combined using a hyphen ("-") with the time component in the format hh.mm.ss.zzzzzz. A subtle but unfortunate difference.


    but this could be worked around, and if you never have any intention of moving to another database, why not use the timestamp pattern.