Does anyone have experience using EJB to execute batch jobs (jobs that need to run for more than 10 minutes)? Does it perform well, or will it use up all the system resources? Also, is EJB only suitable for online jobs rather than batch jobs?
Next, is there any recommendation for a "batch architecture package" available for EJB/Java?
Thanks much in advance!
We developed an EJB/XML architecture for batch updating/synchronizing gift registries about a year ago and put the system into production. We relied on the Session Wraps Entity pattern (using a stateful session bean to manage the batch updates and mediate access to the Entity EJB layer). Our longest batch update lasted about 58 hours (no kidding) and processed about 24K XML transactions an hour. We did not experience any problems related to EJBs.
One consideration for our project centered on finding the right "chunking" size for the incoming XML payload to optimize processing throughput. We finally determined that ~10MB chunks of XML produced the best results for our J2EE system. The deployment hardware was also an important consideration -- our production machine was a quad-CPU Sun 450 with about 2GB of memory.
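To make the chunking idea concrete, here is a minimal sketch (the class name and the chunk size are illustrative assumptions, not the original system's code; a real XML feed would be split on transaction element boundaries rather than raw bytes) of splitting a large payload into fixed-size pieces so each piece can be processed and committed independently:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: split a large payload into chunks of at most
// chunkSize bytes, so each chunk can be processed independently.
class PayloadChunker {

    static List<byte[]> split(byte[] payload, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < payload.length; offset += chunkSize) {
            int len = Math.min(chunkSize, payload.length - offset);
            byte[] chunk = new byte[len];
            System.arraycopy(payload, offset, chunk, 0, len);
            chunks.add(chunk);
        }
        return chunks;
    }
}
```

In practice you would tune the chunk size empirically, as the poster did, since the sweet spot depends on heap size, transaction length, and I/O.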
We discuss the J2EE architecture a bit in slides posted on this site ("Click Hits Brick"). Please feel free to email us with more specific questions if we can be of help (thomasamarrs at home dot com or ayers at zti dot com).
Well, I guess if you can afford to run batches for 58 hours, then you can go ahead and use XML and "Session Wraps Entity" and other cool stuff like that.
Other than that, you can use SQL*Loader from Oracle (or the equivalent utility for another database), C/C++ feeding SQL*Loader, C/C++ with an embedded SQL preprocessor, or as a last resort PL/SQL or equivalent, and get the job done the proper way.
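For reference, a minimal SQL*Loader control file for this kind of bulk load looks something like the following (the file, table, and column names are hypothetical, chosen only to illustrate the shape of the tool):

```
-- hypothetical SQL*Loader control file: bulk-load CSV rows into a staging table
LOAD DATA
INFILE 'transactions.csv'
APPEND
INTO TABLE work_table
FIELDS TERMINATED BY ','
(record_id,
 raw_record,
 status CONSTANT 'L')
```

You would run it with something like `sqlldr userid=scott/tiger control=load.ctl`; the loader bypasses much of the per-statement overhead a Java or EJB layer would add.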
I wouldn't be so proud of 58 hours for 24K "XML transactions".
If you mean by that 24K * 10MB = 240MB on a 4x450MHz Sun with 2GB RAM and probably super-fast SCSI disks, then it is really nothing to be proud of or to recommend to others.
My apologies... I was not very clear in my original post, but point well taken. We processed 10MB chunks about every 3 minutes -- by SCSI standards this might seem slow! :-) One XML transaction translated to about 12 DB transactions on our J2EE system (~488 DB transactions per minute), which still seems slow compared to native systems. Considering, though, that it was synchronizing/transforming two very different gift registry systems in a platform-neutral way, everyone involved was very happy. Average synchronization times for our retailers lasted only about 20 minutes or so. That said, the marathon update I mentioned was a somewhat unique event that really tested the limits of our system: an initial upload for a retailer with over 240 stores. Please see our slides to put some of this in context. The business problem was king on this project.
:-) Make that ~4,800 per minute! But the *real* point, and the one I was trying to elucidate, is that one can have a very successful production system with the J2EE architecture -- one that can handle a "batch update" scenario under stress for long periods of time. Once again, it will always depend on the business requirements and picking the right tools and architecture to meet those goals. The project I mentioned went into production in January of 1999 -- things have only improved since then!
I wasn't trying to be harsh but your initial post presented things in a bad light.
So one has to weigh the complexity and flexibility of data processing against performance and real-time factors.
The former tends to favor Java, though I fail to see why one should use EJBs instead of plain Java with JDBC, which offers the same capabilities without the unnecessary overhead.
But if you're after performance and don't have big hardware that can afford to waste CPU, then Java is not a recommended solution.
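As a sketch of the "plain Java with JDBC" route (the table name, column names, and commit interval here are illustrative assumptions, not from any post in this thread), batching inserts and committing every N rows keeps each transaction, and therefore its locks, short:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Illustrative sketch: bulk-insert rows with JDBC batching, committing
// every `commitEvery` rows so no single transaction grows too large.
class JdbcBulkLoader {

    static int load(Connection con, List<String[]> rows, int commitEvery)
            throws SQLException {
        con.setAutoCommit(false);
        int loaded = 0;
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO work_table (record_id, raw_record) VALUES (?, ?)")) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                loaded++;
                if (loaded % commitEvery == 0) {
                    ps.executeBatch();   // send the accumulated batch to the server
                    con.commit();        // release locks held so far
                }
            }
            ps.executeBatch();           // flush the final partial batch
            con.commit();
        }
        return loaded;
    }

    // Helper: total commits a run of n rows produces with the scheme above
    // (one commit per full batch, plus the final flush).
    static int commitCount(int n, int commitEvery) {
        return n / commitEvery + 1;
    }
}
```

Nothing here needs a container, entity-bean locking, or remote-call overhead, which is the point being made above.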
The "use database tools to load data" option poses some problems also.
1) You cannot tell your server it has exclusive access to the database (so entity beans will flush / read themselves every method call, which can cripple performance.)
2) You have to take the server down while you run the batch job (not very e-age!) :-)
The performance of batch jobs is one of the most crippling factors in getting financial markets from T+3 to T+1 trade settlements and so is a subject of much debate. As often as possible, people want to have no batch jobs at all, everything runs as and when it happens. Good idea, but not always possible in every situation.
I would tentatively suggest that using EJB for batch jobs is "OK" provided it doesn't lock so many entity beans, for so long, that the rest of the system can't get at the data it wants to see. This is more often a problem with servers like WebLogic where the pessimistic concurrency can really hurt (although if you have transactions wrapping these big updates then it's an issue for any concurrency model.)
Just my 2c worth.
we're considering how to reconcile "batch" processing with J2EE on our project also. our primary concern is our DB loading strategy. when a batch job fires off and detects the presence of new transaction data (basically CSV) to be loaded, the data gets loaded via a bulk loader provided by our RDBMS into a "work table". the work table consists of little beyond a record ID, a raw record data column, and a status indicator: Loading, Processing, or Done. a given chunk of transaction data either gets loaded in full or not at all.
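a work table like the one described might look something like this (the names, types, and single-character status codes are guesses for illustration, following the Loading/Processing/Done states above):

```
-- hypothetical DDL for the staging "work table" described above
CREATE TABLE work_table (
    record_id   NUMBER          PRIMARY KEY,
    raw_record  VARCHAR2(4000)  NOT NULL,   -- the unparsed CSV line
    status      CHAR(1)         NOT NULL    -- 'L'oading, 'P'rocessing, 'D'one
                CHECK (status IN ('L', 'P', 'D'))
);
```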
once raw transaction data is loaded into the work table, processing on that raw data can occur (parsing, partitioning into domain tables).
the work table is there to facilitate restarting of processing on the raw data. if processing of these records cannot continue for some reason (problems more serious than, say, ill-formatted data), they remain in the work table for subsequent invocations of processing to try again.
once all the raw transaction data is loaded into the work table, we would like multiple threads to be able to request chunks of some fixed number of raw records to process (validate and store in the database tables representing the domain concepts). it might be tempting to create a session-wraps-entity bean team to hand out chunks such that each requestor gets a set that is disjoint from those other requestors receive, but i'm unsure how that could be guaranteed. would beginning a transaction for that fetch be sufficient to block other requestors from receiving intersecting sets of raw processable data?
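A transaction alone generally won't guarantee disjoint sets under read-committed isolation: two workers can both read the same "unclaimed" rows before either commits. The usual trick (an assumption on my part, reusing the status column and adding a hypothetical `worker_id`) is to *claim* rows with an atomic update before reading them, e.g. in Oracle-flavored SQL `UPDATE work_table SET status = 'P', worker_id = ? WHERE status = 'L' AND ROWNUM <= 100`, so no two workers can ever claim the same row. The same hand-out-disjoint-chunks idea, in a minimal in-memory form:

```java
import java.util.ArrayList;
import java.util.List;

// In-memory sketch of a chunk dispenser: each call atomically claims the
// next `chunkSize` unprocessed record IDs, so concurrent callers always
// receive disjoint sets. `synchronized` plays the role the atomic
// claiming UPDATE plays in the database.
class ChunkDispenser {
    private final int total;     // number of records in the work table
    private final int chunkSize;
    private int next = 0;        // first unclaimed record index

    ChunkDispenser(int total, int chunkSize) {
        this.total = total;
        this.chunkSize = chunkSize;
    }

    synchronized List<Integer> nextChunk() {
        List<Integer> chunk = new ArrayList<>();
        while (next < total && chunk.size() < chunkSize) {
            chunk.add(next++);
        }
        return chunk;            // empty list means nothing left to claim
    }
}
```

With the claim done atomically (whether by `synchronized`, an UPDATE, or `SELECT ... FOR UPDATE`), the disjointness guarantee no longer depends on transaction isolation settings at all.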
it seems like a whipping to do this, though, if for no other reason than you're creating entity beans for raw, unprocessed transaction data, which aren't true domain concepts. they're just to facilitate restartable batches.
so we have a DAO in the client that is responsible for fetching chunks from the work table, and these chunks are then handed to a session-wraps-entity team for parsing/processing into domain tables.
this world of batch processing is fairly new to me, and understanding J2EE's role in it is new also. our strategy so far has been to leave the "batchy" stuff (blocks, bulk loading) on the periphery of J2EE, and let J2EE do more of the "domain-y" stuff (processing/managing domain data, fulfilling business processes). the "batch" processes involve mass data loading, but the mass data loading isn't proper business logic, so we're keeping that concern separate from J2EE. of course, the processing of the raw transaction data will include creating new entity beans... but the processing of such a record to me is a little different from "bulk-loading" it somehow.
any suggestions are most welcome.