Article: Spring Batch Overview


  1. Article: Spring Batch Overview (43 messages)

    Spring Batch is a module for Spring that brings Spring's dependency injection feature to batch-processing environments through tasklets, modeling the functionality that COBOL has brought to high-volume computing for decades.
    Despite the growing momentum behind SOA and real-time integration, many interfaces are still flat-file based and are therefore best processed in batch mode. Nevertheless, there is no industry-standard, or even de facto standard, approach to Java-based batch architectures. Although articles on Java batch architectures appear occasionally, they are few and far between. Batch processing seems to be a critical, missing architectural style and capability in the marketplace. Also consider that:
    • Despite the growth of SOA, there is still demand for a high volume batch architecture that automates complex processing of large volumes of data and/or transactions most efficiently processed without user interaction.
    • Batch jobs are part of most IT projects, and there is currently no commercial or open source Java framework that provides a robust, enterprise-scale solution.
    • Lack of a standard architecture has resulted in the proliferation of expensive one-off, in-house custom architectures.
    • Batch processing is used to process billions of transactions every day within mission-critical enterprise applications.
    Despite the lack of standard architectures for batch, there are decades' worth of experience in building high-performance batch solutions. Spring Batch was started with the idea and focus of consolidating this thinking into an open source project that can be offered as, hopefully, a de facto standard within the Java community. The same architectural principles that have made Spring so popular are as enriching to batch architectures as they are to online, SOA, and any other Java architectures.
    Read Spring Batch Overview.

    Threaded Messages (43)

  2. Re: Article: Spring Batch Overview[ Go to top ]

    Batch jobs are part of most IT projects and there is currently no commercial or open source java framework that provides a robust, enterprise-scale solution/framework.
    I'm not going to try to hijack this thread, but this statement about there being no Java framework for batch jobs is just not true. :) How long have Quartz and Flux been around? Quartz since around 2001 and Flux since 2000. Obviously, Spring Batch != Quartz != Flux and they all are different in different ways, but they do all certainly address the topic of batch processing, jobs, and job scheduling. Anyway, back to the Spring Batch discussion.

    Cheers,
    David

    Flux - Firing Java jobs since 2000 :)
  3. Re: Article: Spring Batch Overview[ Go to top ]

    Batch jobs are part of most IT projects and there is currently no commercial or open source java framework that provides a robust, enterprise-scale solution/framework.


    I'm not going to try to hijack this thread, but this statement about there being no Java framework for batch jobs is just not true. :)

    How long have Quartz and Flux been around? Quartz since around 2001 and Flux since 2000.

    Obviously, Spring Batch != Quartz != Flux and they all are different in different ways, but they do all certainly address the topic of batch processing, jobs, and job scheduling.

    Anyway, back to the Spring Batch discussion.

    Cheers,
    David

    Flux - Firing Java jobs since 2000 :)
    Yes, it would be nice if the author or someone knowledgeable could give a take on how these three frameworks are the same or different, and in what contexts each makes better sense. Thank you, BR, ~A
  4. Quartz != Spring Batch[ Go to top ]

    I found the following by clicking the link to the Spring Batch information at the top of this page.
    How does Spring Batch differ from Quartz? Is there a place for them both in a solution? Spring Batch and Quartz have different goals: Spring Batch provides functionality for processing large volumes of data, while Quartz provides functionality for scheduling tasks. So Quartz could complement Spring Batch; they are not mutually exclusive technologies. A common combination would be to use Quartz as a trigger for a Spring Batch job, using a cron expression and the Spring Core convenience class SchedulerFactoryBean.
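    The division of labor described above (the scheduler fires the trigger, the batch job does the processing) can be sketched with plain JDK classes. The Quartz and Spring Batch APIs themselves are omitted; all class and method names below are invented for illustration, with `ScheduledExecutorService` standing in for the scheduler.

```java
import java.util.concurrent.*;

// A stand-in for the scheduler/job split: the scheduler decides only *when*
// to fire; the batch job owns *what* is processed and how.
public class SchedulerLauncherSketch {
    // Stand-in for the batch job: processes a fixed set of records.
    static int runNightlyJob() {
        int processed = 0;
        for (int record = 0; record < 100; record++) {
            processed++; // real business logic would go here
        }
        return processed;
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        Callable<Integer> job = SchedulerLauncherSketch::runNightlyJob;
        // Quartz would use a cron expression here; a 10 ms delay stands in for it.
        ScheduledFuture<Integer> trigger = scheduler.schedule(job, 10, TimeUnit.MILLISECONDS);
        System.out.println("records processed: " + trigger.get()); // records processed: 100
        scheduler.shutdown();
    }
}
```

    The point of the separation is that neither side needs to know the other's internals: swapping the 10 ms delay for a cron trigger changes nothing in the job.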
  5. these are not batch systems[ Go to top ]

    Neither Quartz nor Flux is a full-fledged batch system. They have many provisions for parallelism and multithreading, but Spring Batch is one framework that caters to the complete requirements of an enterprise batch framework.
  6. Re: these are not batch systems[ Go to top ]

    Neither Quartz nor Flux is a full-fledged batch system
    Care to elaborate? I haven't seen anything that Spring Batch can do that Flux can't. Peace, David Flux - Java Job Scheduler. File Transfer. Workflow. BPM.
  7. Re: these are not batch systems[ Go to top ]

    Neither Quartz nor Flux is a full-fledged batch system


    Care to elaborate? I haven't seen anything that Spring Batch can do that Flux can't.

    Peace,
    David
    One of them is free and open source, and the other isn't.
  8. Re: these are not batch systems[ Go to top ]

    I haven't seen anything that Spring Batch can do that Flux can't.
    David, I haven't seen anything that Spring Batch can do that any modern language can't do. The difference is just a matter of time... Spring Batch is simply aimed at batch execution. It's not a process engine, it's not a scheduler, it's not BPM. It's just a batch execution engine. As simple as that. What is batch execution? Having worked a little with z/OS and JCL, Spring Batch seems like JCL written the Java (or rather Spring) way. See http://en.wikipedia.org/wiki/Job_Control_Language for more info. A job scheduler like Flux is not competition for Spring Batch; they can coexist, just doing different things.
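    The JCL analogy can be made concrete: a batch job is an ordered list of named steps, and a failing step stops the job, much like JCL condition-code handling. A minimal plain-Java sketch (not the actual Spring Batch API; the step names are invented):

```java
import java.util.*;
import java.util.function.Supplier;

// A JCL-style job: named steps run in order; a failing step stops the job,
// so later steps are skipped, mirroring condition-code handling in miniature.
public class JobRunner {
    public static List<String> run(LinkedHashMap<String, Supplier<Boolean>> steps) {
        List<String> completed = new ArrayList<>();
        for (Map.Entry<String, Supplier<Boolean>> step : steps.entrySet()) {
            if (!step.getValue().get()) {
                break; // "abend": do not run subsequent steps
            }
            completed.add(step.getKey());
        }
        return completed;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Supplier<Boolean>> steps = new LinkedHashMap<>();
        steps.put("extract", () -> true);
        steps.put("transform", () -> true);
        steps.put("load", () -> false);   // simulated failure
        steps.put("report", () -> true);  // never reached
        System.out.println(run(steps));   // [extract, transform]
    }
}
```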
  9. Re: these are not batch systems[ Go to top ]

    Hi Artur,
    I haven't seen anything that Spring Batch can do... that any modern language can't do. The difference is just a matter of time...
    Yes, you're right, of course. :) Most everything in the Java universe gets started with a Java compiler. I once worked on a project where the customer wanted to build a very flexible system. Facetiously, we developers would sometimes say that instead of us writing the app, we should just give the end user 'javac'. :) Tariq wrote:
    Neither Quartz nor Flux is a full-fledged batch system.
    This was an overly broad, simply incorrect statement that I just couldn't let pass without comment. Both Quartz and Flux are used at Manhattan and London financial institutions today as batch engines. To say that neither is a full-fledged batch system is wrong.
    Spring Batch is simply... aimed at batch execution....z/OS and JCL...
    I wholeheartedly agree with you that Spring Batch takes its terminology and structure from the traditional, mainframe job scheduling world. Like you said, Spring Batch and Flux do not compete head-to-head. My only point was to state that Quartz and Flux are indeed full-fledged batch systems. OK -- back to the Spring Batch discussion. Has anyone played with it? What did you think?
  10. Re: these are not batch systems[ Go to top ]

    Has anyone played with it? What did you think?
    Not me, so far :). I'm extremely interested in such functionality in the Java world. Unfortunately, Spring Batch is kind of a ghost project within the Spring project. I know some really big companies are behind this, the Spring umbrella, etc. I expected ready, high-quality docs here and an open development process (by open I mean a healthy community, some milestones, clear release cycles) - you know, all that stuff Spring is famous for. Not in this case. When you visit http://www.springframework.org/spring-batch the newest info there is from April 2007. Virtually no resources, no downloads, no documentation. Quite a strange thing in the Spring Portfolio's case... Once there is documentation and official downloads I will play with Spring Batch...
  11. Re: these are not batch systems[ Go to top ]

    When you visit http://www.springframework.org/spring-batch the newest info there is from April 2007. Virtually no resources, no downloads, no documentation. Quite a strange thing in the Spring Portfolio's case... Once there is documentation and official downloads I will play with Spring Batch...
    There IS plenty of material on the site; you should go to http://static.springframework.org/spring-batch/. Maybe the Spring guys should be notified about this.
  12. Is this a Java SPAM/Fanboy site or what?
  13. If you actually *read* the article, you see that its authors are David Syer and Lucas Ward. Anything fact-based that you'd like to contribute to the discussion based on what is presented in the article?
  14. Is this a Java SPAM/Fanboy site or what?
    As site editor, I think he's entitled to post what he likes.
  15. Re: Article: Spring Batch Overview[ Go to top ]

    Batch jobs are part of most IT projects and there is currently no commercial or open source java framework that provides a robust, enterprise-scale solution/framework.


    I'm not going to try to hijack this thread, but this statement about there being no Java framework for batch jobs is just not true. :)

    How long have Quartz and Flux been around? Quartz since around 2001 and Flux since 2000.

    Obviously, Spring Batch != Quartz != Flux and they all are different in different ways, but they do all certainly address the topic of batch processing, jobs, and job scheduling.

    Anyway, back to the Spring Batch discussion.

    Cheers,
    David

    Flux - Firing Java jobs since 2000 :)
    Actually, just reading the Spring Batch roadmap, I think a comparison between Spring Batch and Hadoop would be more useful. Kit
  16. Re: Article: Spring Batch Overview[ Go to top ]

    Not to hijack the thread, but if you want the ability to parse and write fixed-length record data in Java using the COBOL copybook (cpy) format, with automatic conversion to standard Java types, the open-source library cb2java allows this without code generation, intermediate formats such as XML, or redundantly redeclaring the layout of the file. The library also lets you use just the copybook layout for building other tools, such as code generators.
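    cb2java's actual API is not shown in the post, but the underlying idea of fixed-length record parsing can be sketched generically in plain Java: each field is a (name, width) pair, and the record is split by character offset. The field names and widths below are hypothetical examples, not any real copybook.

```java
import java.util.*;

// Generic fixed-length record parsing of the kind a COBOL copybook describes:
// each field is a (name, width) pair; the record is split by character offset.
public class FixedLengthRecord {
    public static Map<String, String> parse(String line, LinkedHashMap<String, Integer> layout) {
        Map<String, String> fields = new LinkedHashMap<>();
        int offset = 0;
        for (Map.Entry<String, Integer> field : layout.entrySet()) {
            int end = offset + field.getValue();
            fields.put(field.getKey(), line.substring(offset, end).trim());
            offset = end; // next field starts where this one ended
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical layout: CUST-ID PIC X(5), CUST-NAME PIC X(10), BALANCE PIC X(8).
        LinkedHashMap<String, Integer> layout = new LinkedHashMap<>();
        layout.put("CUST-ID", 5);
        layout.put("CUST-NAME", 10);
        layout.put("BALANCE", 8);
        System.out.println(parse("00042SMITH     00123.45", layout));
    }
}
```

    A real copybook-driven library would additionally handle COMP/packed-decimal fields and type conversion; this sketch only covers the character-offset splitting.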
  17. credibility++[ Go to top ]

    This is a pretty good example of why I'm ready to trust my career to the Interface21 guys. There could not be anything less sexy than batch processing, yet these guys still do it, because it's needed in real life. At the same time the JBoss team is trying to sell us 'dynamic dependency bijection' (as per Seam) and loads of other sexy hype words. I have to give the Interface21 team a +1 here for once again thinking straight and concentrating on real-world needs. /Henkka Karapuu
  18. Batch jobs are part of most IT projects and there is currently no commercial or open source java framework that provides a robust, enterprise-scale solution/framework.
    Maybe you need to take a look at Compute Grid by IBM: http://www.ibm.com/developerworks/websphere/techjournal/0707_antani/0707_antani.html http://www.ibm.com/developerworks/grid/library/gr-appprogdisprtxd/index.html Ciao, Diego
  19. Spring Batch: the prequel[ Go to top ]

    Before it was Spring Batch, it was called "something else". And I had the, ahem, "privilege" of using that "something else". And that was possibly the worst batch framework that ever existed. And that "something else" was rewritten over and over and could never be used in production without a lot of trouble. That "something else" is now called Spring Batch. Good luck, batchers.
  20. Re: Spring Batch: the prequel[ Go to top ]


    And that was possibly the worst batch framework that ever existed.
    I am currently evaluating this framework. As far as I can see, the ideas behind it are good, and I do not see any problems with its design. Its boundaries (what it delivers and what it does not) are fairly clear, and it integrates well with well-known concepts (tx management, task executors, etc.). Documentation is still poor, but this is common before a 1.0 release; I appreciate the idea of sharing the project and its goals before it is finally released, so that people can participate and share their needs before it's finalized. Can you tell us about the problems you encountered while using it that led you to this point of view?
  21. The batch playing field...[ Go to top ]

    To have a meaningful discussion about the technologies within the batch domain, I think that we need to clearly lay out the layers of batch processing. There are four of them: schedulers, batch execution environments, batch application containers, and the actual batch applications.
    1. Schedulers manage job dependencies, resource dependencies, scheduled submissions, and some form of job lifecycle and execution management. Quartz, Flux, and other such open source schedulers provide time-based scheduling and some form of dependency management. Tivoli Workload Scheduler, Control-M, Zeke, and other schedulers, however, provide more scheduling features and are the products typically found at bigger customer shops. These shops have built complete batch infrastructures around the scheduler, including security models, auditing mechanisms, archiving, and so on.
    2. Batch Execution Environments (BEEs) host batch application containers and provide features like transaction management, checkpointing, recoverability, security management, connection management, scalability, high availability, output processing, and so on; the inherent qualities of service and integration with existing schedulers are provided by the BEE. XD Compute Grid delivers a BEE.
    3. Batch Application Containers provide a well-formed invocation model for the business logic. The container manages the lifecycle of the application and gives control to the underlying transaction manager, security manager, etc. as needed. XD Compute Grid delivers a batch application container. I would argue that Spring Batch is a batch application container too. Spring Batch doesn't provide a transaction manager, security manager, explicit high availability, and so on, but it does allow them to be injected into the container and therefore be available to the application.
    4. Batch Applications implement the actual business logic and run within a batch application container.
    Nothing special to discuss right now; perhaps portability among containers in the future. Thanks, Snehal Antani
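    The four layers above can be expressed as a set of plain interfaces to make the separation of concerns explicit. All names here are invented for illustration and match no particular product's API.

```java
// The four batch layers expressed as plain interfaces, top to bottom.
// All names are illustrative; no product's actual API is shown.
public class BatchLayers {
    // 1. Scheduler: decides *when* a job runs and tracks dependencies.
    interface Scheduler { void submit(String jobName, Runnable launch); }

    // 2. Batch Execution Environment: hosts containers and supplies qualities
    //    of service (transactions, checkpointing, recovery, security).
    interface ExecutionEnvironment { void execute(ApplicationContainer container); }

    // 3. Batch Application Container: owns the invocation model and lifecycle
    //    of the business logic.
    interface ApplicationContainer { void runJob(BatchApplication app); }

    // 4. Batch Application: the business logic itself.
    interface BatchApplication { void processRecord(String record); }

    // Wire the layers together for one record and return a trace of the call chain.
    static String demo() {
        StringBuilder trace = new StringBuilder();
        BatchApplication app = record -> trace.append("processed:").append(record);
        ApplicationContainer container = a -> a.processRecord("r1");
        ExecutionEnvironment bee = c -> c.runJob(app);
        Scheduler scheduler = (name, launch) -> launch.run(); // fires immediately
        scheduler.submit("nightly-billing", () -> bee.execute(container));
        return trace.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // processed:r1
    }
}
```

    Each layer only talks to the one below it, which is exactly why a scheduler, a BEE, and a container from different vendors can in principle be combined.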
  22. Re: The batch playing field...[ Go to top ]

    2. Batch Execution Environments (BEE); they host batch application containers, and provide features like: transaction management, checkpointing, recoverability, security management, connection management, scalability, high availability, output processing, and so on; the inherent qualities of service and integration with existing schedulers are provided by the BEE. XD Compute Grid Delivers a BEE.

    3. Batch Application Containers provide a well-formed invocation model for the business logic. The container manages the lifecycle of the application and gives control to the underlying transaction manager, security manager, etc as needed. XD Compute delivers a batch application container. I would argue that Spring Batch is a Batch application container too. Spring Batch doesn’t provide a transaction manager, security manager, explicit high availability, and so on, but it does allow them to be injected into the container and therefore available to the application.
    What in the above isn't already available from any number of sources in Java? The above features seem to me to be what you need to make a group of COBOL programs behave as a proper enterprise-quality application. Why do we need to re-implement this architecture in Java?

    Assuming you are responding to my question, I think it's not really helpful to start discussing these features. Correct me if I am wrong, but there is nothing in the above that requires running applications in batches. The real distinction is how the application is executed: in scheduled batch runs or in an event-driven fashion. My personal experience so far has been that batch is used mainly because of a lack of understanding of the event-based model. I'd like to understand what valid reasons there are to use batch other than efficiency. Obviously, if no processing can be done until all the records are available, the batch model is the most natural one, but I don't see this as a special architecture.
  23. Re: the batch playing field[ Go to top ]

    Hi James, sorry, I wasn't responding directly to your question but rather trying to clear up the discussions that took place earlier in the thread. To your points: many customers I work with have business reasons for migrating from COBOL to Java. It depends on a variety of factors, including the underlying runtime that the batch jobs are executing on. For example, the emergence of specialty processors on System z (zAAP, zIIP, IFL) has influenced the strategic direction of middleware environments. zAAP processors, for certain customers, provide a clear business reason for moving to Java. For other customers, it may be the lack of COBOL programmers in the marketplace. Then again, I know of customers that are not only thrilled with their existing COBOL-based batch but continue to invest in building it up. It all depends on a variety of factors.

    Event-based batch processing, I'm sure, has a place in the batch world; however, I don't think it would be accurate to say that it is an end-all architecture for batch. For example, it is well known that JDBC batching can dramatically improve performance. If performance is a key requirement for a customer's batch infrastructure, how does one apply JDBC batching, maintain transactional integrity, and avoid problems with timeouts if they've applied an event-based model? Furthermore, how would one manage an event-based batch model: start, stop, and cancel batch jobs? Finally, using an event-based model tends to imply that each event will trigger a global transaction, which ensures data integrity. The overhead of a global transaction would be significant when processing thousands or millions of records.

    The initial questions posted in this thread related to the positioning and role of Spring Batch among existing batch technologies. I'm not sure a discussion of a batch "reference architecture" is one that could be carried out in this type of forum. Thanks, Snehal
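    The cost argument about transaction boundaries can be illustrated with a toy commit-interval simulation: committing once per chunk of records, rather than once per record, reduces the number of (expensive) transaction boundaries by the chunk size. The commit here is just a counter standing in for a real transaction manager; no JDBC or transaction API is used.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Chunk-oriented processing: one commit per chunk instead of one per record.
// The commit counter stands in for a real (expensive) transaction boundary.
public class CommitInterval {
    public static int commitsFor(List<String> records, int chunkSize) {
        int commits = 0;
        int inChunk = 0;
        for (String record : records) {
            inChunk++; // process the record (business logic elided)
            if (inChunk == chunkSize) {
                commits++;   // commit the whole chunk at once
                inChunk = 0;
            }
        }
        if (inChunk > 0) commits++; // commit any trailing partial chunk
        return commits;
    }

    public static void main(String[] args) {
        List<String> records = IntStream.range(0, 1000)
                .mapToObj(i -> "rec-" + i).collect(Collectors.toList());
        System.out.println("per-record commits: " + commitsFor(records, 1));   // 1000
        System.out.println("chunked commits:    " + commitsFor(records, 100)); // 10
    }
}
```

    The trade-off, as the post notes, is that a larger chunk holds locks longer, so the chunk size is a tuning knob between throughput and concurrency with OLTP.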
  24. Re: the batch playing field[ Go to top ]

    Event-based batch processing I'm sure has a place in the batch world, however I don't think it would be accurate to say that it is an end-all architecture for batch. For example, it is well known that JDBC batching can dramatically improve performance. If performance is a key requirement for a customers batch infrastructure, how does one apply JDBC batching, maintain transactional integrity, and avoid problems with timeouts if they've applied an event-based model? Furthermore, how would one manage an event-based batch model: start, stop, cancel batch jobs.
    We run OS/400 for a number of our core systems, and the operating system not only supports Java, but when you run local Java applications on OS/400 you can use a native JDBC driver that connects directly to the DB without the network overhead, and use other features of the platform natively, such as XA. In this environment we can use message queues to move transactions to the data. When the queue receives messages, it can start a Java application that will attempt to drain the queue, waiting a specified amount of time for more messages. It can then batch these messages together or just use cached PreparedStatements, since the native JDBC driver is very fast compared to network-based JDBC connections. The benefit of the latter is that one failure doesn't spoil the whole batch. Stopping and starting is not difficult to accommodate, as the queue will retain the transactions until they are committed by the retrieving application. What you get is a design that batches more under larger loads and less when loads are small, and it can be set at a low priority to avoid interfering with more immediate needs. But when there are free resources, it will process transactions instead of remaining idle until an arbitrary amount of time has passed. We can get all of this without any framework, EJB, scheduler, etc. Just plain old Java.
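    The drain-and-batch pattern described above can be sketched with java.util.concurrent: the consumer blocks briefly for the first message, then sweeps up whatever else has already arrived into a single batch, so batch size naturally grows with load. The class and method names are invented for illustration.

```java
import java.util.*;
import java.util.concurrent.*;

// Drain-and-batch: wait briefly for a message, then process whatever has
// accumulated as one batch, so batch size grows with load and shrinks when idle.
public class QueueDrainer {
    public static List<String> drainBatch(BlockingQueue<String> queue,
                                          int maxBatch, long waitMillis) throws InterruptedException {
        List<String> batch = new ArrayList<>();
        // Block for the first message, then sweep up any others already queued.
        String first = queue.poll(waitMillis, TimeUnit.MILLISECONDS);
        if (first == null) return batch; // nothing arrived in time
        batch.add(first);
        queue.drainTo(batch, maxBatch - 1);
        return batch;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(Arrays.asList("t1", "t2", "t3"));
        List<String> batch = drainBatch(queue, 10, 50);
        System.out.println("batched together: " + batch); // [t1, t2, t3]
    }
}
```

    Because the queue retains messages until they are taken, stopping and restarting the consumer needs no extra bookkeeping, which is the point the post makes about stop/start support.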
    Finally, using an event-based model tends to imply that each event will trigger a global transaction which ensures data integrity. The overhead of a global transaction would be significant when processing thousands or millions of records.
    It may just be that I've not had the proper experience to understand this need but I've not run into a situation where a global (across-the-enterprise) lock was required. By 'required' I mean that there was no possible way to remove the need for a global lock through design changes. I would really like to understand this concern as it might help me to understand when batching is really a requirement. Any design that requires a global lock is inherently impossible to scale horizontally, is it not?
    The initial questions posted in this thread related to the positioning and role of Spring Batch among existing batch technologies. I'm not sure a discussion of a batch "reference architecture" is one that could be carried out in this type of forum.
    I guess I'm just questioning the need for such a framework at all. What does this buy me over just running my Java application from a scheduler? And on a more general level, where I work now, I think we have little or no need for batching, but more or less everything is done as a batch. The batch never completes on time, causing a lot of headaches. We end up spending more on infrastructure to try to resolve it when our CPUs are basically idle all through the day. From this, I feel there may be a lack of understanding of the real reasons to batch, and a lot of just doing things the way they have always been done.
  25. Re: the batch playing field[ Go to top ]

    Hi James, "you can use a native JDBC driver that connects directly to the DB without the network overhead and use other features of the platform natively such as XA." What about the case where your business logic is on AS/400, AIX, Linux, Unix, etc. and the data is on DB2 UDB or Oracle running on another machine? The local optimization of type-2 drivers is lost; the cost of a transaction suddenly increases dramatically.

    I ran a performance benchmark at a customer where we compared direct file access for input/output records versus queue access. I wrote about it in the following developerWorks article: http://www.ibm.com/developerworks/websphere/library/techarticles/0606_antani/0606_antani.html#N101A2 In that benchmark, I found that direct file access consumed half as many CPU cycles, most of which were zAAP-eligible on the mainframe, and that the overall execution time was cut in half. Upon further analysis, the frequent committing of transactions was a significant contributor to the performance degradation.

    You do raise a good point about the elasticity of queues, which can be useful at times. Once again, there are certainly cases where a queue-based approach is the right solution. An issue that will come up is whether the data model and business requirements can tolerate OLTP and batch running concurrently. Often they cannot. How does one ensure that a sudden and large influx of work to the batch queue doesn't starve out higher-priority OLTP workloads using the same tables? The longer the transaction, the more likely other workloads will have to wait for access to the data. To start, you will need to build more throttling intelligence into the queuing mechanism. Many issues will follow, including how to throttle workloads against OLTP while at the same time ensuring that the batch jobs complete within some time constraint.

    Customers have adopted insert-only data models to allow OLTP and batch to run concurrently; there are pros and cons to this solution. The overarching effort and impact on a system would certainly deter large customers from changing to an insert-only data model. Another minus of a queue-based approach is the amount of data one can buffer during a job. For example, if we know that large amounts of data are to be processed sequentially, we can optimize our calls to the data store (file, database, etc.) by increasing our buffer sizes. This can dramatically improve performance. Of course, this is only meaningful if performance is an important business requirement; sometimes it may not be.

    Finally, back to the issue of idle CPUs. As Chris mentioned, many customers have adopted virtualization technologies to increase the overall hardware utilization of their infrastructure. Those infrastructures with a profile similar to what you've experienced would, for example, consolidate other workloads onto the platform and adopt a goals-oriented runtime. Goals-oriented execution allows for application virtualization, where workload management helps ensure that applications are meeting stated workload objectives. I've written about this here: http://www.ibm.com/developerworks/websphere/library/techarticles/0711_antani/0711_antani.html#compare Nitin Gaur, a colleague of mine, describes the difference between the virtualization types here: http://www.ibm.com/developerworks/websphere/library/techarticles/0711_antani/0711_antani.html#virtual This is an interesting discussion; I think we've shamelessly hijacked the discussion thread :).
  26. Re: the batch playing field[ Go to top ]

    Hi James,

    "you can use a native JDBC driver that connects directly to the DB without the network overhead and use other features of the platform natively such as XA."

    What about the case when your business logic is on AS/400, AIX, Linux, Unix, etc and the data is on DB2 UDB or Oracle running on another machine. The local optimization of type-2 drivers is lost; the cost of a transaction suddenly increases dramatically.
    Well, in our case we have OS/400 and AIX on AS/400 hardware, with DB2 (UDB) on the OS/400 partitions and Oracle and DB2 on AIX. I haven't seen drivers for Oracle like the OS/400 native JDBC driver, which (from what I understand) executes at much closer to direct file access speeds than normal JDBC drivers. I don't even know if there is one for DB2 on AIX. So your point is well taken. I would note that the real issue might be the lack of such drivers. Essentially the DB is the bottleneck, and that's why companies with high throughput needs are moving to data grids. It seems to me that data grid architectures are similar in a lot of ways to message-oriented architectures, in that you move the transaction to the data instead of moving the data to the transaction, if you get my meaning. Just to reiterate: you can get semi-batch behavior by letting queue listeners linger and wait to gather a big group of messages before issuing batch updates.
    Once again, there are certainly cases where a queue-based approached is the right solution. An issue that will come up is whether the data model and business requirements can tolerate OLTP and batch running concurrently. Often times they cannot. How does one ensure that a sudden and large influx of work to the batch queue doesn't starve out higher priority OLTP workloads using the same tables?
    That's true. Generally I haven't had to worry about this much. Either the online transactions are happening in a dedicated front-end or the online transactions are asynchronous.
    ...This can dramatically improve performance. Of course, this is only meaningful if performance is an important business requirement; sometimes it may not be.

    Finally, back to the issue of idle cpu's. As Chris had mentioned, many customers have adopted virtualization technologies to increase the overall hardware utilization of their infrastructure.
    Again, I don't want to suggest that batching is never valid. My real point in this thread is to understand what requirements and characteristics imply batching, so that I can avoid being over-zealous about trying to eliminate unnecessary batching. So here's what I think I've learned at this point. Batching fits when:
    • you have a large, known amount of data that can be processed sequentially;
    • transactions hold wide-ranging locks that will block other work, especially when the transactions are long;
    • resources for synchronous work are thin.
    Have I missed anything big?
    This is an interesting discussion, I think we've shamelessly hijacked the discussion thread :).
    Personally, I think this isn't really off topic. If you don't know you need batching, you won't know that you need a batching framework. And in any event, the point of these threads are to raise discussions.
  27. Re: the batch playing field[ Go to top ]

    And on a more general level, where I work now, I think we have little or no need for batching, but more or less everything is done as a batch. The batch never completes on time, causing a lot of headaches. We end up spending more on infrastructure to try to resolve it when our CPUs are basically idle all through the day.
    Maybe you should have used a batch framework. :-) Gary --- The Bargains are Waiting! Yard Sales at Goyarding.com
  28. Re: the batch playing field[ Go to top ]

    And on a more general level, where I work now, I think we have little or no need for batching, but more or less everything is done as a batch. The batch never completes on time, causing a lot of headaches. We end up spending more on infrastructure to try to resolve it when our CPUs are basically idle all through the day.


    Maybe you should have used a batch framework.
    Why would I use a batch framework if I don't need batching?
  29. Can someone enlighten me?[ Go to top ]

    I don't really get the point of this. I've dealt with a lot of batch-based stuff in the last few years, and in most cases it seems to cause more problems than it solves. Primarily, it creates a very rigid, lock-step architecture where everything is always out of date. Secondly, what I see is a lot of wasted CPU cycles. Sure, it might be more efficient to process all the data at once, but you end up with idle CPUs between batches. I see so many jobs that "have to run in batch" or they won't finish in time, but it seems to me that a lot of this is caused by the batch mentality of waiting until the evening to start processing. I'm not saying that batch shouldn't be used, but I see a lot of overuse of batching because of what seems like an assumption that it's the way things are done. I can also understand that some systems really have no spare cycles and need to squeeze every last bit of performance from their system. I feel that this could be better addressed by architectural changes, but I won't go so far as to say that batching is always avoidable. Having said that, I don't see what's so hard about batching that you need a framework for it. If you've got a job scheduler (which Spring Batch apparently assumes you do), you just schedule your job. Any enterprise application should have logical units of work. Why do I need to design an application differently for batch than I would if it were, say, queue driven? Clearly I'm missing something. Can anyone clarify?
  30. Re: Can someone enlighten me?[ Go to top ]

    James, interesting points. Experience varies. Clearly you've encountered situations where "batch" has been done badly. I think we can safely say the same about all application types. My experience differs from yours in that I've seen numerous examples of highly flexible, efficient batch-style applications. Some cases involve common business components shared across batch and OLTP applications; others involve integration of batch as a supportive complement to OLTP. Additional scenarios exist as well. Round-the-clock batch is a growing phenomenon. With regard to CPU usage, in all cases I deal with, I see increased focus on efficient use of machine cycles, with virtualization and other techniques being applied to ensure machines are not idle. Of course, this does not apply to everyone equally, but if you check what the industry analysts (e.g. Gartner) are saying, increasing machine utilization through such techniques has been an emerging trend.

    Nothing I've said so far justifies a "batch framework." However, one of the possible reasons why batch gets done "badly" could be the lack of structure and organization that a well-designed framework can offer. Just a thought. More specifically, long-running batch-style applications that have business requirements for failover and restartability would benefit from a framework that abstracts the details of the interaction between the application, its input/output management, and an execution container, such that lifecycle events centered around checkpoint/restart are clearly understood by the application developer and presented in an easy-to-follow pattern, enforced by programming structure. Otherwise, it's purely roll-your-own, increasing the unpredictability of the results.

    As for why you must design differently for batch than for queue driven? As in all things, it depends. Some applications should be queue driven. For example, order processing tends toward unpredictable arrival rates and lends itself to queue-driven architectures. Other applications do not. For example, billing, portfolio analysis, interest processing, etc. differ in that a known, often large, volume of data will be processed. Large volumes that involve read/compute/write sequences do not lend themselves well to queue-oriented architectures: the queue simply represents additional data movement, which robs efficiency. Clearly, you would design these applications differently if efficiency was a goal.
  31. Re: Can someone enlighten me?

    Nothing I've said so far justifies a "batch framework." However, one of the possible reasons why batch gets done "badly" could be the lack of structure and organization that a well-designed framework can offer. Just a thought.
    Well, most of the batching we do is COBOL/RPG based, and we are using the techniques that this framework intends to emulate. But I think the real problem is that a lot of things that could be dealt with immediately are done in batch. For example, pushing all the changes for the day to the data warehouse is done in batch at the end of the day. There's really no need for this, and in fact it has caused issues that can only be dealt with effectively using event-driven approaches. The batch approach is not only slow but results in data integrity issues. We end up with stale and incorrect data. I'd say it's the organization I work for, but I saw the same kinds of things at my last job. I see an assumption, among people from the COBOL world, that batching will be used for everything. In a former life, batching was the exception, done only when no better approach could be implemented without great difficulty or cost.
    For example, billing, portfolio analysis, interest processing, etc differ in that a known, often large, volume of data will be processed. Large volumes that involve read/compute/write sequences do not lend themselves well to queue oriented architectures - the queue simply represents additional data movement, which robs efficiency. Clearly, you would design these applications differently if efficiency was a goal.
    Again, this assumes that you are running at full utilization 24 hours a day. It's unlikely to make sense to sit idle just so the capacity you saved can be used more efficiently later. Perhaps, but the queue-based approach lends itself very well to scalability. Many systems can read and write to the same queue, and queues can be distributed across systems. A queue message might just be a pointer to a record in a DB. I might not be able to get one machine's throughput as high with queues, but I can add many machines working asynchronously. We used this to scale up during busy quarters at very low cost at my last job: rented machines and/or beta servers were added in to increase throughput. Ultimately, the batch-based portions of the design became the bottleneck, mainly because they would overrun their window and then sit idle. Perhaps we could have addressed that better and remained in batch, but an event-driven architecture would have been easier to scale as needed.

    The other thing that I see with batching (though it may not be a requirement) is that dependencies are at a macro level. Job two depends on job one, job three depends on job two. With event-based architectures, each record is available for the next step immediately. You don't need to wait for the last record to finish step one to start on the first records in step two. So even if you can't split one step across multiple machines, each phase of a job can potentially be handled on different systems. Additionally, failures at one point in the flow are inherently restartable at the point of failure with no extra effort.
  32. Re: Can someone enlighten me?

    As Snehal points out, we may have shamelessly hijacked this discussion. However, James' remarks invite comment to such an extent that I cannot resist! And while somewhat tangential to the discussion of the Spring Batch framework itself, this side conversation nevertheless sheds light on related issues concerning the use of the batch processing model - issues that I believe go to the heart of whether (and when) one should apply the batch model, and perhaps when one should not. Clearly one has to choose the batch processing model in the first place, for a particular application, before the choice of a framework is relevant. Disclaimer: although we're discussing batch as an application paradigm, and whether or not a batch framework is a useful construct, please understand I am not presently commenting on whether or not the framework proposed by Spring, in my opinion, is or is not a good framework for building a batch-style application. That's a separate discussion. There are other batch frameworks, like the one provided by WebSphere Compute Grid, that are also worthy of consideration.
    But I think the real problem is that a lot of things that could be dealt with immediately are done in batch.


    No disagreement: the right tool for the job is still important.
    The batch approach is not only slow but results in data integrity issues. We end up with stale and incorrect data.


    Batch patterns are not inherently slow, nor do they inherently compromise data integrity. Those are design/implementation problems. I can show you batch applications that efficiently walk large relational result sets using either pessimistic (locks held) or optimistic (over-qualified query predicate) update techniques to ensure speed and integrity. These are found in finance, trading, inventory-management and other enterprises where data integrity is essential. Such examples are found in overwhelming numbers on the System Z platform. With due respect, I simply cannot accept your statement as a sweeping indictment of batch processing. Perhaps you did not intend it that way.
    Perhaps, but the queue based approach lends itself very well to scalability. Many systems can read and write to the same queue and queues can be distributed across systems. A queue message might just be a pointer to a record in a DB. I might not be able to one machines throughput as high with queues but I can add many machines working asynchronously.


    Queue based is scalable, agreed. The MQ shared queue model comes to mind. This model is truly useful for certain applications - for example, financial journaling, where journal entries go on a queue and are appended to an audit database. However, again it all depends: you can also get in trouble with this model. For example, it doesn't handle massive updates very well. Consider a database of 10M rows that must be inspected and any number of them updated - for example, posting interest to accounts. In the queue-driven model, passing record pointers, each record pointer results in a separate query to the database. That's 10M select statements! Approached batch style, you would do this with a single select and walk the result cursor. Same amount of data movement; 10M-1 fewer DB round-trips. I understand there is still interaction for buffer and lock management, but the numbers still strongly favor the single, large select.

    Of course one could argue that a single, large select could be designed into the queue-driven approach also - some flag passed in the queue telling the queue reader to "process all records". Fair enough. However, questions like checkpointing, restart, failover, job logging, operational control, etc. come into play. Common requirements. These issues can find answers in part through frameworks - e.g. checkpoint/restart, job logging - and in part through the hosting environment - e.g. failover, operational control, etc. Think back to Snehal's "layers" discussion.

    Moreover, a well-considered batch framework would additionally abstract out the notions of input and output. With the right abstraction, queues, databases, files, and so forth could all serve as input or output to/from a batch-style process. Consider particularly a queue feeding a "batch" process. That blurs the line with the queue model and really boils down to what qualities of service are available to the application through the framework with which it is built and the environment in which it is hosted.

    Finally, batch scalability has been demonstrated through parallel processing. Products like WebSphere Compute Grid, Datasynapse, and others provide that today. Digging around the 'net yields this article on parallel batch from none other than the same Snehal Antani, also posting in this thread: http://www.ibm.com/developerworks/websphere/techjournal/0707_antani/0707_antani.html
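    The single-select-and-walk idea above can be sketched roughly as follows. This is a hypothetical illustration, not code from any framework under discussion: an in-memory array stands in for the JDBC cursor, and the rate, commit interval, and names are all invented.

    ```java
    import java.util.Arrays;

    // Hypothetical sketch of the batch-style approach above: one pass over the
    // result set (simulated here by an in-memory array standing in for a JDBC
    // cursor), with a periodic checkpoint-style commit instead of one
    // SELECT/UPDATE round trip per record. All names and numbers are invented.
    public class InterestPosting {
        static final double MONTHLY_RATE = 0.05 / 12; // assumed 5% annual rate

        /**
         * Walk the "cursor" once, updating every balance in place and
         * committing every `interval` rows. Returns the number of commits.
         */
        static int postInterest(double[] balances, int interval) {
            int commits = 0, dirty = 0;
            for (int i = 0; i < balances.length; i++) {
                balances[i] *= 1 + MONTHLY_RATE;  // compute step
                if (++dirty == interval) {        // checkpoint: conn.commit() in real JDBC
                    commits++;
                    dirty = 0;
                }
            }
            if (dirty > 0) commits++;             // commit the final partial chunk
            return commits;
        }

        public static void main(String[] args) {
            double[] accounts = new double[10_000];
            Arrays.fill(accounts, 100.0);
            int commits = postInterest(accounts, 1_000);
            // One cursor walk and 10 commits, versus 10,000 per-pointer
            // SELECTs in the queue-driven variant.
            System.out.println("commits: " + commits);
        }
    }
    ```

    In real code the loop would sit over a forward-only JDBC ResultSet opened with a single SELECT, but the shape - read, compute, write, commit every N rows - is the same.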
    The other thing that I see with batching (that may not be a requirement) is that dependencies are at a macro level. Job two depends on job one, job three depends on job two. With event-based architectures, each record is available for the next step immediately. You don't need to wait for the last record to finish step one to start on the first records in step two. So even if you can't split one step across multiple machines, each phase of a job can potentially be handled on different systems. Additionally, failures at one point in the flow are inherently restartable at the point of failure with no extra effort.


    Yes, a queue model does that nicely. The same effect has been achieved in the batch world through technologies like MVS batch pipes. So while this discussion has ranged a bit, the bottom line is that queue models, batch models, and others are all valid. A key to good design is recognizing which approach best suits the task.
  33. Re: Can someone enlighten me?


    The batch approach is not only slow but results in data integrity issues. We end up with stale and incorrect data.




    Batch patterns are not inherently slow, nor do they inherently compromise data integrity. Those are design/implementation problems. I can show you batch applications that efficiently walk large relational result sets using either pessimistic (locks held) or optimistic (over-qualified query predicate) update techniques to ensure speed and integrity. These are found in finance, trading, inventory-management and other enterprises where data integrity is essential. Such examples are found in overwhelming numbers on the System Z platform. With due respect, I simply cannot accept your statement as a sweeping indictment of batch processing. Perhaps you did not intend it that way.
    No, I did not intend it to be a sweeping generalization. I was speaking to a specific issue that we have resolved by removing batching and introducing an event-based model. The event-based model is faster and provides data that was irretrievably lost in the batch model. Effectively, we had a number of tables/files in a system developed by a 3rd party from which we needed to capture updates. This was done by hashing the rows and comparing them to previous hashes at the end of the day. This approach creates a number of problems. The first is that there are on the order of hundreds of millions of records in the file, and each was being hashed every day even though only a very small fraction were actually updated on a daily basis. Secondly, if there was more than one update per day, only the last was captured. Deletes and updates to the key were not traceable in a reliable manner. Instead, placing a trigger on the file and pushing to a queue is a much more efficient and reliable model. The point I am trying to make is that batching isn't always the most efficient approach. I might be able to process the entire file faster in batch, but processing 0.01% of the records in an event model is going to be hard to beat, regardless of the overhead of the queue approach. Sometimes the arguments for batching are effectively solutions to problems created by batching. I'm not saying that you or Snehal are making those kinds of arguments. You are both very helpful and your points are salient.
    A database of 10M rows that must be inspected and any number of them updated. For example, posting interest to accounts. In the queue driven model, passing record pointers, each record pointer results in a separate query to the database. That's 10M select statements! Approached batch style, you would do this with a single select and walk the result cursor. Same amount of data movement; 10M-1 fewer DB round-trips. I understand there is still interaction for buffer and lock management, but the numbers still strongly favor the single, large select.
    I get the point here. I don't think there's a more efficient way to do this if the record must be locked. And if you are talking about applying interest daily, then it seems to fit pretty well. We have things that have been structured as daily events but in reality 'the business' often would prefer a real-time model.
    Finally, batch scalability has been demonstrated through parallel processing. Products like WebSphere Compute Grid, Datasynapse, and others provide that today.
    I guess the question I have about that is whether parallel batching is more a way to address the issues with batching than a way to get the benefits of the batching model. In other words, is parallel batching something I would use in a green-field project, or something I would use only to improve on existing batch solutions?
  34. Re: Can someone enlighten me?

    I guess the question I have about that is whether parallel batching is a way to address the issues with batching and less about getting benefits of the batching model. In other words is parallel batching something I would use in a green-field project or something I would use only to improve on existing batch solutions?

    I have direct experience with multiple instances of parallel batch. Some cases are just an improvement on existing batch. For example, imagine that 10M record scenario we discussed previously: break it up into 10 parallel jobs, each processing 1M records - elapsed time drops approximately by a factor of 10. Of course, the data and algorithms had to lend themselves to partitioning for this to work. In each case where parallel batch is applied to improve existing batch, it is fair to ask: should I be batching at all?

    I also know of a specific case where a home-spun "batch" system, based on message-driven beans (i.e. queue-driven J2EE components), was replaced with a bona fide parallel batch system, complete with batch framework. In this case, the user was tired of fighting MDB transaction timeouts, lack of control infrastructure for batch, and difficulties with problem determination. Now of course, this doesn't describe everyone's experience with queue-based applications. Nevertheless, this was an interesting case.

    Ok James, so this has been an interesting discussion. I think we have a few conclusions:

    1. No single application model fits all circumstances - specifically, some problems are well-suited to queue-based models; some to batch; some to other models (e.g. request/response).

    2. Care should be taken to avoid applying the wrong model to a given problem.

    3. Selection of the batch model is a prerequisite for choosing a batch framework (seems obvious).
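    The 10-way split mentioned above can be sketched as a simple range partitioner. This is a hypothetical, framework-free illustration: a real parallel batch product would also handle dispatching the partitioned jobs, checkpointing them, and recovering individual failed partitions.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: split a key range of `total` records into `parts`
    // contiguous half-open ranges [lo, hi) so that `parts` jobs can run in
    // parallel, each over its own slice. Dispatch, checkpointing, and failure
    // handling are the (omitted) hard parts a real product provides.
    public class RangePartitioner {
        static List<long[]> split(long total, int parts) {
            List<long[]> ranges = new ArrayList<>();
            long base = total / parts, rem = total % parts, lo = 0;
            for (int i = 0; i < parts; i++) {
                long size = base + (i < rem ? 1 : 0); // spread any remainder evenly
                ranges.add(new long[]{lo, lo + size});
                lo += size;
            }
            return ranges;
        }

        public static void main(String[] args) {
            // The 10M-record scenario: 10 jobs of 1M records each.
            for (long[] r : split(10_000_000L, 10))
                System.out.println("job range: [" + r[0] + ", " + r[1] + ")");
        }
    }
    ```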
  35. Re: Can someone enlighten me?

    Ok James, so this has been an interesting discussion.
    I concur. I thank you and Snehal for humoring my questions on this. I feel much more informed about where batch approaches fit in. Before, if I'm going to be honest, I was an anti-batchist. I'm still going to have to struggle with inappropriate use of batching, but now I can choose my battles more carefully.
  36. Re: Can someone enlighten me?

    I think we have a few conclusions:

    1. No single application model fits all circumstances - specifically, some problems are well-suited to queue based models; some to batch; some to other (e.g. request/response)

    2. Care should be taken to avoid applying the wrong model to a given problem.

    3. Selection of the batch model is a prerequisite for choosing a batch framework (seems obvious).
    Batches are vulnerable to two forces that tend to come into play as a product/system matures: the amount of data in the system increases, and the number of integrations to other systems increases. The first force means that each task takes longer to perform, and the second means that the time available to perform each task is shorter. Translated, this means that you need to run more tasks, each needing longer to complete, within the same batch window. Often this is also coupled with a strong desire to shrink the batch window, partly to allow for longer online open hours, partly to allow complete processes involving batches in several integrated systems to finish in time. It's tempting to try to resolve these issues by throwing out the batch model (no batches means no batch window, right...), replacing it with either an event-based model or by putting the work in online transactions. Both approaches are vulnerable to the fact that batch processing can be implemented in a much more efficient manner (bulk I/O, bulk transaction control, etc). The online approach is additionally vulnerable because it requires online interfaces to surrounding systems.
  37. Re: Can someone enlighten me?

    It's tempting to try to resolve these issues by throwing out the batch model (no batches means no batch window, right...), replacing it with either an event-based model or by putting the work in online transactions. Both approaches are vulnerable to the fact that batch processing can be implemented in a much more efficient manner (bulk I/O, bulk transaction control, etc). The online approach is additionally vulnerable because it requires online interfaces to surrounding systems.
    As we have already discussed, it's true that batching can often provide efficiencies, but I think a lot of the problems I face now originate in the fact that batching isn't always more efficient but is often assumed to be. The example I give above is a case in point. Searching through millions of records for changes is much less efficient than capturing the change event for a few thousand records and sending a message to a queue. The search can take several hours, where the sum of the event captures is measured in seconds. This is aside from the fact that batching doesn't actually produce the correct result. We can then process all the changes stored in the queue in a single transaction and get efficiencies from batching where they actually count. So my upshot from all this is that it's not either-or, and deciding when to batch vs. when to use events requires a little thought.
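    The "process all the queued changes in a single transaction" idea might be sketched like this. It is purely illustrative: an in-memory BlockingQueue stands in for the real message queue, and the applied list stands in for the bulk write plus single commit.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical sketch of batching "where it counts": drain whatever change
    // events have accumulated and apply them in one unit of work, rather than
    // one transaction per event. The queue and the applied list are stand-ins.
    public class DrainAndApply {
        /** Drain up to `max` changes and apply them together; returns how many. */
        static int applyInOneTransaction(BlockingQueue<String> changes, int max,
                                         List<String> applied) {
            List<String> chunk = new ArrayList<>();
            changes.drainTo(chunk, max);  // take everything available, up to max
            applied.addAll(chunk);        // stand-in for bulk write + single commit
            return chunk.size();
        }

        public static void main(String[] args) {
            BlockingQueue<String> changes = new LinkedBlockingQueue<>();
            for (int i = 0; i < 5; i++) changes.add("change-" + i);
            List<String> applied = new ArrayList<>();
            int n = applyInOneTransaction(changes, 100, applied);
            System.out.println("applied " + n + " changes in one transaction");
        }
    }
    ```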
  38. Re: Can someone enlighten me?

    The example I give above is a case in point. Searching through millions of records for changes is much less efficient than capturing the change event for a few thousand records and sending a message to a queue. The search can take several hours, where the sum of the event captures is measured in seconds. This is aside from the fact that batching doesn't actually produce the correct result.

    We can then process all the changes stored in the queue in a single transaction and get efficiencies from batching where they actually count.

    So my upshot from all this is that it's not either-or and when to batch vs. when to use events requires a little thought.
    True, but I think that a more conventional design of the original batch would be a mix between event and batch, i.e. changes to the data would mark it as changed (using a trigger or other means), which would make it easy and efficient to pick the correct targets for the batch. But I don't really know the whole story of your example, and I am sure that your solution fits the problem at hand. I have created similar solutions, just as I have created batch-based solutions. Each has its merits.
  39. Re: Can someone enlighten me?

    True, but I think that a more conventional design of the original batch would be a mix between event and batch, i.e. changes to the data would mark it as changed (using a trigger or other means), which would make it easy and efficient to pick the correct targets for the batch.

    But I don't really know the whole story of your example, and I am sure that your solution fits the problem at hand. I have created similar solutions, just as I have created batch-based solutions. Each has its merits.
    We don't have control over the tables or the application that makes the changes. Even adding triggers is tricky, because we need to make sure they execute quickly enough to prevent issues with the application making the changes. Pushing the raw bytes from the record to a queue lets us capture the state without blocking the execution of the updates. A couple of things about your proposed solution: it's still relatively slow unless you index on the flag, which essentially means you are placing 'notes' about changes somewhere other than in the data. But most importantly, it still doesn't provide the correct solution. It provides only the ability to determine that something has changed between batches. The queuing solution captures all changes. The flag approach also provides no way to track deletes, and when keys are modified, there is no way to track that the new key has any association with the old key.
  40. Re: Can someone enlighten me?

    Pushing the raw bytes from the record to a queue lets us capture the state without blocking the execution of the updates.
    Well, I still don't know the details of your problem, and I am sure you have picked the correct solution. But I am curious how you can ensure that you don't lose an event unless you keep the queue put within the same transaction as the actual update (thus blocking)?
  41. Re: Can someone enlighten me?

    Pushing the raw bytes from the record to a queue lets us capture the state without blocking the execution of the updates.


    Well, I still don't know the details of your problem, and I am sure you have picked the correct solution.

    But I am curious how you can ensure that you don't lose an event unless you keep the queue put within the same transaction as the actual update (thus blocking)?
    We do. I didn't explain it clearly. We are using file triggers (not SQL triggers) and writing the new and old row data to the queue in raw form. That is, we are writing basically an array of bytes to the queue. This executes fairly quickly. It's not completely insignificant - it probably doubles the length of the update. In real terms, on our (slow) development hardware, a mass update of several thousand rows goes from something like 2.5 seconds to something like 5 seconds. Most of the updates are single rows made by users who won't possibly know the difference. That queue is then read by another process that breaks out the row data into a nice message and moves it to another queue where the real work happens. The slowest part is converting the row bytes into a formatted message. The only work that happens in the DB transaction is that initial push to the queue.
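    The two-stage shape described here - a cheap raw-bytes put inside the trigger, with the slow formatting done by a separate reader - might be sketched like this. It is a hypothetical illustration only: in-memory BlockingQueues stand in for the real message queues, and the record layout is invented.

    ```java
    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical sketch: the trigger path only pushes the old/new row images
    // as raw bytes (cheap, inside the DB transaction); a separate formatter
    // later converts the bytes into a structured message on a second queue
    // (slow, outside the transaction). Queues and the row layout are stand-ins.
    public class ChangeCapture {
        final BlockingQueue<byte[]> rawQueue = new LinkedBlockingQueue<>();
        final BlockingQueue<String> workQueue = new LinkedBlockingQueue<>();

        /** Trigger path: push old+new row images as raw bytes, nothing more. */
        void onRowUpdate(String oldRow, String newRow) {
            rawQueue.add((oldRow + "|" + newRow).getBytes(StandardCharsets.UTF_8));
        }

        /** Formatter: take one raw entry and do the slow conversion off-transaction. */
        void formatOne() throws InterruptedException {
            String[] parts = new String(rawQueue.take(), StandardCharsets.UTF_8)
                    .split("\\|", 2);
            workQueue.add("{old:\"" + parts[0] + "\", new:\"" + parts[1] + "\"}");
        }

        public static void main(String[] args) throws InterruptedException {
            ChangeCapture cc = new ChangeCapture();
            cc.onRowUpdate("acct=1,bal=100", "acct=1,bal=105"); // fast trigger put
            cc.formatOne();                                     // slow, but async
            System.out.println(cc.workQueue.take());
        }
    }
    ```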
  42. Re: Can someone enlighten me?

    We are using file triggers (not SQL triggers) and writing the new and old row data to the queue in raw form. That is, we are writing basically an array of bytes to the queue. This executes fairly quickly. It's not completely insignificant - it probably doubles the length of the update. In real terms, on our (slow) development hardware, a mass update of several thousand rows goes from something like 2.5 seconds to something like 5 seconds. Most of the updates are single rows made by users who won't possibly know the difference.

    That queue is then read by another process that breaks out the row data into a nice message and moves it to another queue where the real work happens. The slowest part is converting the row bytes into a formatted message.

    The only work that happens in the DB transaction is that initial push to the queue.
    Ok, then I understand. Just to explain my point of view: 2 queue puts = 2 × 10 ms = 20 ms, plus time for conversion (20 ms), plus time for the real work (40 ms) = 80 ms per record. This would move my typical/prototype month-end batch job from 9-12 hours to 48+ hours, thus breaking some of the most important processes of any customer company. Different situations, different solutions. Thanks for the discussion.
  43. Re: Can someone enlighten me?

    Ok, then I understand.

    Just to explain my point of view: 2 queue puts = 2 × 10 ms = 20 ms, plus time for conversion (20 ms), plus time for the real work (40 ms) = 80 ms per record. This would move my typical/prototype month-end batch job from 9-12 hours to 48+ hours, thus breaking some of the most important processes of any customer company.

    Different situations, different solutions.

    Thanks for the discussion.
    The whole point of the event-driven design would be to either eliminate a lot of the work or do whatever work can be done immediately, as soon as resources are available. So in your example, if it were possible to do a good portion of the work throughout the month, 48 hours spread across a month is 1.6 hours a day. If you are squeezing that 1.6 hours out of your spare capacity (assuming there is some), it's basically free. So the event-driven part of the process might be building a table filled with the data that you will process at the end of the month, saving you time in that batch. You don't need to keep the data in the queue for the whole month.

    An analogy would be the way we did dishes in my college apartment compared to how I do dishes now. In college, we would just pile the dishes up in the sink until it smelled so bad that I could no longer stand it or we were going to have a party. In my current situation, dishes are done as they are used. On the one hand, the college solution may seem more efficient: you can get everything set up and do 'batch' dish-washing. But in reality it's not, because you need to move the dishes out of the way before you can even start washing them. Even if you do get some efficiency out of waiting to do them all at once, it turns into a huge job that takes several hours, whereas doing them after each meal only takes a few minutes each day, i.e. an insignificant loss.

    Now, it's absolutely possible that there is little to nothing that can be done until the end of the month. In that case, the event-based solution doesn't really provide any benefits and it may be the appropriate choice not to use it.
  44. Re: Can someone enlighten me?

    Now, it's absolutely possible that there is little to nothing that can be done until the end of the month. In that case, the event-based solution doesn't really provide any benefits and it may be the appropriate choice not to use it.
    Yes, that's the problem in my typical/prototype case, because every record must be recalculated based on information not available until a fixed time shortly before month-end. I have designed event-driven solutions as well, and they work great for use cases suited to them.