New Java Framework For Data-Intensive Java on Multicore


  1. J2EE and the myriad other web app frameworks have served us well. Why build a web app from scratch, including bean pooling, threading, connection management, etc., when it's already done for you? But when Java developers sit down to build a data processing application (financial services data, insurance claims, health informatics, bio-research, the works...) they have nothing. Nada. No help.

     Let me be more specific about the application here -- it's not an OLTP model. Not SOA or ESB based. This is bulk (GB or TB) data processing when you have minutes to spare, not hours to wait. With the "multicore arms race" now in full swing, Java developers can no longer wait for CPU clock speed to rescue their application's poor performance. I write about it in detail on my blog.

     Well, I'm pleased to announce to the Java community that Pervasive DataRush Beta 1 is available for download. DataRush is a lightweight (less than 3 MB on disk) but extremely powerful parallel processing engine framework. It's 100% Java and runs on Java 5 SE. It handles all the parallel programming for you, including horizontal, vertical and pipeline parallelism. In fact, you can code many data processing applications using XML scripting and our out-of-the-box library of Java operators.

     We've started benchmarking this framework against well-known algorithms and have found that, vs. Perl or non-threaded Java, we can cut the runtime to one tenth of the original in some cases. Not all Comp Sci problems can be made parallel, so I'm not claiming a magic wand here -- but even with the not-so-parallel algorithms, DataRush gives you pipeline parallelism (each module of your algorithm runs on a separate CPU core while data flows dynamically through them). I've posted one such benchmark on the website and will keep posting more as they become available.

     Download it. Try it. Let me know what you think. We've just launched the beta program, so now is your chance to be heard and have your ideas change the course of DataRush. Thanks for spreading the word!

     Emilio Bernabei
     Director of Product Management
     Chief Evangelist, DataRush

     Message was edited by: joeo@enigmastation.com
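For readers who want a concrete picture of the pipeline parallelism described in the post above, here is a minimal sketch using only standard java.util.concurrent classes. It is not DataRush code: the class name PipelineSketch, the three stages, and the queue size are all assumptions made for illustration, and the input is assumed to contain only valid integers (error handling is left out).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical three-stage pipeline: read -> parse -> square and collect. */
class PipelineSketch {

    /** Each stage runs on its own thread; bounded queues carry data between stages. */
    static List<Long> run(List<String> lines) {
        // Optional.empty() is used as the end-of-stream marker on both queues.
        BlockingQueue<Optional<String>> raw = new ArrayBlockingQueue<>(64);
        BlockingQueue<Optional<Long>> parsed = new ArrayBlockingQueue<>(64);
        List<Long> results = new ArrayList<>();

        // Stage 1: feed the raw lines into the first queue.
        Thread reader = new Thread(() -> {
            try {
                for (String line : lines) {
                    raw.put(Optional.of(line));
                }
                raw.put(Optional.empty());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stage 2: parse each line while stage 1 is still producing.
        Thread parser = new Thread(() -> {
            try {
                for (Optional<String> item = raw.take(); item.isPresent(); item = raw.take()) {
                    parsed.put(Optional.of(Long.parseLong(item.get().trim())));
                }
                parsed.put(Optional.empty());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stage 3: square each value and collect it. Only this thread writes to results.
        Thread collector = new Thread(() -> {
            try {
                for (Optional<Long> item = parsed.take(); item.isPresent(); item = parsed.take()) {
                    long value = item.get();
                    results.add(value * value);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        parser.start();
        collector.start();
        try {
            reader.join();
            parser.join();
            collector.join(); // join() makes the collected results visible to the caller
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return results;
    }
}
```

Calling PipelineSketch.run(Arrays.asList("1", "2", "3")) pushes the three lines through all stages in order. The bounded queues provide back-pressure: a fast stage blocks when the queue ahead of it is full, so memory use stays flat no matter how large the input is.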
  2. Concurrent programming should be easy and, if possible, standard (not proprietary). An example of such an open framework is Javolution (only 250 KB), which provides, among other things, ConcurrentContext to transparently take advantage of multiple cores. Here too, you don't have to mess with threads or synchronization. Just write your code in a concurrent manner and run it! ConcurrentContext can be disabled at runtime in order to measure its effect on execution speed. Here is an example of a concurrent/recursive quick sort illustrating how simple it is:

        void quickSort(final FastTable<? extends Comparable> table) {
            final int size = table.size();
            if (size < 100) {
                table.sort(); // Direct quick sort.
            } else {
                // Splits table in two and sorts both parts concurrently.
                final FastTable<? extends Comparable> t1 = FastTable.newInstance();
                final FastTable<? extends Comparable> t2 = FastTable.newInstance();
                ConcurrentContext.enter();
                try {
                    ConcurrentContext.execute(new Logic() {
                        public void run() {
                            t1.addAll(table.subList(0, size / 2));
                            quickSort(t1); // Recursive.
                        }
                    });
                    ConcurrentContext.execute(new Logic() {
                        public void run() {
                            t2.addAll(table.subList(size / 2, size));
                            quickSort(t2); // Recursive.
                        }
                    });
                } finally {
                    ConcurrentContext.exit();
                }
                // Merges results.
                for (int i = 0, i1 = 0, i2 = 0; i < size; i++) {
                    if (i1 >= t1.size()) {
                        table.set(i, t2.get(i2++));
                    } else if (i2 >= t2.size()) {
                        table.set(i, t1.get(i1++));
                    } else {
                        Comparable o1 = t1.get(i1);
                        Comparable o2 = t2.get(i2);
                        if (o1.compareTo(o2) < 0) {
                            table.set(i, o1);
                            i1++;
                        } else {
                            table.set(i, o2);
                            i2++;
                        }
                    }
                }
                FastTable.recycle(t1);
                FastTable.recycle(t2);
            }
        }
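For comparison, the same split, sort-both-halves-concurrently, then merge idea can be sketched with the fork/join classes that later became part of the standard JDK (java.util.concurrent, Java 7 and up). This is not Javolution's API; the class names ForkJoinSortSketch and SortTask are assumptions for illustration, and the cut-off of 100 elements is simply borrowed from the post above.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

/** Hypothetical fork/join version of the split-sort-merge idea. */
class ForkJoinSortSketch {

    private static final int THRESHOLD = 100; // below this size, sort sequentially

    static final class SortTask extends RecursiveAction {
        private final int[] data;
        private final int from; // inclusive
        private final int to;   // exclusive

        SortTask(int[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected void compute() {
            if (to - from < THRESHOLD) {
                Arrays.sort(data, from, to);
                return;
            }
            int mid = (from + to) >>> 1;
            // Sort both halves concurrently, then wait for both to finish.
            invokeAll(new SortTask(data, from, mid), new SortTask(data, mid, to));
            merge(mid);
        }

        /** Merges the sorted ranges [from, mid) and [mid, to) in place. */
        private void merge(int mid) {
            int[] left = Arrays.copyOfRange(data, from, mid);
            int i = 0;    // cursor in the copied left half
            int j = mid;  // cursor in the right half (still inside data)
            int k = from; // write position; never overtakes j
            while (i < left.length && j < to) {
                data[k++] = (left[i] <= data[j]) ? left[i++] : data[j++];
            }
            while (i < left.length) {
                data[k++] = left[i++];
            }
            // Any remaining right-half elements are already in their final place.
        }
    }

    static void parallelSort(int[] data) {
        ForkJoinPool.commonPool().invoke(new SortTask(data, 0, data.length));
    }
}
```

The common pool sizes itself from the number of available processors, so the caller never names a thread count. In production code the JDK's own Arrays.parallelSort (Java 8 and up) covers this case directly; the sketch only shows the shape of the technique.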
  3. Hey Jean-Marie, how's it going? I really like your direction. I want as many folks in the Java community as possible to benchmark the various frameworks out there on several levels:

     a) Speed to learn a framework
     b) Speed to code a solution to a problem w/ framework
     c) Ability for framework to scale efficiently on multicore

     Also, to all the readers out there, I may not have stressed this enough: DataRush is for DATA-INTENSIVE problems. Javolution may be better for you in some cases. We tried to balance speed of development (hence the XML scripting language for expressing dataflow graphs) with the overhead of the framework. In some cases, maybe Javolution is lighter weight... I don't know. To that end, some things to consider:

        1. If you can use DataRush XML scripting to do the same sort operation shown here, is that valuable?
        2. Here you see code for merging. DataRush does that for you, as implied by the use of our join operator in the XML.
        3. Benchmark it! Proof is in the pudding.

     I leave it to readers to balance design performance and maintenance of code vs. n-th degree speed/throughput needs. Thanks for downloading DataRush.
  4. Hey Jean-Marie, how's it going? I really like your direction. I want as many folks in the Java community to benchmark the various frameworks out there...
    Agree! Multi-core is here to stay, and we can expect more and more solutions in 2007. Most likely a new JSR will be started and a standard solution will emerge. BTW: Really nice site with many interesting links.
  5. Give us some code examples ;-)

    Hello. I've been evaluating the "multicore switch" for some time from an architectural perspective on my blog. "Architectural perspective" means to me that an architect must be comfortable with a range of possible solutions, from home-made designs for the simpler cases to an existing framework for more complex cases. One size won't fit all. So it's good that more products come to light, such as DataRush or Javolution (I'm also writing a very small framework dedicated to parallel image processing, which is a really specific task). But it's hard for the architect to build a deep knowledge of these tools, as there are more and more of them and deep knowledge requires time. Both DataRush and Javolution are on my radar (I think I've also been emailed by the CTO at Pervasive) and I'll be trying them ASAP. One suggestion to Emilio: I think that posting a brief piece of code, as Jean-Marie did, to give a rough idea of the framework is a very good thing. I looked for some examples on the Pervasive site, but I couldn't find any. I'd strongly suggest publishing some.
  6. Simple sort example

    Here is a simple example that sorts ten randomly generated integers:

        <?xml version="1.0" encoding="UTF-8"?>
        <!-- (c) Copyright 2006 Pervasive Software Inc. All rights reserved. -->
        mwalker 2006-12-23 Demonstrates the sort operator.

    Note that I've explicitly set the "partitionCount" property of the sort operator to zero, which means it will automatically assess hardware parallelism and internally exploit that information. The results of running this code:

        C:\workspace\testing>dfe -cp build\dist\testing.jar tests.learning.SortTest
        2006-12-23 08:42:11.419 INFO SortTest.log.logType.customize Input type is int
        2006-12-23 08:42:11.740 INFO SortTest.sort.sortDispatcher.run Sorted 10 rows in memory
        2006-12-23 08:42:11.960 INFO SortTest.log.logRows.run Row 1 is -1261162224
        2006-12-23 08:42:11.960 INFO SortTest.log.logRows.run Row 2 is -982258612
        2006-12-23 08:42:11.960 INFO SortTest.log.logRows.run Row 3 is 15967868
        2006-12-23 08:42:11.960 INFO SortTest.log.logRows.run Row 4 is 175398363
        2006-12-23 08:42:11.960 INFO SortTest.log.logRows.run Row 5 is 185780738
        2006-12-23 08:42:11.960 INFO SortTest.log.logRows.run Row 6 is 980354890
        2006-12-23 08:42:11.960 INFO SortTest.log.logRows.run Row 7 is 1570617054
        2006-12-23 08:42:11.960 INFO SortTest.log.logRows.run Row 8 is 1711199546
        2006-12-23 08:42:11.960 INFO SortTest.log.logRows.run Row 9 is 1962638078
        2006-12-23 08:42:11.960 INFO SortTest.log.logRows.run Row 10 is 2022775704
        2006-12-23 08:42:11.960 INFO SortTest.log.logRows.run There were 10 rows sinked
        2006-12-23 08:42:11.970 INFO com.pervasive.dataflow.tools.cli.DFECLI.processLeftoverArgs Job runtime: 1.815

    As a word of warning, I'm relatively new to DataRush, so take what I say with a grain of salt. However, it seems to me that coding in DataRush involves a new way of looking at concurrent programming. In DataRush, you are chiefly concerned with building a network of concurrent processes. DataRush gives you a language (the XML you see above) and a standard library of operators (like sort) from which to construct your apps -- you don't have to be concerned with threading or traditional concurrency at all. In this sense, it is much "higher level," hiding the details of the threaded implementation and also providing you with a structure within which to operate: the process network.
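To make the "partitionCount of zero means assess hardware parallelism" idea concrete, here is a rough plain-Java sketch of a partitioned sort. It does not show how DataRush implements its sort operator; the class PartitionedSortSketch, the method names, and the rule that zero maps to Runtime.availableProcessors() are assumptions chosen to mirror the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical partitioned sort: split, sort partitions in parallel, k-way merge. */
class PartitionedSortSketch {

    /** Assumed rule: a non-positive request means "one partition per available core". */
    static int resolvePartitionCount(int requested) {
        return requested > 0 ? requested : Runtime.getRuntime().availableProcessors();
    }

    static int[] sort(int[] input, int partitionCount) {
        // Never use more partitions than elements, and always at least one.
        int parts = Math.max(1, Math.min(resolvePartitionCount(partitionCount), input.length));
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<Future<int[]>> pending = new ArrayList<>();
            for (int p = 0; p < parts; p++) {
                int from = (int) ((long) input.length * p / parts);
                int to = (int) ((long) input.length * (p + 1) / parts);
                pending.add(pool.submit(() -> {
                    int[] chunk = Arrays.copyOfRange(input, from, to);
                    Arrays.sort(chunk); // each partition is sorted on its own thread
                    return chunk;
                }));
            }
            List<int[]> chunks = new ArrayList<>();
            for (Future<int[]> f : pending) {
                chunks.add(f.get());
            }
            return merge(chunks, input.length);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("sort interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** K-way merge driven by a heap of {chunkIndex, offset} cursors. */
    static int[] merge(List<int[]> chunks, int total) {
        int[] out = new int[total];
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (a, b) -> Integer.compare(chunks.get(a[0])[a[1]], chunks.get(b[0])[b[1]]));
        for (int c = 0; c < chunks.size(); c++) {
            if (chunks.get(c).length > 0) {
                heap.add(new int[] {c, 0});
            }
        }
        int k = 0;
        while (!heap.isEmpty()) {
            int[] cursor = heap.poll();
            int[] chunk = chunks.get(cursor[0]);
            out[k++] = chunk[cursor[1]];
            if (cursor[1] + 1 < chunk.length) {
                heap.add(new int[] {cursor[0], cursor[1] + 1});
            }
        }
        return out;
    }
}
```

PartitionedSortSketch.sort(data, 0) lets the machine decide the degree of parallelism, while sort(data, 4) pins it. The caller's code is identical either way, which is roughly the appeal of leaving that decision to a framework.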
  7. Re: Simple sort example

    There is much, much more, but I would like to add an important side note here. The "operators" you use in your process network could be anything from a Java class we provide or a Java class you write to an entirely new "assembly" you previously created using XML. So you can see how the process network lends itself to reuse, and how you don't always have to write Java to create new "operators" -- sometimes it's as simple as an XML dataflow snippet.

    The engine will look at the nested process networks, the memory afforded to the JVM and the number of cores available, then create a more detailed (expanded) process network of parallel threads at compile time. NOTE: "compile time" for DataRush actually happens during runtime... it's sort of like a pre-processing step done just prior to running the job. Of course, that's just the beginning. At runtime, the engine has to manage in-memory queues of data as the readers stream data throughout the process networks.

    There's so much to say... not enough real estate. I haven't even started on what we call "customizers" -- the way you give intelligence to your custom operators so they can self-assess parallelism strategies. The docs are fairly robust for a beta. So... download.
  8. Simple sort example.

    The problem with simple examples is that it is often simpler to do the same thing the most obvious way. For example, which would you say is simpler to integrate into another application such as Tomcat? At what point would a developer say the example you provided is worth the extra effort to learn, support, etc.?

        import java.util.Arrays;
        import java.util.Random;

        public class RandomIntegers {
            // Returns num random integers sorted in ascending order.
            public static Integer[] getSortedRandomIntegers(int num) {
                Random random = new Random();
                Integer[] ret = new Integer[num];
                for (int i = 0; i < num; i++)
                    ret[i] = random.nextInt();
                Arrays.sort(ret);
                return ret;
            }

            // Optional first argument: how many integers to generate (default 10).
            public static void main(String... args) {
                int num = args.length > 0 ? Integer.parseInt(args[0]) : 10;
                System.out.println("Random numbers length= " + num);
                System.out.println(Arrays.asList(getSortedRandomIntegers(num)));
            }
        }
  9. Peter, I agree that using a 'hello world' example is not the way to justify the ROI of learning DataRush. We were just trying to compare concurrent code constructs. I would say if your web app needs to sort 10 integers -- stick with Arrays.sort()!!

     But if you want an example that better illustrates the counterpoint to using custom code, let's use 1 million records. Each record has 200 integers. Now sort the 1 million records 100 times -- once for every 2nd field in the record. Your code has no concurrency that I can see. Maybe I'm missing it -- I'm not claiming to be a J2EE expert developer. So what would Tomcat do with 2, 4, 8, 16 cores to work with? How would your code look? How would it vary its concurrent thread count based on available processors? Or would it? How would it vary the batching of data if you vary memory from 1 GB to 16 GB?

     I would have to say sorting is 'boring' and also too simplistic a business problem. Maybe some smart sorting expert implemented the world's best Arrays.sort()?? I guess now that Sun Java is GPL I could go look :-)

     Tomcat is awesome. Don't get me wrong. But it's like using a hammer for a drill's job. Let J2EE do the real-time SOA and the web app serving and let DataRush do the non-real-time data processing, IMHO. With the chip vendors doubling the number of cores every 18 months now, I would suggest that developers calling Arrays.sort() are not going to get the ROI their CIO is looking for... but maybe your competition will. Where between 10 and 1 million records is the cross-over? Only a good ROI calculator knows... hmmm, there's another TODO for this holiday season....
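One hand-written answer to the "how would it vary its thread count based on available processors?" question might look like the sketch below: one sort task per even-numbered field, run on a pool sized from Runtime.availableProcessors(). It is plain JDK code, not DataRush, and the class and method names (MultiKeySortSketch, sortByEveryOtherField) are assumptions for illustration. It does not address the memory-based batching question raised in the post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical multi-key sort: one sorted view of the records per even-numbered field. */
class MultiKeySortSketch {

    /**
     * Returns sorted views for fields 0, 2, 4, ... in that order.
     * Each view shares row objects with the input; only the row order differs.
     */
    static List<int[][]> sortByEveryOtherField(int[][] records, int fieldCount) {
        // Pool size follows the hardware, so the same code adapts to 2, 4, 8 or 16 cores.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<int[][]>> pending = new ArrayList<>();
            for (int field = 0; field < fieldCount; field += 2) {
                final int key = field;
                pending.add(pool.submit(() -> {
                    int[][] view = records.clone(); // copies row references, not row data
                    Arrays.sort(view, Comparator.comparingInt((int[] row) -> row[key]));
                    return view;
                }));
            }
            List<int[][]> views = new ArrayList<>();
            for (Future<int[][]> f : pending) {
                views.add(f.get());
            }
            return views;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("sort interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the sorts by different fields are independent of one another, this case parallelizes without any merge step. The harder questions in the post, such as batching by available memory or streaming data that does not fit in the heap, are where hand-rolled code like this starts to grow.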
  10. I agree that one solution will not fit all uses. Different people approach the same problem in different ways. Essence Java Framework supports multi-threading at the component level and is externally configurable. It uses the built-in Java 5 SE concurrency libraries to support multi-threading, and the number of threads in the pool for a component is configured in an external file.

      a) Speed to learn a framework. Essence does not require you to learn a single Essence class. It does, however, use programming by convention. For example, you must have a constructor which takes all arguments rather than setters. If you want to make full use of Essence, its API is still not large. (Getting started, Javadoc)

      b) Speed to code a solution to a problem w/ framework. A sample application with a JMS client and broker is provided. The jar for the JMS client and broker is 39K, or 354K including all required JARs. It has two one-page configuration files (60 lines in total), one for the client and one for the broker.

      c) Ability for framework to scale efficiently on multi-core. As Sun have demonstrated, the out-of-the-box libraries are improved from one version of Java to the next. Essence uses these libraries directly rather than creating new ones or a new layer on top, so it can take full advantage of any enhancements. (Update: Java 6 Leads Out of Box Server Performance)

      What does it give you? Transparent high performance clustering across servers, tested in 2-way for 4-way mastering. (Benchmarks)
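The "thread pool size configured in an external file" approach mentioned in the last post can be sketched with nothing more than java.util.Properties and the Java 5 executor classes. This is not the Essence API or its configuration format; the class ConfiguredPoolSketch and the property name component.threads are hypothetical, and the configuration is parsed from a string here only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical externally configured thread pool built on the standard executors. */
class ConfiguredPoolSketch {

    /** Hypothetical property name; not taken from any real framework. */
    static final String POOL_SIZE_KEY = "component.threads";

    /** Parses properties text; a real application would read this from a file. */
    static Properties parse(String text) {
        Properties config = new Properties();
        try {
            config.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return config;
    }

    /** Uses the configured size, or falls back to the number of available cores. */
    static int poolSize(Properties config) {
        String raw = config.getProperty(POOL_SIZE_KEY);
        if (raw == null) {
            return Runtime.getRuntime().availableProcessors();
        }
        int size = Integer.parseInt(raw.trim());
        if (size < 1) {
            throw new IllegalArgumentException(POOL_SIZE_KEY + " must be at least 1, got " + size);
        }
        return size;
    }

    static ExecutorService newPool(Properties config) {
        return Executors.newFixedThreadPool(poolSize(config));
    }
}
```

With this shape, changing the degree of concurrency is a one-line edit to the configuration (for example component.threads=8) rather than a code change, and leaving the property out defers to the hardware.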