Discussions

News: Wire speed Java serialization

  1. Wire speed Java serialization (15 messages)

    Billy Newport points to a paper which has information and performance metrics on Java serialization. Billy points out that "Externalization is like 10x serializable, but that's no surprise. What's cool in the paper is that it describes a Myrinet-integrated serialization mechanism that is over 10x externalizable, which is pretty cool."

    Read the paper: Java Objects Communication on a High Performance Network

    Read Billy Newport's thoughts in In search of wire speed Java serialization

    Threaded Messages (15)

  2. Wire speed Java serialization[ Go to top ]

    It must be a pretty old article, seeing how they are using Myrinet and 400MHz PIIIs. Since this work is so quantitative in nature, it might be worth redoing the experiment using InfiniBand and the latest P4 hardware. (InfiniBand is 10Gbps whereas their Myrinet is 1Gbps. Also, IB claims 200ns latency.)

    I can't tell whether they are thinking about Online Transaction Processing applications or compute clusters. But why would anyone want to write a computational system in Java? And if they are thinking about OLTP, well, then the code is going to be doing a whole lot of object allocation anyway...
  3. Wire speed Java serialization[ Go to top ]

    It makes their points more valid, IMHO. InfiniBand is now 10/30 Gbit/sec, and Java serialization is pretty hobbling as the wire speeds rise. It would be interesting to redo these tests using direct-mapped arrays and externalizable or xexternalizable.

    Billy
  4. Wire speed Java serialization[ Go to top ]

    It makes their points more valid IMHO.


    It should, yes. On the other hand, a CPU with better memory bandwidth (e.g. a P4 with 800MHz RDRAM) should also run serialization faster. So while the general principle should still hold, you should see the numbers before you spend money on anything.

    > InfiniBand is now 10/30 Gbit/sec, and Java serialization is pretty hobbling as the wire speeds rise. It would be interesting to redo these tests using direct-mapped arrays and externalizable or xexternalizable.

    Direct-mapped array meaning Java arrays with RDMA? Or is a "direct-mapped array" some hardware I don't know of?
  5. Wire speed Java serialization[ Go to top ]

    Even with the CPU increases, I think wire speeds are now starting to increase faster than CPU speeds: we just got 1Gb Ethernet, and Intel is now selling 10Gb Ethernet. I don't think CPU speed has increased by a factor of 10 in the same time period.

    Direct-mapped in the sense of Java NIO. This, coupled with RDMA and some sort of Java API like the one in the paper, is what I'm thinking would be cool.
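
    To make the NIO angle concrete, here is a minimal sketch of marshalling primitives into a direct (off-heap) buffer; the `DirectBufferSketch` class and its method names are made up for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class DirectBufferSketch {
    // Marshal a couple of primitives straight into a direct (off-heap) buffer.
    // Direct buffers can be handed to the OS/NIC without an extra copy, which
    // is what makes them interesting for RDMA-style transports.
    static ByteBuffer marshal(int id, double value) {
        ByteBuffer buf = ByteBuffer.allocateDirect(12)      // 4 + 8 bytes
                                   .order(ByteOrder.BIG_ENDIAN);
        buf.putInt(id);
        buf.putDouble(value);
        buf.flip();             // ready to be read or written to a channel
        return buf;
    }

    static double unmarshalValue(ByteBuffer buf) {
        buf.getInt();           // skip the id
        return buf.getDouble();
    }

    static double roundTrip(int id, double value) {
        return unmarshalValue(marshal(id, value));
    }
}
```

    The point is that the bytes land outside the Java heap in wire order, so a channel (or, hypothetically, an RDMA API) can ship them without another copy.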
  6. RDMA[ Go to top ]

    RDMA, for those of you to whom this is new, basically lets you copy a block of memory from one computer to another using DMA over Ethernet. This can be used as the basis of faster RPC/messaging mechanisms between peers in an RDMA network. Latencies should drop to < 20µs.

    200ns has been mentioned as a latency figure, but I think Voltaire's InfiniBand is the only one claiming that; my own experience puts InfiniBand at around 10µs, with RDMA at around 10-20µs.
  7. Wire speed Java serialization[ Go to top ]

    [..] I don't think CPU speed has increased by a factor of 10 in the same time period.


    I am not sure how you would predict the speed increase, but memory bandwidth might be one way to do it. This is the STREAM benchmark:

    http://www.cs.virginia.edu/stream/peecee/Bandwidth.html

    From looking at the results, you could argue that the COPY benchmark goes from, say, 300 to about 1900 when you move from PIII motherboards to P4 motherboards. And I haven't seen any benchmarks with 800MHz RDRAM; it should be even higher.

    I don't know if it's a good way of computing the speed increase, but at least it's better than CPU clock frequency.
  8. Wire speed Java serialization[ Go to top ]

    \Guglielmo Lichtner\
    It should, yes. On the other hand a cpu with better memory bandwidth (e.g. p4 with RDRAM 800Mhz) should also run serialization faster. So while the general principle should still be true you should see the numbers before you spend money on anything.
    \Guglielmo Lichtner\

    A big part of the problem is that Java serialization is a very generalized problem. If you need to send arbitrary object graphs, it can do that - but you're going to pay.

    In this specific case, it's really not a question of seeing the numbers. Raw serialization is _dead slow_. To understand why, take a peek at ObjectOutputStream/ObjectInputStream sometime. I've seen servers under load that literally spend 25% of their CPU time in marshalling and unmarshalling code.

    Externalizable is, as indicated, better. But it still sucks. What you always come back to is the generalization problem - a generalized object graph is never going to be all that quick to properly marshal and unmarshal.
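
    To illustrate why Externalizable is faster, here's a minimal sketch (the `Point` class and its `roundTrip` helper are hypothetical, not from the thread): the marshalling is two hand-written field writes, with none of the reflective field discovery that Serializable pays for.

```java
import java.io.*;

// Hypothetical class with hand-written marshalling via Externalizable.
class Point implements Externalizable {
    public int x, y;

    public Point() {}                          // required public no-arg ctor
    public Point(int x, int y) { this.x = x; this.y = y; }

    // Just two int writes - no reflective walk over the fields.
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
    }

    public void readExternal(ObjectInput in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }

    // Serialize and deserialize through the standard object streams.
    static Point roundTrip(Point p) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(p);
            }
            try (ObjectInputStream ois = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()))) {
                return (Point) ois.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

    Even so, the bytes still flow through OOS/OIS, which is why Externalizable helps but doesn't get you to hand-rolled protocol speeds.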

    What you need to keep in mind is that this general object graph spitting and sucking is competing with hand-tuned protocols in non-Java realms. For really high speed stuff, people have C code that agonizes over adding 2 extra longs into the header - not an exaggeration. Often their highest CPU cost is converting their native ints and longs into network byte order.

    It comes down to a set of choices - for the most part, between the developer convenience of dealing with familiar objects and having them automagically marshalled/unmarshalled, versus creating a PITA custom protocol that's very fast but requires developer grunt work.

    In a very real sense, even if you had some magic system with 1 nanosecond transfer times for a K of data, you'd be just as slow today as you would be with a 100MBit network. The reason why is that you're spending all of your time in the serialization code, not flushing buffers out to the network.

    As I mentioned somewhere around here awhile back, for 500 bytes or so of data a serialized network comms package would max out around 800 messages/second (this is with everything Externalizable). An alternate implementation that didn't use serialization maxes out around 2,500 messages/second. I had to work harder in creating a custom protocol, but that protocol is much, much faster than generalized object graph code in ObjectOutputStream/ObjectInputStream.

         -Mike
  9. Wire speed Java serialization[ Go to top ]

    A big part of the problem is that Java serialization is a very generalized problem. If you need to send arbitrary object graphs, it can do that - but you're going to pay.


    Absolutely. But on a cluster it doesn't have to be that general - that problem could be optimized away. The cluster nodes could negotiate an encoding of the class names, for example: String -> 1, Integer -> 2, etc.

    So I think that serialization *as applied to cluster communication* does not in principle have to suffer from "generality overhead".
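
    Such a negotiated encoding might look roughly like this (a hypothetical sketch; the `ClassNameTable` class and its methods are invented for illustration). Both ends register the same classes in the same order, so the wire carries a small integer instead of the full class name:

```java
import java.util.*;

// Hypothetical sketch: cluster nodes agree on a table of class names up
// front, so a message carries a small id instead of the full name string.
class ClassNameTable {
    private final Map<String, Integer> toId = new HashMap<>();
    private final List<String> toName = new ArrayList<>();

    // Idempotent: registering the same name twice returns the same id.
    int register(String className) {
        Integer id = toId.get(className);
        if (id != null) return id;
        toName.add(className);
        toId.put(className, toName.size() - 1);
        return toName.size() - 1;
    }

    int idOf(String className) { return toId.get(className); }
    String nameOf(int id)      { return toName.get(id); }
}
```

    A real implementation would exchange the table during connection setup; the sketch only shows the lookup structure.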

    > In this specific case, it's really not a question of seeing the numbers. Raw serialization is _dead slow_. To understand why, take a peek at ObjectOutputStream/ObjectInputStream sometime. I've seen servers under load that literally spend 25% of their CPU time in marshalling and unmarshalling code.

    It's still worth seeing the numbers on more modern hardware.

    You also have to consider that just because you optimize serialization for intra-cluster communication doesn't mean that the application as a whole will benefit from it. People do things with Java objects which totally kill throughput (like creating arbitrary-sized arrays...)
     
    > As I mentioned somewhere around here awhile back, for 500 bytes or so of data a serialized network comms package would max out around 800 messages/second (this is with everything Externalizable). An alternate implementation that didn't use serialization maxes out around 2,500 messages/second.

    If you have the details of the benchmark (hardware, jvm, workload ..) I'd be curious to see them.
  10. Wire speed Java serialization[ Go to top ]

    G: But on a cluster it doesn't have to be that general - that problem could be optimized away. The cluster nodes could negotiate an encoding of the class names, for example (e.g. String -> 1, Integer -> 2, etc.)

    That's not the problem at all - it is the transitive closure of the graph that is expensive. (I think that's the right term, anyway.)

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  11. Wire speed Java serialization[ Go to top ]

    That's not the problem at all, it is transitive closure of the graph that is expensive. (I think that's the right term, anyway.)


    Then the problem isn't really there for many OLTP applications. Just use tiny graphs.

    For compute clusters, the situation is even better - they probably just need arrays of primitive data.
  12. Wire speed Java serialization[ Go to top ]

    Guglielmo,

    The problem is that they have to support the closure of the graph, even if it isn't used. Writing one Object (with no references to follow) to an ObjectOutputStream and flushing/closing the stream takes literally forever in computer time. That's because the algorithm is very general-purpose. (And it's probably also because the implementation sucks, but I'll let someone else claim that.)

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  13. Wire speed Java serialization[ Go to top ]

    The problem is that they have to support the closure of the graph, even if it isn't used. Writing one Object (with no references to follow) to an ObjectOutputStream and flushing/closing the stream takes literally forever in computer time. That's because the algorithm is very general-purpose. (And it's probably also because the implementation sucks, but I'll let someone else claim that.)


    I don't claim to know much about graphs, but it doesn't take a rocket scientist to see that one way to compute the closure of the graph is to follow the references. Hence, at least one algorithm takes time proportional to the number of references, so the best case of this algorithm is very good. So I don't think that's a conceptual hurdle in OLTP. The conceptual hurdles are allocating and garbage-collecting objects, making copies of data over and over, and preserving the architecture-independence of Java. For example, right now there is no Java spec for talking directly to the network card.

    It's interesting to figure out the absolutely most economical way of shipping objects between JVMs. To me, it boils down to 1) just-in-time instrumentation of classes to let them marshal and unmarshal themselves, and 2) negotiating an economical encoding scheme between nodes so the number of bytes written is as small as possible. For big objects you can actually send diffs, too. And it's probably best to forget ObjectOutputStream, because it's bad news. But of course in an open environment, you have to use open protocols.
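
    The diff idea could be sketched roughly like this (the `RowDiff` class is hypothetical, invented for illustration): a one-byte bitmask flags which fields follow, so an update that touches a single field ships a handful of bytes instead of the whole object.

```java
import java.io.*;

// Hypothetical sketch of sending diffs: only fields that changed relative
// to a baseline object are encoded, flagged by a bitmask byte.
class RowDiff {
    public int x;
    public int y;

    RowDiff(int x, int y) { this.x = x; this.y = y; }

    byte[] diffAgainst(RowDiff old) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            int mask = (x != old.x ? 1 : 0) | (y != old.y ? 2 : 0);
            out.writeByte(mask);                 // which fields follow
            if ((mask & 1) != 0) out.writeInt(x);
            if ((mask & 2) != 0) out.writeInt(y);
            return bos.toByteArray();
        } catch (IOException e) { throw new RuntimeException(e); }
    }

    void applyDiff(byte[] diff) {
        try {
            DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(diff));
            int mask = in.readByte();
            if ((mask & 1) != 0) x = in.readInt();
            if ((mask & 2) != 0) y = in.readInt();
        } catch (IOException e) { throw new RuntimeException(e); }
    }
}
```

    For the two-int row above, changing only `y` costs 5 bytes on the wire (1 mask byte + 4 for the int) rather than re-sending everything.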
  14. Wire speed Java serialization[ Go to top ]

    Guglielmo,

    That is still quite a challenge. Consider the simple case of the following class:

    public class TestSer implements Serializable {
      public int x;
      public TestSer ref;
    }

    Now let's make just two of them:

    TestSer t1 = new TestSer();
    t1.x = 1;
    TestSer t2 = new TestSer();
    t2.x = 2;
    // circular refs
    t1.ref = t2;
    t2.ref = t1;

    Now, when we write t1 to an OOS, it has to write t1 and t2, each exactly one time. How does it know to do that?

    And when it reads t1 and t2 from an OIS, it has to put the refs back together, even though it only reads one at a time (so the other doesn't exist yet to ref to).

    In other words, it may be a known problem space, but it is still not a simple problem space. The data structures that it (OOS and OIS) creates internally to support the implementation of the above appear to be responsible for the bulk of the processing cost.
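
    That bookkeeping can be seen in a small round-trip demo (a sketch; the `CycleDemo` wrapper class is invented for illustration). The stream writes each node exactly once, emitting a back-reference for the cycle, and the reader restores the shared identity:

```java
import java.io.*;

class CycleDemo {
    static class Node implements Serializable {
        int x;
        Node ref;
    }

    // Round-trip a node (and whatever it references) through
    // ObjectOutputStream/ObjectInputStream.
    static Node roundTrip(Node n) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(n);   // each node written once; the cycle
                                      // becomes a back-reference handle
            }
            try (ObjectInputStream ois = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()))) {
                return (Node) ois.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

    After the round trip, following the two refs leads back to the same deserialized object - the identity-tracking tables OOS/OIS maintain to make that work are a large part of the cost.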

    As far as wire speed and moving data among VMs, Java does that a lot faster than it serializes and deserializes objects ....

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  15. Wire speed Java serialization[ Go to top ]

    That is still quite a challenge. Consider the simple case of the following class:

    >
    > public class TestSer implements Serializable {
    >   public int x;
    >   public TestSer ref;
    > }

    I'm not really in synch with this. Imagine the objects representing database rows. Then it looks more like:

    > public class TestSer implements Serializable {
    >   public int x;
    >   public float y;
    >   public String z;
    > }

    Basically I am thinking just a bag of primitive types (including Strings). Your example contains a relationship to another object, and on a cluster you don't necessarily want to ship the relationship.
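
    A bag-of-primitives row like that can be flattened straight onto a DataOutput with no graph bookkeeping at all (a hypothetical sketch; the `RowCodec` class is invented for illustration):

```java
import java.io.*;

// Hypothetical sketch: a row of (int, float, String) flattened directly,
// with none of the reference-tracking overhead of ObjectOutputStream.
class RowCodec {
    static byte[] encode(int x, float y, String z) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(x);
            out.writeFloat(y);
            out.writeUTF(z);            // length-prefixed modified UTF-8
            return bos.toByteArray();
        } catch (IOException e) { throw new RuntimeException(e); }
    }

    static Object[] decode(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes));
            return new Object[] { in.readInt(), in.readFloat(), in.readUTF() };
        } catch (IOException e) { throw new RuntimeException(e); }
    }
}
```

    Since the fields are fixed and there are no references, the encoder never has to ask "have I seen this object before?", which is exactly the check a general mechanism can't skip.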
  16. Wire speed Java serialization[ Go to top ]

    \Guglielmo Lichtner\
    Basically I am thinking just a bag of primitive types (including Strings). Your example contains a relationship to another object, and on a cluster you don't necessarily want to ship the relationship.
    \Guglielmo Lichtner\

    Well, again it comes down to a generalized mechanism for transporting objects vs. a specialized protocol. OOS and OIS can always be handily beaten by a specialized protocol, because the specialized protocol can ignore issues that are central to OOS and OIS.

    The minute you involve object graphs, you run into uglies such as Cameron mentioned. Unfortunately, for generalized Serialization, you need to support these cases. But you're correct that for a cluster solution, if you want speed you're better off specializing your protocol. This was my argument earlier in the thread.

    Back to the examples - even if you're using simple types with no mutual references, unfortunately a generalized solution still has to _check_ for such situations. This is why simple Externalizable approaches show a speedup over Serialization, but this speedup still doesn't approach specialized protocol speeds. You eliminate much of the Reflection required by Serializable, but you're still paying a hit for all the special checks OOS/OIS have to perform.

         -Mike