JGroups evaluation article

  1. JGroups evaluation article (50 messages)

    An INRIA research group has released an evaluation of the popular JGroups group communication middleware used in many open source application servers. The article evaluates the impact of different protocol stacks and networks on performance and scalability in the context of J2EE application servers.

    The article is available for download from the ObjectWeb JMOB web site.
    The direct link to the article is here.

    Threaded Messages (50)

  2. JGroups evaluation article

    Cameron claims his thing transfers over 11.4 MB/s on a 100Mb connection, 2+ times faster than the JGroups-UDP speed reported in the paper.
  3. JGroups evaluation article

    Having to support reliable multicast UDP communication is not a piece of cake. Drop a couple of packets and you get only a fraction of the throughput. Then lose the ordering of the packets and you are in a slump. The situation can turn pretty sour if you use large messages that get fragmented (more than one UDP packet). You only need to lose one packet to lose your entire message - then you resend everything. The point is that in non-ideal and congested conditions you start to miss TCP, and it is probably extremely hard to beat its adaptive protocol, which has been refined for so many years.
  4. JGroups evaluation article

    ...Drop a couple of packets and you get only a fraction of the throughput. Then lose the ordering of the packets and you are in a slump. The situation can turn pretty sour if you use large messages that get fragmented (more than one UDP packet). You only need to lose one packet to lose your entire message - then you resend everything...
    Changed ordering plays little or no role in performance, and you only need to resend the NACK-ed packet, not the entire message. Furthermore, more involved variations would group NACK-ed packets and send them as one bigger message without fragmenting.
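
    For illustration, here is a minimal sketch of the receiver-side bookkeeping being described (hypothetical names, not JGroups' actual NAKACK code): only the sequence numbers that are actually missing get NACKed, and only those packets get resent.

    import java.util.SortedSet;
    import java.util.TreeSet;

    // Tracks the next expected sequence number; on a gap, NACKs only the missing seqnos.
    class NakReceiver {
        private long nextExpected = 0;
        private final SortedSet<Long> missing = new TreeSet<Long>();

        // Returns the seqnos that should be NACKed back to the sender for this arrival.
        synchronized SortedSet<Long> onPacket(long seqno) {
            SortedSet<Long> toNack = new TreeSet<Long>();
            if (seqno == nextExpected) {
                nextExpected++;
            } else if (seqno > nextExpected) {
                for (long s = nextExpected; s < seqno; s++) {  // a gap: request only the holes
                    missing.add(s);
                    toNack.add(s);
                }
                nextExpected = seqno + 1;
            } else {
                missing.remove(seqno);                         // a retransmission filled a hole
            }
            return toNack;
        }
    }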

    Your point about TCP is valid with regard to reliable unicast (UDP). As for reliable multicast - it simply can't be sanely implemented using TCP.

    Regards,
    Nikita.
    xTier - Service Oriented Middleware
  5. JGroups evaluation article

    As for reliable multicast - it simply can't be sanely implemented using TCP.
    Nikita, can you explain this a little?
  6. JGroups evaluation article

    Nikita, can you explain this a little?
    My point was that TCP is not suited for multicasting at all. So, reliable or not, TCP is just not the right choice, and therefore comparing multicast and TCP is like comparing apples to oranges.

    Regards,
    Nikita.
    xTier - Service Oriented Middleware
  7. JGroups evaluation article

    Hi Talip, my post was completely unrelated to this (I didn't even see this paper until today,) and isn't an apples-to-apples comparison at all.

    Bela has posted comparisons to other open source group comms libraries in the past, and jGroups has always been inexplicably sluggish. In the paper that is linked to here, you can see that increasing the hardware (for the NxN tests) by 800% (2 machines to 16 machines) improves the overall throughput only by 50% (aggregate of 1932 msgs per second with 2 machines up to an aggregate of 2960 msgs per second with 16 machines,) and those messages are only 1 byte. (Note: The graphs use logarithmic scaling, so this information is not at all obvious unless you analyze the graphs and accompanying text very closely.)

    Without more test points, it is impossible to tell whether those servers are scaling positively or negatively. For example, it might peak at 9 or 10 servers, and start losing throughput after that (so each additional server makes the overall cluster throughput lower.) It's clear though that even if the scaling is positive, each server is adding 0.03 of linear scale -- at best -- by the time you get to 16 servers.

    The scaling target is 1.0, which is generally unachievable. Anything above 0.70 is considered acceptable (but not good -- "good" is like 0.96 or so) for vertical scaling. Without the raw numbers, it's impossible to determine the actual scaling percentage for additional nodes, though. I attempted to generate the numbers using statistical analysis (using Excel.) It appears that the scaling rate is 0.59. That means that server #3 added over 422 messages per second, while servers #9-#16 (in aggregate!) added a total of 42 messages per second. In other words, server #3 is 8000% as effective for scaling as the average of server #9 through #16. To put it into English, if you work a total of only 5 hours a week for $10,000 a year (i.e. the server #3 rate,) would you consider it an effective use of your time to work an additional 40 hours a week for a total of only $1,000 more a year (i.e. the aggregate rate of servers #9-16)? (I think that this explains why the log10 scale had to be used to show the results.)
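
    To make that arithmetic concrete, here is a back-of-the-envelope sketch using the only two aggregate figures stated above (2 nodes at 1932 msgs/sec, 16 nodes at 2960 msgs/sec); a per-server breakdown like the Excel estimate above would need the raw data:

    public class ScalingCheck {
        public static void main(String[] args) {
            double perNodeAt2 = 1932.0 / 2;                         // ~966 msgs/sec per node with 2 nodes
            double gainPerAddedNode = (2960.0 - 1932.0) / (16 - 2); // ~73 msgs/sec per node added
            System.out.println("Per-node rate at 2 nodes:      " + Math.round(perNodeAt2));
            System.out.println("Gain per added node (2 -> 16): " + Math.round(gainPerAddedNode));
            System.out.println("Average incremental scale:     "
                    + Math.round(100 * gainPerAddedNode / perNodeAt2) + "% of the 2-node per-node rate");
        }
    }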

    The 1-N results are much better (relatively speaking,) but I would think that for 1-N communication, it would be far simpler to use a simple TCP/IP client/server architecture, particularly since these tests used TCP/IP for data transfer anyways. In other words, clients would open a TCP socket to the server, and the server would write messages to all client sockets (either on separate threads or in a simple loop.) Without the additional overhead of jGroups, it should be significantly faster.
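
    A minimal sketch of that kind of plain TCP fan-out (hypothetical port and client count; single-threaded, and deliberately missing everything a group communication toolkit actually provides, such as membership, ordering, and retransmission):

    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.ArrayList;
    import java.util.List;

    public class FanoutServer {
        public static void main(String[] args) throws Exception {
            ServerSocket server = new ServerSocket(7800);    // placeholder port
            List<Socket> clients = new ArrayList<Socket>();
            for (int i = 0; i < 4; i++) {                    // accept a fixed number of clients
                clients.add(server.accept());
            }
            byte[] msg = new byte[1024];                     // 1 KB payload
            while (true) {                                   // write messages to all client sockets
                for (Socket c : clients) {
                    OutputStream out = c.getOutputStream();
                    out.write(msg);
                    out.flush();
                }
            }
        }
    }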

    In terms of functionality, the NxN is the aspect that is most compelling, since 1xN is easy with TCP/IP, but from these results it is obvious that it's only usable in very small clusters and/or with extremely light load.

    Emmanuel, if you are reading this, could you please publish your raw numbers? Maybe as an appendix or something? The graphs look nice, but with the log10 scaling, they cannot be used to infer any realistic conclusions.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  8. TCP/IP comment

    The 1-N results are much better (relatively speaking,) but I would think that for 1-N communication, it would be far simpler to use a simple TCP/IP client/server architecture, particularly since these tests used TCP/IP for data transfer anyways.

    TCP is always simpler but UDP gives better performance for 1-N communications (except for 2 nodes). This is explained in section 5.4 (look at Figure 12).

    >In other words, clients would open a TCP socket to the server, and the server would write messages to all client sockets (either on separate threads or in a simple loop.) Without the additional overhead of jGroups, it should be significantly faster.

    No, because you have to multiply the number of messages (1 UDP multicast message vs. n unicast messages with TCP). What makes the difference is the efficient flow control of TCP.

    >Emmanuel, if you are reading this, could you please publish your raw numbers? Maybe as an appendix or something? The graphs look nice, but with the log10 scaling, they cannot be used to infer any realistic conclusions.

    You are right, we will put all results on the web site next week.

    Emmanuel
  9. Results online

    The full results are now available online at http://jmob.objectweb.org/jgroups/JGroupsJMOBResult.htm

    Emmanuel
  10. Results online

    Hi Emmanuel,

    Thanks for posting the data. Out of curiosity, did my hypothesis on the scaling issues match what you saw?

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  11. Results online

    Actually, that number was going through both a switch and a hub (in series).

    Reading that statement again, it seems like you might have been talking about a network consisting of a switch and a hub...I read it as "I got the number with a switch and I got the number with a hub". My apologies if this is not what you meant.

    As for hiding behind an alias, it is out of necessity. I do not work for a competitor trying to discredit you.
  12. Whoops, wrong thread

    should have posted in the thread titled "100% with hub??"
  13. Results online

    What part of "both a switch and a hub (in series)" was confusing? :-)

        -Mike
  14. Results online

    The word "both" confused me, though the last two words do add more context. I am sure you will appreciate that the slightest dislocation of a word can change the meaning of a sentence, as the location of the word "only" in the following examples show:

      A. Only my friend drew the picture of the child yesterday.
          B. My only friend drew the picture of the child yesterday.
          C. My friend drew only the picture of the child yesterday.
          D. My friend drew the only picture of the child yesterday.
          E. My friend drew the picture of the only child yesterday.
          F. My friend drew the picture of the child only yesterday.

    No conspiracy here, please rest easy.
  15. Results online

    Reading that statement again, it seems like you might have been talking about a network consisting of a switch and a hub...I read it as "I got the number with a switch and I got the number with a hub". My apologies if this is not what you meant.

    The network that we test on internally has two switches and one hub, with multiple computers hung off of each of those devices. The hub is as much a historical artifact as anything. However, because it caused some problems in some of our earliest load tests, it has earned a perpetual license to life as a handy testing tool. In fact, I've been considering finding some old 10Mbit "crappy" switches and hubs just to make things harder on the tests. We've had customers with networks as slow as 2.4Mbit -- a Token Ring WAN, if I remember correctly.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  16. Results online

    If I may translate slightly, Cameron, for those who aren't network buffs - the slower/crappier the network hardware....

      - the more network constrained you are, showing your perf w/ a slow network
      - the more likely you are to exaggerate and elongate latency, and thereby to catch time-related bugs and race conditions.
      - the more likely you are to lose packets one way or another, thereby exercising your "find my packet, please!" protocol.
      - the more likely you are to find out whether you have good flow control that throttles back (a la Nagle, or other algorithms)... or to find out the hard way that your flow control is not so good.

    At home I often get this effect naturally by using my laptop's wireless connection (11Mbits if you are truly free of sin, rather less if you're like me) and working three floors down in the basement for awhile (no joke!).

    I do the same thing to help pop out race conditions by working on older, slower machines or, if I'm cursed with a speedy one (damn you, Moore!) running a CPU-pinning program in the background. Or sometimes a really bad profiler will have the same effect :-) The point is to slow things down by an order of magnitude or several, elongating times and making "simultaneity" more likely.

    A slow or packet-losing-prone network has the same sort of effect, with the added "benefit" of lots of lost packets.

         -Mike
  17. JGroups evaluation article

    Cameron claims his thing transfers over 11.4 MB/s on a 100Mb connection, 2+ times faster than the JGroups-UDP speed reported in the paper.
    Cameron reported the throughput of a point-to-point connection. In no way is that indicative of multicast performance, which is what matters for clustered cache coherency.
  18. JGroups evaluation article

    Cameron reported the throughput of a point-to-point connection. In no way is that indicative of multicast performance, which is what matters for clustered cache coherency.
    You are right Brian, it doesn't mean that multicast will perform the same. Here is my reasoning behind my statement: in section 5.1, the paper says JGroups' maximum is 5MB/s. I thought that maximum would happen with one sender and one receiver. In that case, I would assume the group communication software knows there is only one receiver, so it uses UDP unicast rather than multicast, because there is no reason to multicast to a single receiver. Then the communication happens point-to-point. My assumptions might be wrong of course, please correct me if they are.
  19. JGroups evaluation article

    In section 5.1, the paper says JGroups' maximum is 5MB/s.
    JGroups appears to be the industry's performance leader. AFAIK, Tangosol has made no quantified performance claim.
  20. JGroups evaluation article

    Brian,
    In section 5.1, the paper says JGroups' maximum is 5MB/s.
    I would suggest that a reasonable achievable number is roughly two times that. (That was a 1xN test.)

    Brian: JGroups appears to be the industry's performance leader.

    As I mentioned, Bela's own numbers would clearly disagree, and those were in comparison with other open source comms libraries.

    AFAIK, Tangosol has made no quantified performance claim.

    There's a good reason for that: Tangosol doesn't sell a comms library.

    Please stay objective. Read the paper. Read what I posted above. Is there anything I said that is obviously incorrect? Do you think that scaling "up to" 1KB/s on a gigabit network could indicate a problem? You can bury your head in the sand, or you can look at it objectively and realize that there might be some real technical problems.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  21. About bandwidth

    Just a comment on the previous posts that discussed bandwidth. Bear in mind that the article reports throughput per node and NOT aggregated throughput.
    You can have 5MB/s per node with 16 nodes (in which case we report 5MB/s), but the aggregated bandwidth is 80MB/s (the kind of values usually reported by Cameron).
    If you look at section 5.3 of the article, especially Figure 10, you will see that FastEthernet is saturated starting with messages of 10KB.
  22. About bandwidth

    I find bandwidth measurements to not mean all that much - they don't translate well into business needs at all. Msgs/sec is a much better indicator - given _my_ messages of some average size, how many can I pump through the system?

    Looking at the various graphs, JGroups seems to be limited to a max of ~1,300 messages/second, which seems quite low to me. Talk about bandwidth utilization increasing as you increase message size is rather meaningless - the larger the payload, the more work you're putting on the network stack, i.e. larger payloads don't test the communications software very much, just the OS IP stack and the network itself. With dual 900 MHz systems I'd expect a lot better throughput. People are talking about network saturation - but for 1K-sized messages clearly the network isn't saturated, and it's still stuck at 1,300 msgs/sec. This is why I always like to see CPU utilization #s in conjunction with these sorts of tests.

        -Mike
  23. About bandwidth

    I find bandwidth measurements to not mean all that much - it doesn't translate well into business needs at all. Msgs/sec is a much better indicator - given _my_ messages of some average size, how many can I pump through the system?Looking at the various graphs, JGroups seems to be limited to a max of ~1,300 messages/second, which seems quite low to me. Talk about bandwidth utilization increasing as you increase message size is rather meaningless - the larger the payload, the more work you're putting on the network stack e.g. larger payloads don't test the communications software very much, just the OS IP stack and the network itself. With dual 900 MHz systems I'd expect alot better throughput. People are talking about network saturation - but for 1K-sized messages clearly the network isn't saturated, and it's still stuck at 1,300 msgs/sec. This is why I always like to see CPU utilization #s in conjunction with these sorts of tests. -Mike
    I can also cheat on this one by using bundling in the transport. Wait for n bytes to accumulate or m ms until you send all bundled msgs in one (larger) packet.
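
    A minimal sketch of what such transport-level bundling can look like (hypothetical names, not JGroups' actual bundler; a real one would also need a timer to flush a partially filled buffer when traffic stops):

    import java.io.ByteArrayOutputStream;

    class Bundler {
        private final int maxBytes;
        private final long maxMillis;
        private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
        private long firstMsgTime = -1;

        Bundler(int maxBytes, long maxMillis) {
            this.maxBytes = maxBytes;
            this.maxMillis = maxMillis;
        }

        synchronized void queue(byte[] msg) {
            if (firstMsgTime < 0) firstMsgTime = System.currentTimeMillis();
            buf.write(msg, 0, msg.length);                   // accumulate small messages
            boolean full = buf.size() >= maxBytes;
            boolean aged = System.currentTimeMillis() - firstMsgTime >= maxMillis;
            if (full || aged) flush();
        }

        private void flush() {
            byte[] packet = buf.toByteArray();
            // transport.send(packet);                       // one larger packet instead of many small ones
            buf.reset();
            firstMsgTime = -1;
        }
    }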

    *Without* cheating my prelim measurements indicate a msg rate of about 6000-8000 msgs of 1K size in my previously explained configuration.

    I hope to wrap up my tests and publish them within the next weeks.

    Bela
  24. About bandwidth

    I can also cheat on this one by using bundling in the transport. Wait for n bytes to accumulate or m ms until you send all bundled msgs in one (larger) packet.
    You consider batching requests up cheating? I thought just about every comms package had an option to batch requests into one larger request. In fact it's a common technique when dealing with any "slow" resource. If this is cheating, then just about everyone writing software with I/O in it has been cheating for the past 4 decades or so :-)
    *Without* cheating my prelim measurements indicate a msg rate of about 6000-8000 msgs of 1K size in my previously explained configuration.
    I guess we'll have to wait for your results and raw data and see how they compare to Emmanuel's raw data when he posts it.

    As an aside it'd be really nice if per test you published:

       - Avg. CPU Utilization
       - Avg. Throughput (msgs/second)
       - Avg. Latency (Important! - I could get 100,000 msgs a second if I'm allowed to deliver messages a few days later :-)
       - Min Latency
       - Max Latency
       - Max throughput
       - Min throughput

    Yeah, I know it's a lot of work :-) But data of this sort is invaluable to get a peek into some of the critical characteristics of the running code - avg throughput alone can be a very deceiving number.

    For my own NIO stuff I'm building these things in as optional stats that are held within the running state that you can query, and as a start the code now has stats for avg. write/process queue rates, multipliers for greedy reads/writes/processes, etc. It's been a while since I played with JGroups, but if it doesn't have this I highly recommend it - having stats _built in_ (and probably with a capability to disable them if you're really perf oriented) to the running core code is pure heaven for tuning configurations, debugging problems, and as a quick way for end users to gauge performance with their real application (as opposed to getting it from a performance driver).
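
    A minimal sketch of the kind of built-in, queryable stats described here (illustrative names only, not an existing JGroups or NIO API):

    class CommStats {
        private long msgs, bytes;
        private long latencySum, latencyMax, latencyMin = Long.MAX_VALUE;

        synchronized void record(int msgBytes, long latencyMillis) {
            msgs++;
            bytes += msgBytes;
            latencySum += latencyMillis;
            if (latencyMillis < latencyMin) latencyMin = latencyMillis;
            if (latencyMillis > latencyMax) latencyMax = latencyMillis;
        }

        // Cheap enough to query from a live system, e.g. via a management thread.
        synchronized String snapshot(long elapsedMillis) {
            double secs = elapsedMillis / 1000.0;
            return "msgs/sec=" + Math.round(msgs / secs)
                 + " MB/sec=" + (bytes / (1024.0 * 1024.0)) / secs
                 + " latency(ms) min=" + (msgs == 0 ? 0 : latencyMin)
                 + " avg=" + (msgs == 0 ? 0 : latencySum / msgs)
                 + " max=" + latencyMax;
        }
    }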

       -Mike
  25. About bandwidth

    Looking at the various graphs, JGroups seems to be limited to a max of ~1,300 messages/second, which seems quite low to me.
    Note that in the J2EE context, you only use group communications for updates. If your workload is 85% read (like most eCommerce applications), it means that if JGroups saturates at 1,300 msgs/sec you will have to handle more than 10,000 reqs/sec with more than 8,500 of them not doing any group communication. It is not that obvious that in this context the group communication middleware will be the bottleneck.
    This is why I always like to see CPU utilization #s in conjunction with these sorts of tests.
    We are working on it. It is hard to have a distributed monitoring framework that reports accurate resource usage with limited intrusiveness. Moreover, the average usage usually does not have any real meaning since the behavior is usually bursty.

    Emmanuel
  26. About bandwidth

    Note that in the J2EE context, you only use group communications for updates. If your workload is 85% read (like most eCommerce applications), it means that if JGroups saturates at 1,300 msgs/sec you will have to handle more than 10,000 reqs/sec with more than 8,500 of them not doing any group communication. It is not that obvious that in this context the group communication middleware will be the bottleneck.
    A few notes on this:

       1) A single logical end-user request can easily result in multiple update operations. 1 user request can end up equalling "N" group communications.

   2) Saying "...your workload is 85% read (like most eCommerce applications)" hits the old adage - the industry average is X, which guarantees that almost no one actually is exactly at X :-). What matters to people is not what the industry average is, but what their own application does. What if _my_ app has only a 60% read average? What if my app can be bursty e.g. a business event can cause an upswing in updates over several minutes? Tie this in with #1 above.
    We are working on it. It is hard to have a distributed monitoring framework that reports accurate resource usage with limited intrusiveness. Moreover, the average usage usually does not have any real meaning since the behavior is usually bursty.
    I hear what you're saying, having done this sort of thing myself, but I can tell you that despite the difficulties, having some data on CPU usage is much better than having none. Even a ballpark eyeballed estimate by looking at a graphical CPU monitor can be highly enlightening. It particularly helps in eliminating assumptions. Rather than make assumptions about network bottlenecks and bandwidth limitations, throw in a simple CPU monitor and you may find that your CPU is pinned at 100%, and that network analysis and talking about net saturation is useless when it's obvious that your CPU is starting to smoke a bit :-)

    In my own network testing, it's common for people to assume that the network is going to be the bottleneck, and find out later that it wasn't the network throttling them but excessive CPU usage, or a poor choice of distribution algorithm, or something else entirely that has nothing to do with the network.

    In the JGroups analysis you did, a number of tests ended up getting nearly the same msgs/sec rates, and larger messages lead to better net utilization and not much else. OK - the larger the message the more you're offloading work from the comms framework (JGroups) and the more you're putting on the TCP/IP stack and network itself. Clearly the net can efficiently zoom data around 'cause larger messages are getting much more data out on the network per second. At lower message sizes JGroups itself becomes more prominent in the loop. The question then becomes "why is JGroups under-utilizing the network so much for msgs <=10K?".

    The reason why this is all important - with CPU usage included - is to really see the impact this will have on real communications. I don't know what exactly was going on for this test, but imagine for a second that JGroups was eating 75% CPU for this test and getting 1,300 msgs/second. Now imagine that you put a real application and potentially heavy workload on the server. In this scenario, with JGroups eating 75% of your CPU, that doesn't leave much for your app code :-/ So your comms and application code could be competing for CPU and limiting your msg rates that way as well.

         -Mike
  27. About bandwidth

    A single logical end-user request can easily result in multiple update operations. 1 user request can end up equalling "N" group communications.
    Developers have to be aware that group communications are always expensive. You have to design your application with clustering in mind if you want it to scale. If every single update in the system translates to a group communication, then you are dead for sure.
    The question then becomes "why is JGroups under-utilizing the network so much for msgs <=10K?". The reason why this is all important - with CPU usage included - is to really see the impact this will have on real communications.
    We are talking about the application data that is sent through JGroups. The real amount of data that will go over the wire is highly dependent on your stack configuration. With larger messages, the extra information added by JGroups is negligible while it can be prominent with small messages.

    About the CPU usage, even if we can give a rough idea (0, 25, 50, 75 or 100%), it only means something for our setup (hardware, Linux kernel version, JVM and so on). You just change one thing in this configuration (especially kernel and JVM) and the results might be very different. Our philosophy is to talk only about results when we have strong numbers to backup our statements. This is currently not the case for cpu numbers.

    Emmanuel
  28. About bandwidth

    Developers have to be aware that group communications are always expensive. You have to design your application with clustering in mind if you want it to scale. If every single update in the system translates to a group communication, then you are dead for sure.
    Of course, but that's not my point. My point was that you were best-casing several scenarios at once - assuming that one user request translates into one group communication, and that 85% of requests are read-only. With those assumptions, an app may be OK with the numbers you gave.

    But change the assumptions slightly and the end results can change much more dramatically, due to multiplier effects. Something as innocent as dropping the read-only percent by 20% and averaging two group updates per request could change the picture from acceptable to not-acceptable.
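
    To put rough numbers on that (my own back-of-the-envelope figures, not Emmanuel's): at 85% reads and one group message per update, a 1,300 msgs/sec ceiling supports roughly 1,300 / 0.15, or about 8,700 requests/sec; at 65% reads and two group messages per update it supports only about 1,300 / (0.35 * 2), or roughly 1,850 requests/sec. Two modest changes in assumptions cut the supportable request rate by nearly a factor of five.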
    We are talking about the application data that is sent through JGroups. The real amount of data that will go over the wire is highly dependent on your stack configuration. With larger messages, the extra information added by JGroups is negligible while it can be prominent with small messages.
    I think you missed what I was getting at. For stacks that don't interpret the payload (most I believe), when your message size gets very large you're no longer testing JGroups - you're testing the TCP/IP stack and the network. Tests with smaller messages show how well JGroups itself does e.g. it accentuates whatever overhead JGroups (or any comms package really) is adding to the equation.
    About the CPU usage, even if we can give a rough idea (0, 25, 50, 75 or 100%), it only means something for our setup (hardware, Linux kernel version, JVM and so on). You just change one thing in this configuration (especially kernel and JVM) and the results might be very different. Our philosophy is to talk only about results when we have strong numbers to backup our statements. This is currently not the case for cpu numbers.
    I'm sorry but this doesn't hold water. _Every single number you give_ is dependent upon your setup. As you say "just change one thing" and your throughput and latency numbers can change. This isn't a reason not to quote the numbers - you just tell people what the setup is so they can quantify it.

    To put my view on this plainly - ballpark CPU numbers are far better than _no_ CPU numbers. Data with an accuracy of +- 10% (or even 20%) is better than zero data. And in the Java world CPU utilization is a very critical measurement and very important for people to make decisions with. Again, just for a moment, imagine that the tests you did ate roughly 75% of the CPU. This means an app running in the container can do very little work and still sustain that rate - add a bit more work and the max msgs/sec will drop. Now imagine that the tests ate 25% of the CPU (again +- 10%) - then someone evaluating it can know that they can put a fair load in their application and still have a hope of coming close to the numbers you quote.

    This is almost certainly one of the reasons why Bela is talking about tests with varying CPUs (number, and maybe speed) - it tells you _a lot_ about what resources the underlying package requires. As it is now with your tests - were the machines quoted maxed out on CPU and thereby showing a theoretical max rate? Or was there enough idle time that you could run other code and still achieve (or at least come close to) that rate? Or perhaps was the CPU severely underutilized, indicating a scaling problem? Don't know, can't tell without CPU numbers. And again - even just a gross approximation can give you tons of hints as to what the underlying software is doing.

         -Mike
  29. About bandwidth

    I think you missed what I was getting at. For stacks that don't interpret the payload (most I believe), when your message size gets very large you're no longer testing JGroups - you're testing the TCP/IP stack and the network. Tests with smaller messages show how well JGroups itself does e.g. it accentuates whatever overhead JGroups (or any comms package really) is adding to the equation.
    I don't agree since JGroups has to perform fragmentation for large messages and the cost of this fragmentation is larger than the cost of the IP stack. The cost of the kernel IP stack remains marginal in our experiments.
    I'm sorry but this doesn't hold water. _Every single number you give_ is dependent upon your setup. As you say "just change one thing" and your throughput and latency numbers can change. This isn't a reason not to quote the numbers - you just tell people what the setup is so they can quantify it.To put my view on this plainly - ballpark CPU numbers are far better than _no_ CPU numbers. Data with an accuracy of +- 10% (or even 20%) is better than zero data.
    I have been doing Java benchmarking for the past 3 years. Just changing the JVM, with the same Linux kernel, can change the CPU usage by more than 60% (on a uniprocessor machine, even worse with multiprocessors), especially when you are using communications. So I don't think CPU numbers are really relevant in Java environments.
    Just FYI, the average CPU usage on our testbed was around 70% if that means anything.

    Emmanuel
  30. About bandwidth

    I don't agree since JGroups has to perform fragmentation for large messages and the cost of this fragmentation is larger than the cost of the IP stack. The cost of the kernel IP stack remains marginal in our experiments.
    How do you know? You've stated you haven't monitored CPU utilization, and you haven't profiled the app. So how do you know?

    In addition, I've done my own tests with TCP/IP stacks. When you start talking about 100K/1MB/10MB messages, just sending the raw data out dominates your wall-clock time and is not "marginal". Go use any such test that pushes 100K/1MB/10MB messages through TCP/IP or multicast from one end point to another and tell me that the times to do so are so low that they're marginal.

    Hmmm....and, amazingly, your network utilization goes up by orders of magnitude as you step up message sizes, up to a peak for 10K messages. Why do you think that's so? And where does the bandwidth peak come from - JGroups or the network and tcp/ip stack?
    I have been doing Java benchmarking for the past 3 years. Just changing the JVM, with the same Linux kernel, can change the CPU usage by more than 60% (on a uniprocessor machine, even worse with multiprocessors), especially when you are using communications. So I don't think CPU numbers are really relevant in Java environments.
    Just FYI, the average CPU usage on our testbed was around 70% if that means anything.
    I want to make sure I understand - because changing a core component can radically change CPU utilization, you conclude that reporting CPU utilization is irrelevant? Well - that's a rather unique viewpoint.

    As for me - don't you think a rough average of 70% CPU utilization is rather high for 1,300 msgs/second? Further - couldn't you rather easily make the statement that on the hardware you were using you could never get over 2,000 msgs/sec - since you're already around 70% for 1,300? And did CPU utilization vary much/at all as you increased message size?

        -Mike
  31. About bandwidth

    How do you know? You've stated you haven't monitored CPU utilization, and you haven't profiled the app. So how do you know?
    I don't think that top is reliable for reporting cpu usage but it can give a very rough idea. Having a constant usage of 0.x% of kernel cpu time gives you a rough idea of the time you spend in the kernel IP stack. From previous experiences with communication profiling in J2EE servers, you can spend a lot of time in java.net classes but for me it is not the IP stack of the operating system.
    In addition, I've done my own tests with TCP/IP stacks. When you start talking about 100K/1MB/10MB messages, just sending the raw data out dominates your wall-clock time and is not "marginal". Go use any such test that pushes 100K/1MB/10MB messages through TCP/IP or multicast from one end point to another and tell me that the times to do so are so low that they're marginal. Hmmm....and, amazingly, your network utilization goes up by orders of magnitude as you step up message sizes, up to a peak for 10K messages. Why do you think that's so? And where does the bandwidth peak come from - JGroups or the network and tcp/ip stack?
    Can you tell me how much time is spent in the JVM and in the kernel? With current Ethernet adapters, a lot of processing is offloaded to the adapter. You can easily flood your network without using many cpu resources on a machine. An iowait does not consume that many resources!
    As for me - don't you think a rough average of 70% CPU utilization is rather high for 1,300 msgs/second? Further - couldn't you rather easily make the statement that on the hardware you were using you could never get over 2,000 msgs/sec - since you're already around 70% for 1,300? And did CPU utilization vary much/at all as you increased message size?
    CPU utilization varies a lot during a single experiment, from 0 to 100% (the average seems to be 70%, but with a lot of variation). I don't really get the point about getting 2,000 msgs/s since we already can't get more than 1,300 msgs/s.
    Not all problems are cpu bound, and cpu is not always the most relevant metric, that was just my point. Once again, we don't have accurate profiling numbers that allow us to conclude anything about cpu usage. Probably in a next paper ;-)
  32. About bandwidth

    I don't think that top is reliable for reporting cpu usage but it can give a very rough idea. Having a constant usage of 0.x% of kernel cpu time gives you a rough idea of the time you spend in the kernel IP stack. From previous experiences with communication profiling in J2EE servers, you can spend a lot of time in java.net classes but for me it is not the IP stack of the operating system.

    [...]

    Can you tell me how much time is spent in the JVM and in the kernel? With current Ethernet adapters, a lot of processing is offloaded to the adapter. You can easily flood your network without using many cpu resources on a machine. An iowait does not consume that many resources!
    I think we're still missing each other somewhere. In my initial comments I very carefully stated and was talking about "the tcp stack _and_ the network". The point being: what are you testing? Are you trying to test JGroups, or are you trying to test your network setup? Assuming what you want to test is JGroups, then what people are going to be interested in is _the overhead that JGroups imposes_, plus how well JGroups utilizes the resources it has available to it.

    The reason I'm pointing this out is that your report seems to not comment at all on the fact that the message rates reported, particularly for smaller messages, are _terrible_. Seriously. Both the TCP and UDP tests are capping out at the same message rates of a bit over 1,000 msg/second. Take a look at other networking software and you'll see that they average 3-5 times the numbers you quote for small messages.

    The reason I'm asking about CPU numbers is because that's one important piece of the puzzle when looking at a benchmark like this. If you're steady at around 70% - that tells you something. If you're bouncing all over the place on CPU with bursts and relative lulls - that also tells you something.

    It's the same with any resource - it always helps if you can measure it. Just like with CPU, memory consumption and use is a factor. You reported out of memory conditions with larger objects in several cases - this is another piece of the puzzle. Likewise a test case w/ only 10,000 messages is yet another piece - some of your tests should be over in about 10 seconds - yikes! In my own work 100,000 msgs is the bare-bones minimum I'll run with.

    The point of this isn't so much to criticize as to point out that there are a lot of funny things seemingly going on in your tests, and they jibe with other similar tests I've seen. There could be a scaling/perf problem in JGroups - or there could be something fundamentally wrong with the tests or environment.
    CPU utilization varies a lot during a single experiment, from 0 to 100% (the average seems to be 70%, but with a lot of variation). I don't really get the point about getting 2,000 msgs/s since we already can't get more than 1,300 msgs/s.
    Maybe I'm just a hopeless pedant and a stickler for detail - but don't you want to know why?!? I sure would. And the reason for providing information like CPU usage is a double-check for people trying to validate/reproduce the results - you can use other numbers beyond throughput to double-check results. That is - you can correlate various resource utilization numbers, profiling information, and various stats captured by the tests to double-check each for consistency and correctness. If one or more sets of numbers look too low - or too high - you can use the other numbers to cross-check what might be going on, and where there might be a glitch in the test or in the environment. As an aside that's why I always try to include latency numbers in these sorts of tests - it's yet another number that can be used to correlate and cross-check.

        -Mike
  33. JGroups evaluation article

    Brian: Cameron reported the throughput of a point-to-point connection. In no way is that indicative of multicast performance, which is what matters for clustered cache coherency.

    Actually, it would be roughly the same speed for multicast UDP (same packet structure, just a different address range.) The challenge with multicast UDP is having a lot of it on a network in which there is more than one server talking, because the multicast traffic covers the full extent of the switched fabric (it goes to all endpoints.) So, as long as you have only one sender sending out multicast UDP and no one else sending ANY packets on the network, you'd see the same 95Mbit/second or so (theoretically.) This perfect condition doesn't tend to exist in the real world, in which things like acks and nacks tend to swim back upstream, colliding with the oncoming multicast traffic, not to mention the fact that more than one server typically has something to "say."

    This is one of the problems that the paper discusses: jGroups is trying to talk among a lot of servers, without all of them talking at the same time (which causes a multicast "flood.") The paper comes to the conclusion that even with a gigabit ethernet network, each server in a 16-server cluster only gets about 1KB/s of data through. (With 16 servers, that accounts for only about 1/8000 of the maximum throughput of the network.) That's a pretty sure indication of a multicast flood condition.
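
    To make the "same packet structure, different address range" point concrete, here is a minimal sketch (arbitrary group address and port) of sending to and subscribing to a multicast UDP group in Java:

    import java.net.DatagramPacket;
    import java.net.InetAddress;
    import java.net.MulticastSocket;

    public class McastExample {
        public static void main(String[] args) throws Exception {
            InetAddress group = InetAddress.getByName("230.1.2.3"); // any class D (224.0.0.0/4) address
            MulticastSocket sock = new MulticastSocket(45566);
            sock.joinGroup(group);                  // receivers subscribe to the group
            sock.setTimeToLive(1);                  // keep the traffic on the local segment

            byte[] payload = new byte[1024];
            sock.send(new DatagramPacket(payload, payload.length, group, 45566));
            // One send() reaches every endpoint that joined the group -- and on a hub, or a
            // switch without multicast filtering, the frames are flooded to every port anyway.
            sock.close();
        }
    }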

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  34. JGroups perf paper

    I haven't yet had time to read Emmanuel's paper, looking forward to reading it.

    My own preliminary tests (org.jgroups.tests.perf) on a 100Mbps switched ethernet indicated a throughput of 9.2 MB/sec max, compared to 10.5 with netperf.
    The setup was
    - 4 boxes, 4 members, 2 of them senders. Senders receive their own mcasts
    - 100Mbps switched ethernet
    - Compaq Proliant 2CPU boxes (don't know CPU speed)
    - 128MB memory (-Xmx128m)
    - Message size 8KB
    - fc-fast config (flow control, aggressive message garbage collection)

    I observed network saturation with 2-CPU boxes.

    I'm running the tests with 100000 msgs each, and msg sizes ranging from 10 bytes to 100000 bytes.
    I hope to publish those results soon.

    If I enable cheating (a.k.a bundling and compression protocols), I get even better results (e.g. 25MB/sec on a 100Mbps switched ethernet) :-)

    Bela
  35. Bela Ban's measurements

    My own preliminary tests (org.jgroups.tests.perf) on a 100Mbps switched ethernet indicated a throughput of 9.2 MB/sec max, compared to 10.5 with netperf.

    It is strange that you don't get better throughput with netperf. On several clusters (various IA32 and IA64) we have always obtained at least 94Mb/s (11.75MB/s) on FastEthernet. We have seen large variations with Gigabit adapters, but now all FastEthernet adapters perform almost the same.

    >I observed network saturation with 2-CPU boxes.

    I didn't get this one. Does that mean that you do not saturate the network with single CPU boxes?

    >If I enable cheating (a.k.a bundling and compression protocols), I get even better results (e.g. 25MB/sec on a 100Mbps switched ethernet)

    We tried compression but it is highly dependent on your data and in fact performance was better without compression in our experiments.
  36. Bela Ban's measurements

    It is strange that you don't get better throughput with netperf. On several clusters (various IA32 and IA64) we have always obtained at least 94Mb/s (11.75MB/s) on FastEthernet.
    We got 10.5MB/sec with netperf. That's what I measured. The theoretical max is 100/8 = ~12MB/sec, but that's never achieved because realistically you always have other traffic in the network (I didn't have the NW for myself).
    I observed network saturation with 2-CPU boxes.I didn't get this one. Does that mean that you do not saturate the network with single CPU boxes?
    Yes, with 1 CPU I got about 20% network saturation, with 2 almost 90%. Which means JGroups is CPU-bound, at least in that configuration.
    We tried compression but it is highly dependent on your data and in fact performance was better without compression in our experiments.
    Of course. My data lent itself to compression. That's why I don't include a compression protocol in my tests, it is highly app-dependent.

    Bela
  37. Bela Ban's measurements

    We got 10.5MB/sec with netperf. That's what I measured. The theoretical max is 100/8 = ~12MB/sec, but that's never achieved because realistically you always have other traffic in the network (I didn't have the NW for myself).
    On a point-to-point connection on a switched network, unless your switch is flooded with multicast or broadcast packets, you never notice significant degradation. Was there any JGroups experiment running in parallel with this netperf measurement?
    Yes, with 1 CPU I got about 20% network saturation, with 2 almost 90%. Which means JGroups is CPU-bound, at least in that configuration.
    Could you look at /proc/cpuinfo to tell us what kind of cpu it is?
    Also I'd be interested in the brand and model of your Ethernet adapter (/proc/pci should do it).

    Emmanuel
  38. JGroups evaluation article

    Cameron claims his thing transfers over 11.4 MB/s on a 100Mb connection
    Hi :) I didn't realize that Ethernet could operate at 95% utilization. Somehow I thought it was less than that ..
  39. JGroups evaluation article

    Hi :) I didn't realize that Ethernet could operate at 95% utilization. Somehow I thought it was less than that ..
    If you have a fully switched network, it is common to get this number. The story is completely different with hubs!
  40. JGroups evaluation article

    Hi :) I didn't realize that Ethernet could operate at 95% utilization. Somehow I thought it was less than that ..
    100Mbit Ethernet can run at up to 100% of 100Mbit/second. Getting higher level constructs -- such as UDP and TCP -- to approach 100Mbit is a much bigger challenge, due to the overhead of the protocols (particularly with TCP/IP.) That's why I was surprised to see the number that I got. (e.g. Several years back, the best I could get was about 65Mbit/second in Java, so it's not my own back I'm patting, but the OS / Ethernet driver / JVM / Java standard libraries engineers who have continued to improve the stack.)
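
    For anyone who wants to reproduce that kind of number, here is a rough sketch of a single-stream TCP throughput probe (hypothetical peer host and port; the far side just has to accept the connection and drain the bytes):

    import java.io.OutputStream;
    import java.net.Socket;

    public class TcpThroughput {
        public static void main(String[] args) throws Exception {
            Socket s = new Socket("peer-host", 9999);        // placeholder peer
            OutputStream out = s.getOutputStream();
            byte[] chunk = new byte[64 * 1024];
            long totalBytes = 100L * 1024 * 1024;            // push 100 MB
            long start = System.currentTimeMillis();
            for (long sent = 0; sent < totalBytes; sent += chunk.length) {
                out.write(chunk);
            }
            out.flush();
            double secs = (System.currentTimeMillis() - start) / 1000.0;
            System.out.println((totalBytes / (1024.0 * 1024.0)) / secs + " MB/sec");
            s.close();
        }
    }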

    Emmanuel: If you have a fully switched network, it is common to get this number. The story is completely different with hubs!

    Actually, that number was going through both a switch and a hub (in series). Hubs are great for testing ... if you can get the software to work well on a hub, you have a good indicator that it's not going to choke when things get busy. I'm even starting to ponder whether a hub might be the ideal device for a multicast-only network. It seems logical ..

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  41. JGroups evaluation article

    Actually, that number was going through both a switch and a hub (in series). Hubs are great for testing ... if you can get the software to work well on a hub, you have a good indicator that it's not going to choke when things get busy. I'm even starting to ponder whether a hub might be the ideal device for a multicast-only network. It seems logical ..
    Usually a respectable switch has store-and-forward capabilities that will prevent collisions that could occur with a hub. Even with point-to-point communications, the ack usually implies a collision with a hub, whereas a switch will store the ack and forward it when the sender is available. Unless your switch doesn't use any buffer for multicast packets, you should always notice a significantly higher throughput with a switch. To me, it does not seem that logical that a hub should be ideal.

    Emmanuel
  42. 100 % with hub?

    Only if there's one sender and one receiver. If there are multiple transmitters trying to transmit at the same time, there are gonna be collisions - how are you going to get 100% bandwidth?
  43. 100 % with hub?

    Only if there's one sender and one receiver. If there are multiple transmitters trying to transmit at the same time, there are gonna be collisions - how are you going to get 100% bandwidth?
    Even with one sender and one receiver, the acks of the receiver will collide with the sender packets. The only ideal case would be a UDP connection where no packet is lost and no ack has to be sent (very unlikely to happen in real life!).

    Emmanuel
  44. 100 % with hub?

    I am talking purely at the Ethernet level, not above that. In any case, I would like to hear from Cameron how he can obtain 100% utilization with a hub.
  45. 100 % with hub?

    I am talking purely at the Ethernet level, not above that. In any case, I would like to hear from Cameron how he can obtain 100% utilization with a hub.

    Please refresh my memory, as to when I claimed to achieve 100%. In the future, should you desire to make false claims, please do not be so cowardly as to hide behind an alias.

    As for achieving higher flow rates for NxN communication, the key is to coordinate usage of the overall capacity of the medium. That's another way to say "flow control," of course.

    As for the question "why a hub," I would suggest that a simple hub is ideal for multicast traffic because it does not introduce any additional collision or latency penalties over and above those introduced by the multicast packets on the wire themselves. In other words, while it is the worst device for point-to-point traffic, it is an ideal device for all-points traffic, because it is -- by its own nature -- an all-points device.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  46. Comments on perf measurements

    I finally read the paper, and would like to make a few comments.

    Was this done on 1-CPU or multi-CPU machines? JGroups is CPU-bound, so 2 CPUs definitely help.

    Only 10000 messages were sent. I usually send at least 100000 msgs to avoid skewed results. Maybe I misread the paper, and this was only done for the UDP configuration ?

    Section 2.2: "As TCP already provides flow control, there is no need to use JGroup flow control". This is not correct. TCP provides flow control between 1 sender and 1 receiver, however in groupcomm we have to provide flow control between 1 sender and n receivers, and have to send at the rate of the slowest receiver.

    Section 2.2: "However, due to dependencies ... STABLE layer is useless in the TCP context". This is not correct either. TCP buffers msgs for retransmission between 1 sender and 1 receiver. However, if you send messages to multiple receivers, even if they use message garbage colection 1-1, this does not apply to 1-n. Consider the case where we have members P,Q,R and we use the TCP transport. P sends message m1, and crashes after Q receives the message. TCP provides end-to-end reliability between P and Q, and P and R, but not between P and {Q,R}. So R will never get m1. With JGroups, m1 can be retrieved (and delivered to R) as long as 1 (non-faulty) member received m1 (in this case Q).
    Because we buffer messages for retransmission, we need to have STABLE, which periodically garbage collects messages seen by everyone.
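
    A conceptual sketch of that buffering-plus-stability idea (illustrative names, not the actual NAKACK/STABLE code): every member keeps delivered messages until a stability round reports that everyone has them, so any surviving member can re-serve a message whose sender has crashed.

    import java.util.SortedMap;
    import java.util.TreeMap;

    class RetransmitBuffer {
        private final SortedMap<Long, byte[]> bySeqno = new TreeMap<Long, byte[]>();

        synchronized void store(long seqno, byte[] msg) {
            bySeqno.put(seqno, msg);                 // kept in case some member missed it
        }

        synchronized byte[] retransmit(long seqno) {
            return bySeqno.get(seqno);               // served by any member that has it, not just the sender
        }

        // Called when a STABLE round reports the highest seqno delivered by *every* member:
        // nothing at or below that point can ever be requested again, so it can be purged.
        synchronized void purgeUpTo(long stableSeqno) {
            bySeqno.headMap(stableSeqno + 1).clear();
        }
    }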

    Section 3.4: the ref to JGroup (not JGroups (with an 's')) is wrong.

    Section 3.4: "[...] JGroups abstractions are not very well suited to J2EE clustered applications". Can you elaborate ? Otherwise you should not include such a statement (which is wrong IMO).

    Section 6: "[...] JGroups channel requires 28 threads [...]".
    This is not correct. You can configure a stack to have 2 threads per protocol, or none, or 1. So if you look into the JGroups/conf dir, we can have 28 threads (default.xml) or only 3 threads (fc-fast.xml). JGroups allows you to configure this individually per protocol, allowing for optimal configuration.

    With respect to n-n performance: I have seen different results in my preliminary tests, actually on my home laptop (all 4 nodes on the same machine), I got 938 1K msgs/sec. This went down slightly as n varied from 1 to 4.
    I have to verify those results though, so they may turn out to be wrong.

    If you use a switch, and send 10 msgs to a group of 10, with IP multicast the switch would receive 10 packets (assuming no fragmentation at the IP level, e.g. by staying under the MTU), and forward them. With TCP, we have a mesh, and the switch would receive 100 packets (actually probably less because of loopback of the local TCP). Anyway, I would expect to get much better performance with n-n and IP multicast than with n-n and TCP.
    This is what I have seen before, but your results contradict this. I'm curious to see what my own perf experiments will say.

    Cheers,

    Bela
  47. Comments on perf measurements

    I finally read the paper, and would like to make a few comments. Was this done on 1-CPU or multi-CPU machines? JGroups is CPU-bound, so 2 CPUs definitely help.
    Section 4.3: dual 900 MHz Itanium-2, Linux 2.4.18e31smp
    Only 10000 messages were sent. I usually send at least 100000 msgs to avoid skewed results. Maybe I misread the paper, and this was only done for the UDP configuration ?
    4.4: 4 stacks: UDP, UDP-FLOW, UDP-FC and TCP
    5.2. TCP Results
    5.4. UDP vs TCP
    Section 2.2: "As TCP already provides flow control, there is no need to use JGroup flow control". This is not correct. ... With JGroups, m1 can be retrieved (and delivered to R) as long as 1 (non-faulty) member received m1 (in this case Q).
    I agree, as we said in the second paragraph, but JGroups does not provide uniform reliable multicast, which means it does not ensure that, if a failure occurs, either everybody receives a message or nobody does. A message might have been delivered locally to the sender (and handled), and then the sender fails, but nobody else will ever receive the message. This is too bad if the handler does an update in the system (like an update in a remote database).
    Because we buffer messages for retransmission, we need to have STABLE, which periodically garbage collects messages seen by everyone.
    In the TCP case, this is far from optimal since the TCP channel is reliable and ordered.
    Section 3.4: the ref to JGroup (not JGroups (with an 's')) is wrong.
    Thanks we will fix it.
    Section 3.4: "[...] JGroups abstractions are not very well suited to J2EE clustered applications". Can you elaborate ? Otherwise you should not include such a statement (which is wrong IMO).
    You are right, a link to section 6 is missing.
    Section 6: "[...] JGroups channel requires 28 threads [...]".This is not correct.
    Just read the text you have skipped from the sentence: "... using the default configuration stack, JGroups channel requires 28 threads ...". This is correct.
    You can configure a stack to have 2 threads per protocol, or none, or 1. So if you look into the JGroups/conf dir, we can have 28 threads (default.xml) or only 3 threads (fc-fast.xml). JGroups allows you to configure this individually per protocol, allowing for optimal configuration.
    Then, why did you not use fc-fast.xml for the default stack?
    With respect to n-n performance: I have seen different results in my preliminary tests, actually on my home laptop (all 4 nodes on the same machine), I got 938 1K msgs/sec.
    We never obtained meaningful results when doing tests locally on a machine (single or dual cpu). It is very hard to profile (test application, client handlers, ...).
    Anyway, I would expect to get much better performance with n-n and IP multicast than with n-n and TCP.This is what I have seen before, but your results contradict this. I'm curious to see what my own perf experiments will say.
    We first expected the same behavior, and we do observe it for 1-n communications, but it turns out that IP multicast performance is terrible with collisions and TCP flow control is way better in these conditions. This was also quite a surprise for us. Also, point-to-point flow control seems to be better than group flow control, probably because you have to align the group on the slowest machine. If you want to do flow control per connection, then the kernel TCP flow control is better than what we can do in Java on top of UDP.

    Thanks a lot for your comments, don't hesitate to share your thoughts.
    Emmanuel
  48. Totem Plug

    This is a benchmark I did with Totem (total ordering, multicasting):

    http://www.evs4j.org/evs4j-0_9_1/BENCHMARK.txt

    Quick summary:

    2 nodes
    1450-byte message payload
    6000 messages/sec throughput
    1.5 ms average latency
  49. Totem Plug

    This is a benchmark I did with Totem (total ordering, multicasting):http://www.evs4j.org/evs4j-0_9_1/BENCHMARK.txt
    I cannot access the link. Is it correct or the link temporarily down?

    Emmanuel
  50. Totem Plug

    This is a benchmark I did with Totem (total ordering, multicasting):http://www.evs4j.org/evs4j-0_9_1/BENCHMARK.txt
    I cannot access the link. Is it correct or the link temporarily down?Emmanuel
    It's down - I have to yell at the RoadRunner people ..
  51. Totem Plug

    Since my site is down I am pasting part of that page here:
    This is how you run the test application:

    ./benchmark.sh <node-list>

    The <node-list> is a list of integers, usually 1, 2, etc .. which serve as the
    names of the nodes. You can run more than one totem thread in the jvm if you
    only have one machine, but for benchmarking you really should use one node per
    machine.

    For example, in my LAN I usually run the benchmark using nodes '1' and '2' on
    different machines.

    machine-a$ ./benchmark.sh 1
    machine-b$ ./benchmark.sh 2

    Note that Totem is only as fast as your slowest machine, so if you are looking
    for speed use identical machines.

    This is a sample output from the first machine:

    New session: [4294967299 {1}]
    New session: [4294967301 {1}]
        throughput = 4,690
     rotation time = 0
            window = 30
    retransmission = 0

    New session: [4294967303 {1, 2}]
    New session: [4294967305 {1, 2}]
        throughput = 5,926
     rotation time = 3
            window = 30
    retransmission = 0

        throughput = 5,851
     rotation time = 3
            window = 30
    retransmission = 0

        throughput = 6,026
     rotation time = 3
            window = 30
    retransmission = 0

    ...

    As you can see processor 1 first forms a ring by itself, and achieves a
    thruput of 4,690 packets per second. Shortly after that, node 2 comes up,
    and they form a new ring, reaching a higher thruput of about 6,000 packets
    per second.

    Class PacketTest uses two threads: one to continuously enqueue messages to be
    sent out, and another thread to run the Totem protocol itself. The payload
    of the messages is 1450 bytes.

    It is interesting to see what the network utilization is. In my case we get:

    6,026 * 1450 * 8 (bit/s)

    That's about 70Mbps, or about 9MB/s.

    This was on the following hardware:

    Machine 1:
    P4 1.4Ghz, Intel 850 chipset, 400Mhz FSB, 256Mb RDRAM (Rambus memory)
    Intel EtherExpress 10/100, e100 driver, 82557 chipset (an old nic)
    Linux 2.4.*, J2SE 1.4.2_03 (Sun - Java HotSpot Client VM)

    Machine 2: same as machine 1 (again, the slowest machine determines the
    thruput.)

    [..]