Optimizing WildFire, An Exercise in Performance

Discussions


  1. Gaston Dombiak and Matt Tucker have put together a great article that highlights their experiences performance-tuning WildFire, an XMPP server. In the article they cover both the highs and the lows as they profile and fix the performance bugs they ran into.

    In addition to a practical demonstration of a performance tuning exercise, they also offer some great advice on when and when not to optimize. They also provide some insight on how to interpret the data produced by profiling tools.
    • Never optimize before it's needed.
      We always build code to be simple and easy to maintain. We only optimize the code when it becomes specifically clear that it's necessary.
    • Clear code is better than a 0.5% performance improvement.
      We don't apply a minor optimization when it will make the code much harder to understand or maintain. Over-optimization is an easy trap to fall into.
    • Profilers can lie.
      Profilers can sometimes give false-positives for hotspots, which can be a really tricky problem.

    Is what Gaston and Matt found to be typical of your experiences? If not, what kind of performance improvements do you typically see and where do you find your performance boosts coming from?

    Threaded Messages (28)

  2. architecture factors and decisions?

    This looks cool but I would like to know if they tried exploring other options and what factors led them to this particular architecture.

    For an IM-type application which expects 10k+ clients, there could be more options:

    a) going P2P like JXTA and avoiding connections to the server as much as possible.

    b) or use the Async IO solution from IBM alphaWorks -- that can scale to 10k+ sockets easily. I could be wrong, but even plain old NIO can achieve this to some extent.
  3. architecture factors and decisions?

    This looks cool but I would like to know if they tried exploring other options and what factors led them to this particular architecture.

    The particular architecture choice is obviously the biggest contributor to overall performance. One architecture can out-perform another by 10X or 100X, whereas with profiling and optimizing you're doing well to get a 2X or 3X improvement.

    However, at some point you have to optimize the architecture you have. :) It's great to see someone sharing their experiences doing that!

    James
  4. re: architecture factors and decisions?

    Our aim is to be able to support 100K connections in the next year. The only way to do that is through connection aggregation. In other words, multiple slave servers will each handle around 10K client connections (using NIO, Asynch IO, etc.). Each slave server then connects to a single central server, and all client packets are aggregated over this link. This architecture is what we're working on for the Pampero project:

    http://www.jivesoftware.org/community/entry!default.jspa?categoryID=17&externalID=422

    One consequence of Pampero is that the central server has to be able to process packets *really* fast. That's what we were focusing on for this round of optimization.
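    The slave-side aggregation Matt describes could be sketched roughly as follows (a hypothetical sketch, not the actual Pampero code; the class name and the forwarding stub are invented). One selector thread multiplexes many client sockets and forwards whatever it reads toward the single central link:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// Hypothetical sketch (not the actual Pampero code): a slave server that
// multiplexes many client sockets on one selector thread and forwards
// packets toward a single aggregated link to the central server.
public class SlaveServerSketch {
    private final Selector selector;
    private final ServerSocketChannel server;
    private final ByteBuffer buf = ByteBuffer.allocate(4096);

    public SlaveServerSketch(int port) throws IOException {
        selector = Selector.open();
        server = ServerSocketChannel.open();
        server.socket().bind(new InetSocketAddress(port)); // 0 = ephemeral
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
    }

    public int port() {
        return server.socket().getLocalPort();
    }

    /** One pass over ready keys; a real server would loop on select(). */
    public void pump() throws IOException {
        selector.selectNow();
        Iterator<SelectionKey> it = selector.selectedKeys().iterator();
        while (it.hasNext()) {
            SelectionKey key = it.next();
            it.remove();
            if (key.isAcceptable()) {
                SocketChannel client = server.accept();
                if (client != null) {
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                }
            } else if (key.isReadable()) {
                SocketChannel client = (SocketChannel) key.channel();
                buf.clear();
                if (client.read(buf) < 0) { // peer went away
                    key.cancel();
                    client.close();
                } else {
                    buf.flip();
                    forwardToCentralServer(buf);
                }
            }
        }
    }

    private void forwardToCentralServer(ByteBuffer packet) {
        // In a real slave server this would write the packet, tagged with
        // a session id, onto the single connection to the central server.
    }

    public void close() throws IOException {
        server.close();
        selector.close();
    }
}
```

    A real implementation would also need write-interest handling and per-session state, but the point is that one thread can own 10K sockets this way instead of one thread per connection.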

    Of course, Pampero isn't the final evolution of the architecture. We'd also like to support clustering of central servers using something like Coherence. This will provide fault tolerance and hopefully even better scalability.

    BTW, I'd definitely encourage everyone to check out Wildfire for use in their own organization. With Google Talk using XMPP, the protocol is very quickly gaining traction. It's a much more feature-rich and secure alternative to the public networks. The fact that Wildfire is Open Source always makes it an easy sell as well. :)

    Regards,
    Matt
  5. Dead sockets

    Because the underlying TCP/IP layer doesn't always inform Java about dead sockets, it used to be possible for the server to wait forever in socket flush or write operations.

    I have never seen this on Linux. I wouldn't be too surprised if this turned out to be an "optimization" in the Windows tcp implementation.
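    For reference, a common defensive pattern for the blocking-write hang described above (a hedged sketch, not Wildfire's actual fix; class and method names are invented): Socket.setSoTimeout() only bounds blocking reads, so a write to a dead peer can still hang forever. A watchdog timer that closes the socket from another thread turns a stuck write into an IOException instead of an indefinite block.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import java.util.Timer;
import java.util.TimerTask;

// Hedged sketch (names invented): bound a potentially-forever socket write
// with a watchdog that closes the socket if the write doesn't finish in time.
public class GuardedWrite {
    public static void writeWithTimeout(final Socket socket, byte[] data,
                                        long timeoutMillis) throws IOException {
        Timer watchdog = new Timer(true); // daemon thread
        watchdog.schedule(new TimerTask() {
            public void run() {
                // Peer never drained the data; closing unblocks the writer,
                // which then fails with an IOException instead of hanging.
                try { socket.close(); } catch (IOException ignored) {}
            }
        }, timeoutMillis);
        try {
            OutputStream out = socket.getOutputStream();
            out.write(data);
            out.flush();
        } finally {
            watchdog.cancel(); // write completed (or failed) in time; disarm
        }
    }
}
```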
  6. Dead sockets

    Remember that the Windows TCP stack is borrowed from NetBSD, so an optimised Windows server isn't actually that bad. Out of the box, though, the performance is poor, and the Windows tools are rubbish - I always miss tools like lsof, or just being able to grep netstat :(
  7. Dead sockets

    I remember seeing this behavior on both Windows and Solaris back when Java 1.3 was just released. It seemed to me then to be a bug in the Java native libraries, but I didn't delve that deeply into it. Never tried on Linux.
  8. Exceptions?

    Don't throw exceptions for non-exceptional behavior.

    Did somebody understand how much cpu was saved by doing this? I find it unbelievable that in a network server the bottleneck would be throwing exceptions ..
  9. Exceptions?

    Did somebody understand how much cpu was saved by doing this? I find it unbelievable that in a network server the bottleneck would be throwing exceptions ..

    Of course, that was a pretty minor optimization. However, the exception was getting thrown often enough to measurably impact performance. I also liked including that optimization in the list because we believe in general that it's not a good idea to throw exceptions for non-exceptional behavior.

    The 33% total increase in performance on a single CPU was the result of adding up all the optimizations together.

    Regards,
    Matt
  10. Exceptions?

    Did somebody understand how much cpu was saved by doing this? I find it unbelievable that in a network server the bottleneck would be throwing exceptions ..
    Of course, that was a pretty minor optimization. However, the exception was getting thrown often enough to measurably impact performance. I also liked including that optimization in the list because we believe in general that it's not a good idea to throw exceptions for non-exceptional behavior. The 33% total increase in performance on a single CPU was the result of adding up all the optimizations together. Regards, Matt

    That's a design decision. IMHO the line "don't use exceptions for things that happen all the time" is not so justifiable. However if you want to say "don't force people to write a lot of catch blocks" I would agree with that. It all depends.

    I suppose one can measure the overhead of throwing an exception vs. returning something using a simple test. I suspect that it's small and probably no. 50 in the sequence of bottlenecks in any actual running system.
  11. Exceptions?

    Did somebody understand how much cpu was saved by doing this? I find it unbelievable that in a network server the bottleneck would be throwing exceptions ..
    Of course, that was a pretty minor optimization. However, the exception was getting thrown often enough to measurably impact performance. I also liked including that optimization in the list because we believe in general that it's not a good idea to throw exceptions for non-exceptional behavior. The 33% total increase in performance on a single CPU was the result of adding up all the optimizations together. Regards, Matt
    That's a design decision. IMHO the line "don't use exceptions for things that happen all the time" is not so justifiable.
    Have you confused run-time issues with syntax? I ask because of the references to not forcing people to have lots of catch blocks. Anyway, the line is completely justifiable at run time. Exceptions are quite disruptive to the normal flow of an application's execution stack. The disruption occurs on two primary levels: first, execution must be stopped while the VM takes a snapshot of the current stack; second, the execution engine must search for a suitable catch block and then adjust the stack and program counters before execution can continue. This is an extremely expensive operation. OTOH, having lots of catch blocks costs nothing in the execution environment. That said, it is a drag in development.

    Kirk

    www.javaperformancetuning.com
  12. Execution must be stopped!

    Exceptions are quite disruptive to the normal flow of an appliction execution stack. The disruption occurs on two primary levels. First execution must be stopped while the VM takes a snapshot of the current stack

    Execution is "stopped" every time you are between instructions. But you make it sound like 'stopping' execution is an expensive operation. It's not like it has to switch context. Fine, so it's a bit more expensive because it has to create an exception object and that exception object has to have the stack trace in it.
    and secondly the execution engine must search for a suitable catch block and then adjust the stack and program counters before execution can continue. This is an extremely expensive operation.

    Granted that I have never implemented a compiler for exceptions, but the following thoughts come to mind:

    1. The search is at most linear in the number of catch blocks ;)

    2. In 90% of cases there is only one catch block. Hence the search is linear in n=1.

    3. Since exceptions are typed even when there is a long list of catch blocks the compiler can still trim the search down to a sub-list.
    OTOH having lots of catch blocks is of no cost in the execution environment.

    I see, you mean the execution environment as opposed to the execution environment :-)
  13. Execution must be stopped!

    Granted that I have never implemented a compiler
    Compiler is not the issue
    the following thoughts come to mind: 1. The search is at most linear in the number of catch blocks ;) 2. In 90% of cases there is only one catch block. Hence the search is linear in n=1. 3. Since exceptions are typed, even when there is a long list of catch blocks the compiler can still trim the search down to a sub-list.
    How about this. I have a micro benchmark for my performance tuning course. If the method (to parse an int) terminates normally, the run time is 703ms on my PIV 3.2MHz machine. If you feed that same method corrupt data it will terminate with an exception. Run time for that case is 3750ms. Since a successful conversion is run to completion, finding an error in the data is no more expensive than completing the conversion.

    Kirk


    www.javaperformancetuning.com
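    Kirk's course code isn't shown, but a benchmark in the same spirit might look like this (a hedged reconstruction, not his actual code; absolute timings will differ by JVM and hardware):

```java
// Hedged reconstruction of the kind of micro benchmark Kirk describes:
// parse the same number of ints, once with valid input and once with
// corrupt input that makes Integer.parseInt throw on every call.
public class ParseBench {
    static long timeMillis(String input, int iterations) {
        long start = System.nanoTime();
        int parsed = 0;
        for (int i = 0; i < iterations; i++) {
            try {
                parsed += Integer.parseInt(input);
            } catch (NumberFormatException e) {
                // corrupt data: the exception path we want to measure
            }
        }
        // use 'parsed' so the JIT can't eliminate the loop entirely
        if (parsed == Integer.MIN_VALUE) System.out.println(parsed);
        return (System.nanoTime() - start) / 1000000; // millis
    }

    public static void main(String[] args) {
        int n = 1000000;
        System.out.println("valid:   " + timeMillis("12345", n) + " ms");
        System.out.println("corrupt: " + timeMillis("12x45", n) + " ms");
        // On most JVMs the corrupt run is several times slower; the exact
        // ratio varies with JVM version and hardware.
    }
}
```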
  14. Execution must be stopped!

    I have a micro benchmark for my performance tuning course. If the method (to parse an int) terminates normally the run time is 703ms on my PIV 3.2MHz machine. If you feed that same method with corrupt data it will terminate with an exception. Run time for that case is 3750ms.

    Kirk, if it takes your code 703ms to parse an int, then there's more going wrong than just an exception being thrown ;-)

    Peace,

    Cameron Purdy
    Tangosol Coherence: Clustered Shared Memory for Java
  15. Execution must be stopped!

    I have a micro benchmark for my performance tuning course. If the method (to parse an int) terminates normally the run time is 703ms on my PIV 3.2MHz machine. If you feed that same method with corrupt data it will terminate with an exception. Run time for that case is 3750ms.
    Kirk, if it takes your code 703ms to parse an int, then there's more going wrong than just an exception being thrown ;-)
    lol... Ok wise guy... you've heard of over-clocking!!!! Well I've started a new trend called under-clocking..... you caught me.. darn...


    Kirk
  16. Execution must be stopped!

    I have a micro benchmark for my performance tuning course. If the method (to parse an int) terminates normally the run time is 703ms on my PIV 3.2MHz machine. If you feed that same method with corrupt data it will terminate with an exception. Run time for that case is 3750ms.
    Kirk, if it takes your code 703ms to parse an int, then there's more going wrong than just an exception being thrown ;-) Peace, Cameron Purdy, Tangosol Coherence: Clustered Shared Memory for Java

    Cameron, I think he meant the total running time for parsing an int a large number of times. If you are being sarcastic in text do you _have_ to use an emoticon? I never understood this ...

    Enjoy the Fastest Known Total Ordering Protocol
  17. Execution must be stopped!

    Cameron, I think he meant the total running time for parsing an int a large number of times. If you are being sarcastic in text do you _have_ to use an emoticon? I never understood this ...

    It's that American dry humour ;-)
    Enjoy the Fastest Known Total Ordering Protocol

    Total ordering? OK, I'll take a look. I'm not a fan of total ordering (as discussed previously), but I'm always interested in seeing what interesting things other people are inventing / implementing. We use async reliable messaging (ordered on a point-to-point basis) in grid environments up to thousands of boxen, and something tells me that total ordering would impact that throughput pretty badly.

    Peace,

    Cameron Purdy
    Tangosol Coherence: The Java Data Grid
  18. Total ordering

    We use async reliable messaging (ordered on a point-to-point basis) in grid environments up to thousands of boxen, and something tells me that total ordering would impact that throughput pretty badly. Peace, Cameron Purdy

    It has very narrow applicability, but I finally decided to do something with it because for workloads with 100% sharing it's the logical way to go.

    I have recently found that there are adventurous IT people out there who will do anything to build the right system, so this leads me to believe that somebody will use it.
  19. Execution must be stopped!

    How about this. I have a micro benchmark for my performance tuning course. If the method (to parse an int) terminates normally the run time is 703ms on my PIV 3.2MHz machine. If you feed that same method with corrupt data it will terminate with an exception. Run time for that case is 3750ms. Since a successful conversion is run to completion, finding an error in the data is no more expensive than completing the conversion.

    What do you attribute the overhead to exactly? "Stopping execution" means nothing. Tell me memory allocation, string creation (for the stack trace), tell me something real.

    Feel free to run javap -c on this code and post the result so we have some actual code to talk about.
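    One way to get "something real" is to isolate the stack-snapshot cost directly: compare a normal exception against one whose fillInStackTrace() is overridden to do nothing (a hedged sketch, not from the thread; names invented):

```java
// Hedged sketch: isolate the stack-snapshot cost by comparing a normal
// RuntimeException against one that skips fillInStackTrace(). On typical
// JVMs the stackless version is dramatically cheaper, which suggests the
// snapshot, not the throw/catch transfer itself, dominates the cost.
public class StacklessDemo {
    static class StacklessException extends RuntimeException {
        @Override
        public synchronized Throwable fillInStackTrace() {
            return this; // skip the expensive stack snapshot
        }
    }

    static long timeThrows(boolean stackless, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            try {
                if (stackless) throw new StacklessException();
                else throw new RuntimeException();
            } catch (RuntimeException e) {
                // normal flow resumes here
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 100000;
        System.out.println("with stack trace:    " + timeThrows(false, n) + " ns");
        System.out.println("without stack trace: " + timeThrows(true, n) + " ns");
    }
}
```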
  20. Execution must be stopped!

    How about this. I have a micro benchmark for my performance tuning course. If the method (to parse an int) terminates normally the run time is 703ms on my PIV 3.2MHz machine. If you feed that same method with corrupt data it will terminate with an exception. Run time for that case is 3750ms. Since a successful conversion is run to completion, finding an error in the data is no more expensive than completing the conversion.
    What do you attribute the overhead to exactly? "Stopping execution" means nothing. Tell me memory allocation, string creation (for the stack trace), tell me something real. Feel free to run javap -c on this code and post the result so we have some actual code to talk about.

    May I suggest that you don't trust my results or code, and please construct your own benchmark. Conditions change with each new JVM, and consequently we must continuously question micro performance tips such as this one.

    Kirk


    www.javaperformancetuning.com
  21. Execution must be stopped!

    May I suggest that you don't trust my results or code, and please construct your own benchmark. Conditions change with each new JVM, and consequently we must continuously question micro performance tips such as this one. Kirk, www.javaperformancetuning.com

    Sounds good.

    I still feel that people should optimize one bottleneck at a time.
  22. Useful website. Thanks for putting it together.
  23. Execution must be stopped!

    Why not time it? I just wrote a little test harness to measure the difference between calling a method which returns true or false, vs. a method which throws or doesn't.

    Calling that method 1000000 times, I found that the version where I returned either true or false (depending on Random.nextBoolean) ran 14 times faster than the version where I either threw a RuntimeException, or didn't, depending on the same criterion. All other things being equal, it's a performance loss to use an exception to represent normal logic flow, and it's easy to prove.
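    The harness itself isn't posted, but a reconstruction of what's described might look like this (hypothetical code; the 14x ratio will vary by JVM and hardware):

```java
import java.util.Random;

// Hedged reconstruction of the harness described above (not the poster's
// actual code): signal a coin flip either by return value or by exception,
// using the same seed so both loops see identical flips.
public class FlowBench {
    static boolean byReturn(Random r) {
        return r.nextBoolean();
    }

    static void byException(Random r) {
        if (r.nextBoolean()) throw new RuntimeException();
    }

    static long[] run(int iterations, long seed) {
        Random r1 = new Random(seed), r2 = new Random(seed);
        int trues = 0, throwsSeen = 0;

        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            if (byReturn(r1)) trues++;
        }
        long returnNanos = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            try { byException(r2); } catch (RuntimeException e) { throwsSeen++; }
        }
        long throwNanos = System.nanoTime() - t1;

        // Same seed, so both loops saw the same sequence of coin flips.
        if (trues != throwsSeen) throw new AssertionError();
        return new long[] { returnNanos, throwNanos };
    }

    public static void main(String[] args) {
        long[] t = run(1000000, 42L);
        System.out.println("return:    " + t[0] / 1000000 + " ms");
        System.out.println("exception: " + t[1] / 1000000 + " ms");
    }
}
```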
  24. Exceptions?

    Isn't this an issue you would regard as a code smell and refactor right in the next code review? I refer to the "replace error code with exception" refactoring. Null is nothing but a "special error code", whereas an exception is a clear indication that something is not going the way the code expects. The search method is required to return a route instance. If it is unable to return one, this is exceptional behavior in the context of the interface semantics.

    I would regard this improvement as prone to re-refactoring and re-optimization changes, whatever motivation you base your current code work on. And it requires client code change every time it gets modified.

    So wasn't this a violation of the 0.5% vs. clear code rule?
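    The trade-off being debated can be sketched like this (class and method names hypothetical, not Wildfire's actual routing code; Optional is from later Java versions than this thread):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the "error code vs. exception" trade-off: the
// same lookup expressed three ways.
public class RouteLookup {
    static class Route {
        final String target;
        Route(String target) { this.target = target; }
    }

    static class RouteNotFoundException extends Exception {
        RouteNotFoundException(String jid) { super("no route for " + jid); }
    }

    private final Map<String, Route> table = new HashMap<>();

    void addRoute(String jid, Route r) { table.put(jid, r); }

    // Style 1: null as a special error code -- cheap, but callers can
    // silently forget to check it.
    Route findOrNull(String jid) {
        return table.get(jid);
    }

    // Style 2: exception -- clear interface semantics, but costly if
    // misses are frequent rather than genuinely exceptional.
    Route find(String jid) throws RouteNotFoundException {
        Route r = table.get(jid);
        if (r == null) throw new RouteNotFoundException(jid);
        return r;
    }

    // Style 3: Optional -- the missing case is visible in the signature
    // without paying for a stack trace.
    Optional<Route> tryFind(String jid) {
        return Optional.ofNullable(table.get(jid));
    }
}
```

    Which style is "right" depends on how often the miss actually happens, which is exactly the point of contention in this thread.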

    Regards,
    Daniel
  25. An XMPP server should spend most of its time doing network operations or parsing XML since those are two things that can never be optimized away.

    Actually, I would have said that you _can_ optimize the parsing of XML. It sounds like this XMPP/Jabber protocol is designed for stateful, low-latency, xml-based communication with IM specifically in mind. I would guess that in this application most messages are short (as in IM).

    Probably the most efficient way of processing xml these days is to keep it represented in binary, and compile a special parser into bytecode from the schema (I think xsltc does something similar), and parse the message from the beginning as many times as needed. This is because CPUs are extremely fast, and a small message could fit in the L1 or L2 cache. Whereas if you break up a message into a tree of strings they will all be allocated in different places in memory and will lead to poor cache utilization.
  26. Actually, I would have said that you _can_ optimize the parsing of XML. It sounds like this XMPP/Jabber protocol is designed for stateful, low-latency, xml-based communication with IM specifically in mind. I would guess that in this application most messages are short (as in IM). Probably the most efficient way of processing xml these days is to keep it represented in binary, and compile a special parser into bytecode from the schema

    Yep, that's a good point. Of course, clients connecting to the server can't be expected to use binary XML -- that doesn't follow the XMPP spec. However, for Pampero we're definitely thinking about using binary XML between the slave servers and central server. There's a very interesting lib called Nux (http://dsd.lbl.gov/nux/) that looks like it has a very efficient binary XML format.

    -Matt
  27. Yep, that's a good point. Of course, clients connecting to the server can't be expected to use binary XML -- that doesn't follow the XMPP spec.

    I didn't mean that you should switch to a binary xml format. I meant that the format stays the same, but you don't create a tree of objects when you parse. To read the value of an attribute you can express the query as XPath, which in turn can be compiled, and parse the whole document every time. The document should be quite small, I imagine.
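    The compile-once, re-parse-every-time approach can be sketched with the standard javax.xml.xpath API (a hedged sketch; the stanza shape and class name are hypothetical):

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Sketch of the approach described above: compile the XPath query once,
// then evaluate it directly against the raw message each time instead of
// building and retaining a DOM tree per message.
public class CompiledQuery {
    private final XPathExpression toAttr;

    public CompiledQuery() throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        toAttr = xpath.compile("/message/@to"); // compiled once, reused
    }

    /** Re-parses the (small) document on every call; no tree is kept. */
    public String recipientOf(String stanza) throws XPathExpressionException {
        return toAttr.evaluate(new InputSource(new StringReader(stanza)));
    }
}
```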
    However, for Pampero we're definitely thinking about using binary XML between the slave servers and central server.

    What do the slave servers do? You don't mean primary-backup replication?
  28. You _can_ optimize xml processing

    I would do a couple of things I think, particularly if you're planning on a binary protocol from the edge servers to the core.

    One, I would think it would be straightforward to write a dedicated parser for the "common case" packets.

    I don't know what that structure is, but I'm guessing that if it's handling packets like any other IM system, a VAST majority of the packets are basically "the same", i.e. a simple message.

    I understand that the protocol is extensible, but that doesn't mean that the common case isn't just insanely common.

    If the simple parser detects something amiss from its expected simple format, then it can punt to a more general parser to handle everything else. Yes, it would probably be doing redundant work, but, again, that would be the exception and not the rule.

    Next, I would make the common case parse directly into the binary format object, and I would make that format object as simple as possible. I might even go as far as not having Java Strings created until after the packet has been pushed down the pipe to the central server. Just send back char or byte arrays with a header giving the offsets to the fields. And I would make the common case "look" just like the general case, that is, something that extends the base packet form and behaves identically. The back-end server is going to have to build those Strings for you no matter what, but your edge servers may not have to.

    Mind, there's no reason it is actually STRUCTURED internally like a normal packet with pointers and nodes and other things. For example, you mention you use a DOM. The common case wouldn't be a DOM at all, it would simply look like one. I imagine the common case is actually a very flat tree (or, perhaps, a tall tree with few or no branches).

    Simply, your common case will have a fairly consistent and basic path through your system, so there's no reason not to car pool them and put them in the diamond lane, save that it can complicate the system for the special case. But truth be told, the rest of the packets are the "special case", rather than the common message packet.
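    The fast-path-with-punt idea might be sketched like this (a hypothetical sketch; the packet shape and class names are invented, and a real scan would work on the raw chars instead of making a String copy):

```java
// Hypothetical sketch of a dedicated fast-path parser: try a cheap scan
// for the overwhelmingly common simple-message shape, and punt to the
// general XML parser (by returning null) the moment anything looks
// unusual. The fast path records offsets into the raw chars instead of
// building Strings or a DOM.
public class FastPathParser {
    /** Offsets into the original buffer; Strings built only on demand. */
    public static class SimpleMessage {
        public final char[] raw;
        public final int toStart, toEnd, bodyStart, bodyEnd;
        SimpleMessage(char[] raw, int toStart, int toEnd,
                      int bodyStart, int bodyEnd) {
            this.raw = raw; this.toStart = toStart; this.toEnd = toEnd;
            this.bodyStart = bodyStart; this.bodyEnd = bodyEnd;
        }
        public String to()   { return new String(raw, toStart, toEnd - toStart); }
        public String body() { return new String(raw, bodyStart, bodyEnd - bodyStart); }
    }

    /** Returns null when the packet isn't the expected simple shape. */
    public static SimpleMessage tryFastPath(char[] raw) {
        String s = new String(raw); // sketch shortcut; avoids nothing, a real scan works on raw
        String p1 = "<message to='", p2 = "'><body>", p3 = "</body></message>";
        if (!s.startsWith(p1) || !s.endsWith(p3)) return null; // punt
        int toStart = p1.length();
        int toEnd = s.indexOf(p2, toStart);
        if (toEnd < 0) return null;                            // punt
        int bodyStart = toEnd + p2.length();
        int bodyEnd = s.length() - p3.length();
        if (bodyEnd < bodyStart) return null;                  // punt
        return new SimpleMessage(raw, toStart, toEnd, bodyStart, bodyEnd);
    }
}
```

    Anything the fast path rejects falls through to the general parser, so correctness never depends on the shortcut.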
  29. Looks like there are N frontends and a single shared backend master. So the goal would be to reduce the load on the master, and a binary XML approach would do that nicely.

    With small messages you should easily be able to parse well beyond 100000 msg/s (measured here), especially when reusing the parser object (e.g. via a SoftReference based ThreadLocal). Looks like that would be well beyond the target requirements.
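    The SoftReference-based ThreadLocal reuse mentioned above can be sketched with a plain JAXP DocumentBuilder (a hedged sketch; Nux has its own parser pooling, and the class name here is invented): each thread keeps one softly-referenced parser, so it is reused across messages but can still be reclaimed under memory pressure.

```java
import java.lang.ref.SoftReference;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

// Sketch of SoftReference-based ThreadLocal parser reuse: avoid creating
// a parser per message without pinning one parser per thread forever.
public class ParserCache {
    private static final ThreadLocal<SoftReference<DocumentBuilder>> CACHE =
            new ThreadLocal<SoftReference<DocumentBuilder>>();

    public static DocumentBuilder get() throws ParserConfigurationException {
        SoftReference<DocumentBuilder> ref = CACHE.get();
        DocumentBuilder builder = (ref == null) ? null : ref.get();
        if (builder == null) { // first use, or reclaimed by the GC
            builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            CACHE.set(new SoftReference<DocumentBuilder>(builder));
        } else {
            builder.reset(); // clear state left from the previous parse
        }
        return builder;
    }
}
```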

    If that's not what you're seeing, you're probably not using the API in an optimal way, in which case I can help you cleanly fix that provided an example snippet and profiler output.

    If the frontends are still too slow wrt. parsing standard XML, you could experiment with the BuilderPool using xerces-2.7.1 or the StaxBuilder using Woodstox.

    Wolfgang.