
News: Featured Article: Dispelling NIO Myths

  1. Featured Article: Dispelling NIO Myths (38 messages)

    Mike Spille has moved on from XA and is now enjoying spending time with NIO. He is writing EmberIO, an open source library built on NIO, and he is trying to dispel some of the NIO myths.

    Excerpt
    The biggest bang for the buck you'll get out of EmberIO is in the buffer management, the built-in support for a variety of threading models, and generally getting rid of the pain usually associated with NIO. Also keep in mind that while one of the initial motivations behind EmberIO is to make using NIO easier, it is being expanded to handle non-NIO I/O resources as well. EmberIO is moving towards being a general low-level I/O sub-system that puts a single consistent interface on top of disparate I/O resources and threading models.
    Read Mike Spille in EmberIO - Dispelling NIO Myths

    Threaded Messages (38)

  2. I thought it might be worth posting a note clarifying the acronyms here, since it wasn't necessarily clear unless one is already familiar with the context:

    NIO = the Java "New I/O" subsystem

    The NIO package added to Java 1.4 lets Java interface with native I/O facilities directly, without the overhead of the double buffering of traditional Java I/O (data in Java memory being recopied into native memory).
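    As a concrete illustration (a minimal sketch, not from the thread), the distinction shows up directly in the ByteBuffer API: a heap buffer lives in the Java heap and must be copied into native memory for I/O, while a direct buffer is allocated outside the heap so the OS can use it as-is.

```java
import java.nio.ByteBuffer;

public class BufferKinds {
    // Heap buffer: backed by a byte[] inside the JVM heap.
    public static ByteBuffer heap(int size) {
        return ByteBuffer.allocate(size);
    }

    // Direct buffer: allocated outside the JVM heap, closer to the I/O devices.
    public static ByteBuffer direct(int size) {
        return ByteBuffer.allocateDirect(size);
    }
}
```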
  3. Cameron P. says NIO is slow...

    I'm now a little bit confused when I remember Cameron P.'s experience with NIO.
    Did you experience the same?

    Laurent.
  4. Cameron P. says NIO is slow...

    Hi Laurent,

    Just to be clear, I was comparing in-memory speed between Java memory access (e.g. primitive arrays) and NIO buffers. I was thinking more about video buffering tasks and things like that, not "actual" IO. What Mike is looking at is highly concurrent IO tasks (throughput) while I was looking at single-threaded latencies (raw performance).

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  5. Cameron P. says NIO is slow...

    To expand a bit on what Cameron is saying....

    One of NIO's features is direct ByteBuffers. "Direct" means that these buffers are allocated outside of the JVM memory space, and are intended to be closer to the I/O devices you're trying to deal with. The idea is that you read from/write to memory that these devices can see directly (with the help of the operating system virtual memory system) and avoid unnecessary copying of data.

    Cameron tested _only_ NIO direct ByteBuffers, and compared them to plain old byte arrays. That is, he compared the write and access times of:

        ByteBuffer buf = ByteBuffer.allocateDirect(size);

    vs.

        byte[] myArray = new byte[size];

    The interesting twist, and one that shows just how unique a thinker Cameron is, is that his tests didn't do any I/O :-) All manipulation was in-memory. What he found is that writing to/reading from direct ByteBuffers was slower than writing to/reading from byte arrays in Java.
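    Cameron's actual benchmark isn't reproduced here, but a toy version of the same in-memory comparison (all names and sizes made up, and with none of the warm-up care a real benchmark needs) might look like:

```java
import java.nio.ByteBuffer;

public class InMemoryCompare {
    // Writes n bytes into a plain array and into a direct buffer, and
    // returns the two elapsed times in nanoseconds: { arrayTime, directTime }.
    public static long[] time(int n) {
        byte[] arr = new byte[n];
        ByteBuffer direct = ByteBuffer.allocateDirect(n);

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) arr[i] = (byte) i;       // plain byte[] writes
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) direct.put(i, (byte) i); // absolute direct-buffer writes
        long t2 = System.nanoTime();

        return new long[] { t1 - t0, t2 - t1 };
    }
}
```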

    In my own work, I differ from Cameron in that I'm actually doing I/O - socket I/O in this case. One of the goals of EmberIO is to show that NIO can approach old-style blocking I/O (BIO) in terms of latency, by being smart about how it deals with NIO resources like Selectors and how it deals with endpoints like sockets. And while doing that, being much more configurable and potentially more scalable than BIO can be.

        -Mike
    I've posted this in the past, but I ran into scalability problems with Object streams when writing a custom transport for JBoss. It boils down to some static variables being synchronized within java.io.ObjectStreamClass

        /** cache mapping local classes -> descriptors */
        private static final SoftCache localDescs = new SoftCache(10);
        /** cache mapping field group/local desc pairs -> field reflectors */
        private static final SoftCache reflectors = new SoftCache(10);

    I jumped through some hoops to change the use of SoftCache to a ConcurrentReaderHashMap, and throughput increased 20-40% depending on how many threads you run through it. It is not an issue if you're just marshalling primitives, but it becomes an issue, of course, if you're marshalling objects. I have the code for this, but it is a derivative of JDK 1.4 source, which means it is illegal to distribute. Email me if you want it. I'm interested in any legal work-around if you have one.
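    Bill's actual patch can't be shown since it derives from JDK source, but the general pattern he describes - replacing a globally locked cache with a concurrent map - can be sketched with the modern java.util.concurrent descendant of ConcurrentReaderHashMap (DescriptorCache and its contents are hypothetical, not the real ObjectStreamClass internals):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DescriptorCache {
    // Hypothetical stand-in for ObjectStreamClass's synchronized SoftCache:
    // a concurrent map lets many marshalling threads look up descriptors
    // without serializing on a single static lock.
    private static final Map<Class<?>, String> DESCS = new ConcurrentHashMap<>();

    public static String descriptorFor(Class<?> c) {
        // Compute once per class; subsequent lookups are lock-free reads.
        return DESCS.computeIfAbsent(c, Class::getName);
    }
}
```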

    I'd also be interested in incorporating some of your research into a JBoss transport. Email if you're interested and I'll point you to the appropriate example code.

    Bill
  7. I'd also be interested in incorporating some of your research into a JBoss transport. Email if you're interested and I'll point you to the appropriate example code.Bill
    Wow! Mike will get a new job, soon! :-)
  8. Could you send patch to me?

    Email if you're interested and I'll point you to the appropriate example code.
    Hi, we are seeing the same issue in 1.4.2 with no potential for an upgrade. The commercial product we are using will not move to Java 5 until the second half of 2008. Could you send that patch to me? email (remove NOSPAM) ext-michael.1.shambergerNOSPAM@nokia.com
  9. Cameron P. says NIO is slow...

    One of NIO's features is direct ByteBuffers. "Direct" means that these buffers are allocated outside of the JVM memory space, and are intended to be closer to the I/O devices you're trying to deal with. The idea is that you read from/write to memory that these devices can see directly (with the help of the operating system virtual memory system) and avoid unnecessary copying of data.
    It's also worth noting that the performance improvement you get using direct buffers, selectors etc. can vary from one version of the JDK to another and from one platform to another.

    This can happen because:

    (1) The JVM hasn't got support in depth for the particular platform's I/O layer and is falling back to generic mechanisms. If this happens, you can end up with less performance than standard I/O simply because nio is being emulated on top of standard I/O which is pure overhead. I believe this has been the case with some of the buffer implementations in the past.

    (2) The OS doesn't provide the appropriate features to make this work.

    (3) The JVM uses the "standard" API's for implementing select and the like which aren't the most scalable. For example, linux has a number of means by which to implement selectors and the Blackdown team saw fit to modify SUN's JDK 1.4.2 for Linux as follows: "Added an epoll based java.nio.channels.Selector implementation which provides better scalability than the default poll based implementation."
  10. Cameron P. says NIO is slow...

    Dan - yep, you're correct. One of the reasons for EmberIO's various I/O models and what I call "strategies" is to allow developers to configure it to match both their basic I/O needs and also the realities of the underlying JVM. EmberIO lets you transparently configure "N" selectors, and has a number of configuration options that can help maximize how much work can be done w/out the intervention of the Selector. These sorts of options can make your I/O work very efficiently even on a JVM implementation that's not so hot in the NIO department - and still get the benefits of non-blocking and configurable threading. It's not a universal panacea - I can't do much about things like the buffer implementation, for example - but in my testing it's good enough in many scenarios to make NIO look attractive again.

    As a side benefit, you can use EmberIO to directly compare blocking BIO style semantics with aggressive NIO non-blocking ones solely by changing your configuration, not your code. You can set up separate tests in BIO and NIO styles and do a more apples-to-apples comparison with your own application to see which works best in your environment for latency/throughput/resource consumption/etc.

    This approach also means you can deploy on environments where the Selector or non-blocking semantics are poor by using a BIO configuration instead, with dedicated threads per socket. And that you can then switch it around to use aggressive NIO for environments where NIO is better implemented.

    As I said, this isn't a 100% comparison to pure BIO as you would do it in Java 1.3 or prior - there are still channels and ByteBuffers involved. But still, all in all, this approach allows for much more realistic testing of the benefits of BIO vs. NIO for your own code in your own environment. If nothing else, you can isolate the non-blocking and Selector aspects of NIO and see what effect they have on your app when they're used, and when they're taken away.

    In this respect, EmberIO isn't so much about targeting NIO as it is about trying to unify I/O models from a developer's perspective and allowing tuning of the model via configuration at deployment time.

        -Mike
  11. Cameron P. says NIO is slow...

    I differ from Cameron in that I'm actually doing I/O - socket I/O in this case. One of the goals of EmberIO is to show that NIO can approach old-style blocking I/O (BIO) in terms of latency, by being smart about how it deals with NIO resources like Selectors
    Have a look at:
    http://reattore.sourceforge.net/

    It uses socket channels and selectors, and the performance is very good. Michael Hope, who wrote Reattore, did it mainly for experimental purposes, but I think Michael did a good job with it. The code is clean and well designed.
  12. Help with serialization

    I found one big problem with NIO. The interface is pretty fine for me; however, I found a problem when serializing objects over the network.
    If you want to register sockets with selectors, they have to be in non-blocking mode. If you want to read a serialized object, you need a blocking socket. To do this you have to deregister the socket from the selector, and that is a most expensive operation.
    Can I escape somehow from this trap?
  13. Help with serialization

    Wouldn't it be a solution to serialize your objects to a byte[] blob via ByteArrayOutputStream, wrap that array in a Buffer and transfer it?
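    Hendrik's suggestion, sketched (SerializeToBuffer is a made-up name, and error handling is simplified to keep the sketch short):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

public class SerializeToBuffer {
    // Serialize obj to a byte[] blob via ByteArrayOutputStream, then wrap
    // the array in a (heap) ByteBuffer ready for SocketChannel.write(buf).
    public static ByteBuffer toBuffer(Serializable obj) {
        try {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
                oos.writeObject(obj);
            }
            return ByteBuffer.wrap(baos.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```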
  14. Help with serialization

    The primary problem is with end detection. No bytes available doesn't mean end of stream, especially on slow networks. You can avoid it by sending the length at the beginning and then waiting until you receive all the bytes.
    However, I dislike this solution because of space. Some objects are really big (50-100 kB) and I have to duplicate them in memory when I allocate the byte array into which they are serialized - especially when you don't know the size up front and many byte array reallocations occur. It's a very dirty workaround for simple streaming.
    The performance boost gained via NIO is lost, and you might as well use standard I/O with blocking sockets, the available() call, and simple streaming.
  15. Help with serialization

    My solution for this is writing a new class that inherits from the OutputStream abstract class. In my case I named it ByteBufferArrayOutputStream. It is similar to ByteArrayOutputStream, only it uses ByteBuffer instead of a primitive array as the buffer. To manage the ByteBuffer objects that this class uses, I wrote another class called ByteBufferPool that acts as a pool of ByteBuffer instances, from which you can retrieve new ByteBuffer objects (either direct or non-direct) as necessary and then return them for future use. As you all know, the overhead of creating a direct ByteBuffer object is high; that's why an object pool is really necessary. I can then use ObjectOutputStream with my ByteBufferArrayOutputStream object as the underlying output stream. You can call the toByteBufferArray() method of ByteBufferArrayOutputStream to create an array containing the ByteBuffer objects, and then call the write(ByteBuffer[]) method of SocketChannel to send it over the network.
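    A stripped-down sketch of the idea (without the ByteBufferPool, and allocating heap buffers directly, so it's illustrative only):

```java
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class ByteBufferArrayOutputStream extends OutputStream {
    // Grows by appending fixed-size ByteBuffers; the original takes
    // them from a pool instead of allocating fresh ones.
    private final List<ByteBuffer> buffers = new ArrayList<>();
    private final int chunkSize;

    public ByteBufferArrayOutputStream(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    @Override
    public void write(int b) {
        if (buffers.isEmpty() || !buffers.get(buffers.size() - 1).hasRemaining()) {
            buffers.add(ByteBuffer.allocate(chunkSize)); // start a new chunk
        }
        buffers.get(buffers.size() - 1).put((byte) b);
    }

    // Flips every buffer so the array is ready for SocketChannel.write(ByteBuffer[]).
    public ByteBuffer[] toByteBufferArray() {
        ByteBuffer[] out = buffers.toArray(new ByteBuffer[0]);
        for (ByteBuffer b : out) b.flip();
        return out;
    }
}
```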

    I will be glad to share the code with anyone who’s interested. Just drop me an email. Sorry that I don’t have a website to put the code in.
  16. Help with serialization

    What about reading from socket?
  17. Help with serialization

    Similarly, for reading from the socket I wrote a class that inherits from the InputStream abstract class. I named it ByteBufferLinkedListInputStream. The constructor of this class requires you to pass the LinkedList instance where you put the ByteBuffer objects containing the packets you've read from the network using the read(ByteBuffer) method of SocketChannel. Then you can use ObjectInputStream with ByteBufferLinkedListInputStream as the underlying input stream.
    I use a LinkedList instead of a primitive array of ByteBuffer to increase the reuse of the ByteBufferLinkedListInputStream object; thus you don't have to create a new input stream instance every time a new set of ByteBuffer objects comes from the socket. Instead, you just add the new ByteBuffer objects containing the new packets to the LinkedList, and ByteBufferLinkedListInputStream reads them in FIFO order. When all bytes are read from one ByteBuffer object, ByteBufferLinkedListInputStream returns the ByteBuffer object to the pool.
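    The reading side can be sketched similarly (a minimal version without the pool; ByteBufferQueueInputStream is a made-up name, and a real implementation would need to handle the no-data-yet case rather than returning -1):

```java
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.LinkedList;

public class ByteBufferQueueInputStream extends InputStream {
    // FIFO of buffers filled by SocketChannel.read(ByteBuffer) elsewhere.
    private final LinkedList<ByteBuffer> queue;

    public ByteBufferQueueInputStream(LinkedList<ByteBuffer> queue) {
        this.queue = queue;
    }

    @Override
    public int read() {
        // Drop fully drained buffers (the original returns them to a pool here).
        while (!queue.isEmpty() && !queue.peek().hasRemaining()) {
            queue.poll();
        }
        if (queue.isEmpty()) {
            return -1; // no data buffered yet; real code would block or park
        }
        return queue.peek().get() & 0xFF;
    }
}
```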

    I hope it helps.. :)
  18. Help with serialization

    OK, I got your idea about reimplementing/simulating an InputStream for ObjectInputStream. However, I still don't understand how you detect the end of the bytes from the input socket. You can easily get into a situation where ObjectInputStream wants to read data which you don't have yet. Should I somehow simulate a blocking read for ObjectInputStream via wait/notify, which will add some more complexity, or am I misunderstanding something?
  19. Help with serialization

    The idea for getting this to work is to inject some protocol control information around what the streams are doing. The EmberIO support for serialization is weak at this point, but functional, and what it does on writing is:

       - Creates a ByteBufferOutputStream, which is backed by a ByteBuffer.
         This stream is referred to below as bbos.
       - Writes a header into the underlying ByteBuffer, called byteBuffer here:

             - byteBuffer.putInt (Packet.MAGIC);
             - byteBuffer.putInt (0); // Place holder for length!
       - ObjectOutputStream oos = new ObjectOutputStream (bbos);
       - oos.writeObject (obj);
       - oos.flush();
       - byteBuffer.putInt (4, byteBuffer.position() - 8);

    So I write a "magic number" to indicate that the right sort of thing is talking to us (and to detect errors), then a place holder int of 0, then instantiate an ObjectOutputStream using a ByteBufferOutputStream which under the covers uses the same ByteBuffer. When you call writeObject() on that OOS, it's using the underlying ByteBuffer - which is at position 8 relative to the beginning of the buffer. When the writeObject() call completes, I calculate the size of the data written and store that back into the second int. Then I blam this over the wire. So what goes over the wire is:

       - MAGIC Number
       - Length
       - Object bytes

    On the reading end, you read the magic # and the length. Then you make sure you have a bytebuffer of the appropriate size and set its limit() to be length + 8 (8 being the header size). When you've read up to that limit, you know you have all of the object and can then use something like my ByteBufferObjectInputStream with an ObjectInputStream on top of it to get your object back via readObject().
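    Both ends of this framing scheme can be sketched in plain Java (this uses a ByteArrayOutputStream rather than EmberIO's ByteBufferOutputStream, and the MAGIC value is invented for the example):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

public class Framing {
    static final int MAGIC = 0xCAFEBABE; // hypothetical; stands in for Packet.MAGIC
    static final int HEADER = 8;         // magic int + length int

    // Write side: MAGIC, length placeholder filled in after serialization.
    public static ByteBuffer frame(Serializable obj) {
        try {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
                oos.writeObject(obj);
            }
            byte[] body = baos.toByteArray();
            ByteBuffer buf = ByteBuffer.allocate(HEADER + body.length);
            buf.putInt(MAGIC).putInt(body.length).put(body);
            buf.flip();
            return buf;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read side: once header + length bytes have arrived, deserialize.
    public static Object unframe(ByteBuffer buf) {
        if (buf.getInt() != MAGIC) throw new IllegalStateException("bad magic");
        byte[] body = new byte[buf.getInt()];
        buf.get(body);
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(body))) {
            return ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```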

    The downside of this approach, as someone pointed out, is that you need an entire ByteBuffer the size of the object bytes, which is rather wasteful of memory. Ideally you'd want to chunk up your object - but this is rather difficult to do with the current ObjectOutputStream design without blocking the thread calling writeObject(). To do chunking properly, you really want an ObjectOutputStream that can do partial writes and remember its state across invocations, so that you could do partial chunked non-blocking writes. But I haven't found a good way to do that yet. If you don't mind blocking the caller, then you can do chunking with a smart stream beneath the ObjectOutputStream, but it so happens that I don't like blocking very much :-/

         -Mike
  20. Help with serialization

    Maybe you could use a chunked transfer encoding similar to the one defined by HTTP/1.1 (ftp://ftp.rfc-editor.org/in-notes/rfc2616.txt) section 3.6.1 and re-use a buffer with a fixed size. In the chunked encoding you write a length header, write a chunk of data with that very same size, write another length header etc. This way you don't have to buffer the whole serialized object graph in memory.

    There should be many implementations of this around for blocking I/O. Just grab one and adapt it to NIO.
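    A minimal sketch of such a length-prefixed chunked encoding over streams (blocking and over java.io for clarity; the NIO adaptation is left out, and Chunked is a made-up name):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class Chunked {
    // Write `in` as length-prefixed chunks, HTTP/1.1-style, re-using one
    // fixed-size buffer; a zero-length chunk terminates the stream.
    public static void write(InputStream in, DataOutputStream out, int chunkSize) {
        try {
            byte[] buf = new byte[chunkSize];
            int n;
            while ((n = in.read(buf)) > 0) {
                out.writeInt(n);       // chunk header: size of the next chunk
                out.write(buf, 0, n);  // chunk body
            }
            out.writeInt(0);           // terminating zero-length chunk
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read chunks until the zero-length terminator and reassemble the payload.
    public static byte[] readAll(DataInputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int n;
            while ((n = in.readInt()) > 0) {
                byte[] chunk = new byte[n];
                in.readFully(chunk);
                out.write(chunk);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```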
  21. Help with serialization

    Hendrik - this is the right general approach. The question is how to do it efficiently with the specific ObjectOutputStream and ObjectInputStream implementations we've got. OOS and OIS really, really want to work with streams, and don't like the concept of partial reads/writes and resumption of said reads and writes. You could easily do chunking today with an appropriate stream wrapped around OOS/OIS that did the chunking under the covers - but the caller of writeObject()/readObject() will block until the entire object gets chunked out. Yuck.

    I've considered writing a cleanroom ByteBuffer-oriented OOB/OIB (ObjectOutputBuffer and ObjectInputBuffer) but the task seems daunting, and I've never had the time.

        -Mike
  22. Help with serialization

    Wouldn't it be a solution to serialize your objects to a byte[] blob via ByteArrayOutputStream, wrap that array in a Buffer and transfer it?
    You mean wrap it in a non-direct ByteBuffer? If I'm not mistaken, when you pass a non-direct ByteBuffer to a SocketChannel instance, it will be transferred to a direct ByteBuffer first before SocketChannel really does the write operation to the network. This will increase the overhead even more. Besides, serializing the object to a byte[] blob will increase the amount of memory that needs to be reclaimed when garbage collection occurs, decreasing performance as well as scalability.

    Just my two cents..
  23. Help with serialization

    Pavel, see my other recent comments on how to do it in non-blocking mode. But I note you say "to do this you have to deregister the socket from the selector, and that is a most expensive operation".

    How do you define "expensive"? In my experience reading in a serialized Object via readObject() is an order of magnitude more expensive than cancelling a selection key.

        -Mike
  24. Help with serialization

    First of all, thanks for your answer. It helps a lot.
    I can't choose WHAT I will read, but I can choose HOW I will read it.
    I have a multithreaded server, and all this ServerSocket.accept -> Socket -> Register -> Selector -> Select -> Cancel Key -> Select -> SwitchToBlockMode -> ObjectInputStream -> ReadObject -> SwitchToNonBlock -> Register looks like a lot of work just for reading and writing objects. Why do we need this when we want to transfer objects over the net? Is Sun really so blind? Together with the threading and synchronization problems in the server, I don't see NIO as an advantage over my current blocking I/O solution.
    A last question for the experts: do you think NIO on the network really brings a visible speed improvement?
  25. Help with serialization

    Do you think NIO on the network really brings a visible speed improvement?
    I would suggest reading the following article to gain a better understanding of NIO performance:

    http://www-106.ibm.com/developerworks/java/library/j-nioserver/
  26. Help with serialization

    A last question for the experts: do you think NIO on the network really brings a visible speed improvement?
    You *may* see some speed improvement given that using nio should reduce the number of buffer copies produced etc. but the real benefits of nio (IMHO) are in respect of throughput and scaling under load.

    Much of the reason for the existence of nio is to allow people to build servers in Java similarly to those written in 'C' which attempt to keep thread count low and make as much use of non-blocking/event-driven API's as possible. It's mostly an exercise in making better use of critical resources such as CPU, memory, reducing context switching and so on.
  27. Help with serialization

    You *may* see some speed improvement given that using nio should reduce the number of buffer copies produced etc. but the real benefits of nio (IMHO) are in respect of throughput and scaling under load
    I like the general principle of direct buffers, and I think in the long term they'll be a win. To date I don't think their effect has been felt, because any gain they may bring is swamped by losses elsewhere. To an OS hacker, an extra buffer copy may seem painful and a huge source of performance slowdowns in the code. But in typical Java apps, almost everyone uses object serialization - which has gotta be at least 2 orders of magnitude more expensive than a buffer copy, if not more. So who's going to see 10 microseconds (or whatever) of benefit from a direct buffer when their serialization takes 900 microseconds? :-)
    Much of the reason for the existence of nio is to allow people to build servers in Java similarly to those written in 'C' which attempt to keep thread count low and make as much use of non-blocking/event-driven API's as possible. It's mostly an exercise in making better use of critical resources such as CPU, memory, reducing context switching and so on.
    Yep - and explained much more concisely than my attempt!

        -Mike
  28. Help with serialization

    ServerSocket.accept -> Socket -> Register -> Selector -> Select -> Cancel Key -> Select -> SwitchToBlockMode -> ObjectInputStream -> ReadObject -> SwitchToNonBlock -> Register looks like a lot of work just for reading and writing objects.
    With the solution I outlined you don't have to switch to blocking mode, it works just fine in non-blocking mode so long as you keep a little bit of state around.

    Note also that "ServerSocket.accept -> Socket" is required for any solution. Excluding that, since it's common to any solution, and folding in the fact that you can do this in non-blocking mode, we have:

        Register -> Selector -> Select -> deregister read ops -> OIS -> read

    I've elided the extra Select because that's just part of a loop :-)

    Note also that "OIS" and "read" are also common to any solution to this problem, so the extra steps for NIO are actually:

        Register -> Selector -> Select -> deregister read ops

    Why go through this rigamarole? To get non-blocking semantics. A non-blocking I/O style can lead to much more efficient resource usage, a cleaner higher level code base, and better overall throughput.

    Is this harder than the old BIO model? Yes, it is. Any asynchronous/non-blocking solution is going to be a bit harder than a blocking/synchronous one. But blocking/synchronous models rarely scale well. So the extra steps are the cost of doing business.
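    A bare-bones sketch of the non-blocking sequence above - select, then read while the channel stays registered and non-blocking (the demo() method uses an in-process Pipe in place of a real socket, and names here are made up, not EmberIO's):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class SelectLoop {
    // One pass of a minimal non-blocking dispatch: wait for readiness,
    // then drain every readable channel without switching it back to
    // blocking mode. Returns the number of bytes read into dst.
    public static int readReady(Selector selector, ByteBuffer dst) throws IOException {
        int total = 0;
        if (selector.select(500) == 0) return 0; // nothing ready within 500 ms
        for (SelectionKey key : selector.selectedKeys()) {
            if (key.isReadable()) {
                total += ((ReadableByteChannel) key.channel()).read(dst);
            }
        }
        selector.selectedKeys().clear(); // selected-key set must be cleared by hand
        return total;
    }

    // Self-contained demo: register a Pipe's source, write 3 bytes, read them.
    public static int demo() {
        try {
            Pipe pipe = Pipe.open();
            pipe.source().configureBlocking(false);
            Selector sel = Selector.open();
            pipe.source().register(sel, SelectionKey.OP_READ);
            pipe.sink().write(ByteBuffer.wrap(new byte[] {1, 2, 3}));
            return readReady(sel, ByteBuffer.allocate(16));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```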

    Please also note that in EmberIO, none of these extra steps are visible to the user of EmberIO. Whether you're in BIO or NIO mode, you'll just get a handleEvent() callback with your object (or a byte[] if you're not using Objects).
    Is Sun really so blind? Together with the threading and synchronization problems in the server, I don't see NIO as an advantage over my current blocking I/O solution.
    When you don't understand something, blame Sun :-)

    Here's an exercise for you: how well does your BIO solution work with 5,000 simultaneous socket clients? This absolutely requires 5,000 threads + overhead to do in a BIO model.

    In addition - imagine a requestor thread that needs to perform 5 "expensive" remote actions which are independent of each other, but still need to be done. For jollies let's say that each takes 100 milliseconds to perform, and that some of the data going back and forth is large enough to force some blocking. In an NIO sort of model, you can fire off all 5 actions at once, and then reap the "answers" asynchronously after firing them. The total cost of this will be 100 milliseconds + some overhead. In a typical BIO model, the cost of this will be 500 milliseconds + some overhead. You can try to model an asynchronous strategy on top of BIO - but this requires even more threads on top of the thread per connection that's the bare bones minimum.
    A last question for the experts: do you think NIO on the network really brings a visible speed improvement?
    A lot of people initially thought of NIO as something that would reduce latency - e.g. individual request and response times would go down. As it turns out, this isn't the case - NIO code typically elongates latency. Where NIO really shines is in reducing the amount of resources required to service requests, and in boosting overall throughput - e.g. scaling to handle many requests in a reasonable time frame. Now, as I mention in my blog entry, naive use of NIO can lead to really bad latency _and_ bad throughput. So part of the reason for creating EmberIO is to solve those problems. I use NIO in a very specific manner to try to minimize latency and maximize throughput. There's obviously a balance that has to be struck here, though, and that's why EmberIO has a number of different options - so that you can tune it to your own needs. Where BIO falls flat on its face is that it is inherently un-tunable. You have a thread per connection, it's in blocking mode, and whatever latency and throughput you get is what you get. There's no way to change the equation for differing needs.

    And this is also why EmberIO swings both ways :-) If you have a small number of simultaneous sockets and you want latency as low as possible, switch EmberIO into BIO mode, and you're done. If at a later time requirements change, and you suddenly expect many more simultaneous clients, or if your I/O characteristics change radically (e.g. you're suddenly writing or reading very big objects, or remote activities are taking much longer than they were previously), then you can switch EmberIO over to a non-blocking thread pooling strategy to better handle those sorts of activities - and still not do too badly in the latency department.

        -Mike
  29. Reactor

    It's a nice framework for hiding NIO complexities and asynchronous I/O event processing. It kind of reminds me of the Reactor and Connector/Acceptor/ProtocolHandler patterns; see Patterns for Concurrent, Parallel, and Distributed Systems.

    The one thing I don't like about Java NIO: there is no way to specify a timeout for an I/O handle (SelectableChannel), so you need to do it in code (as Mike does with TimeoutCB), and there is no error-related event. If a client endpoint closes a channel or crashes, it triggers a READ event.
  30. Reactor

    I haven't seen those specific references before, I'll have to check them out in detail - thanks.

    The stuff described there does look very similar to what I'm doing in EmberIO. What I've found is that the trick for making non-blocking I/O very efficient is to combine the sort of high-level designs in the link you provided with a number of low-level strategic optimizations as well. Either one alone doesn't cut it. I've seen a number of code bases that got the high-level semantics right for this sort of asynchronous/non-blocking domain, but the actual code which implemented them was too inefficient. The biggest culprits at the low level are:

         
    1. Over-reliance on the Selector: too many Selector wakeups.
    2. Related to the above, no attempt to accommodate bursty data.
    3. Code paths that are too long.
    4. Over-reliance on generalized Collections.
    #1 and #2 in particular often combine to make the handling of events way too inefficient. The Selector thread(s) are involved in too much context switching, and too much time is spent in List and Map implementations. To combat the former, I track event processing state and attempt to accommodate things like burstiness and endpoint quirks to do as much work as possible once an event has fired - this minimizes Selector context switches and associated Selector overhead. On the latter, there's a lot of optimization in EmberIO to use bit twiddling and fixed arrays, which turns expensive lookups and accesses into very fast ones.

    And on the other end many implementations don't take into account the high level aspects - in which case no amount of tactical optimization will save you :-/
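    The burstiness point (#2) can be sketched in a few lines: once a READ event fires, keep reading until the channel has nothing left, instead of bouncing back to the Selector after every single read. This is only an illustration of the tactic, not EmberIO's actual code; the method name and the StringBuilder sink are invented here.

    ```java
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.ReadableByteChannel;

    public class DrainRead {
        // After a READ event fires, keep reading until the channel runs dry,
        // rather than taking one read() and returning to the Selector.
        // Returns total bytes drained into 'sink', or -1 on immediate EOF.
        public static int drain(ReadableByteChannel ch, ByteBuffer buf, StringBuilder sink) throws IOException {
            int total = 0;
            while (true) {
                buf.clear();
                int n = ch.read(buf);
                if (n < 0) return total == 0 ? -1 : total; // EOF
                if (n == 0) return total;                  // drained; wait for the next READ event
                buf.flip();
                while (buf.hasRemaining()) sink.append((char) buf.get());
                total += n;
            }
        }
    }
    ```

    On a non-blocking channel, the `n == 0` branch is what hands control back to the Selector only once a burst has actually been absorbed.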
    Only one thing which I don't like about Java NIO: there is no way to specify a timeout for an I/O handle (SelectableChannel) - you need to do it in code (as Mike does with TimeoutCB) - and there is no error-related event. If the client end point closes a channel or crashes, it triggers a READ event.
    I agree whole-heartedly on both. NIO really needs:

    • An EXCEPTION event per-selectable component
    • A TIMEOUT event per-selectable component
    • Better exception semantics at the Selector level
    The last in particular is killing me. If you disable an adapter under Windows XP, my Selector goes into a tight loop. It pops out immediately, obviously due to the ethernet adapter going down, but with no ready events. This may be more of an implementation issue than an interface-level one, but it's a real PITA nonetheless. I'm going to have to implement a loop back channel hack specifically to detect this. But beyond that, yes - too much is piggybacked on READ events right now.
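    One generic way to cope with such a spinning Selector is to count consecutive select() calls that return with no ready keys and rebuild the Selector when a threshold is crossed. The class below is a hypothetical sketch of just the detection half; the names and threshold are invented, and re-registering the keys on a fresh Selector is left out.

    ```java
    // Flags when a Selector appears to be spinning: too many consecutive
    // select() wakeups that deliver zero ready keys. The caller decides
    // what "rebuild" means (typically: open a new Selector and re-register).
    public class SpinGuard {
        private final int threshold;
        private int consecutiveEmpty = 0;

        public SpinGuard(int threshold) {
            this.threshold = threshold;
        }

        // Call once after each select() with the number of ready keys.
        // Returns true when the Selector should be rebuilt.
        public boolean record(int readyKeys) {
            if (readyKeys == 0) {
                if (++consecutiveEmpty >= threshold) {
                    consecutiveEmpty = 0; // reset so we don't re-trigger immediately
                    return true;
                }
            } else {
                consecutiveEmpty = 0; // real work arrived; not spinning
            }
            return false;
        }
    }
    ```

    A stricter variant would also check that the empty wakeups returned well before the select() timeout elapsed, to avoid mistaking ordinary timeouts for a spin.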

         -Mike
  31. ACE[ Go to top ]

    Actually it's not a high level design only. There is a full blown implementation of these patterns in the ACE framework. It is used in the TAO realtime ORB implementation.
  32. ACE[ Go to top ]

    Actually it's not a high level design only. There is a full blown implementation of these patterns in the ACE framework. It is used in the TAO realtime ORB implementation.
    ACE remains the single best software library I have worked with as a middleware implementer.

    Continuous reinvention drives me nuts.

    Greg
  33. Blog?[ Go to top ]

    Get 404 when I try to hit your blog: The requested resource (/page/pyrasun) is not available.
  34. Jroller has gone fishin'[ Go to top ]

    Apparently JRoller has gone fishing, is out to lunch, or otherwise engaged in some personal activity and as such is unable to serve up any content for most of this afternoon.

    Fortunately, the blog entry has been recreated (and reformatted with fancy graphics) here at TSS:

       http://www.theserverside.com/blogs/showblog.tss?id=DispellingNIOMyths

      -Mike
  35. one byte at a time reads[ Go to top ]

    Regarding the following section:

    "Most people don't seem to realize that Java sockets really love to deal with just one byte at first, and then open the flood gates immediately afterwards. For example, if you do a read quite often your socket will give you just one byte, or just a few - but a read immediately afterwards will give you a buttload of data."

    I'm not sure what you are really trying to say here. There should be nothing about Java sockets that "like to deal with just one byte at first".

    It sounds more to me like the client is writing things onto the network one byte or other primitive at a time, with no buffering. The result is one TCP packet is sent for the first write, the server receives it, you read it. Then they send another TCP packet with their next write, the server receives it, you read it. Doing a double read just allows long enough for the second packet to arrive.

    Depending on if they disable Nagle or not, either the client will then send all its data in little wee chunks, being extremely inefficient from a network perspective, or will have further data delayed by the Nagle algorithm, resulting in unnecessary latencies. While sometimes there is no better option than the client writing a byte at a time due to the nature of what the protocol is used for, normally there are a variety of options to avoid it.

    Watch the network traffic with tcpdump or some such, and I expect that is what you will see. While you are not always able to control the client, if you can then fixing this client problem will help the client, the network, and the server be more efficient. If this isn't the case, I would definitely be interested to hear more details.
  36. one byte at a time reads[ Go to top ]

    I'm not sure what you are really trying to say here. There should be nothing about Java sockets that "like to deal with just one byte at first".

    It sounds more to me like the client is writing things onto the network one byte or other primitive at a time, with no buffering. The result is one TCP packet is sent for the first write, the server receives it, you read it. Then they send another TCP packet with their next write, the server receives it, you read it. Doing a double read just allows long enough for the second packet to arrive.
    Nope, has nothing to do with writing single-character packets or anything like that. To be explicit:

      - A packet comes in. Let's say for yucks it's 500 bytes.
      - Selector signals READ readiness.
      - We stay in non-blocking mode. Let's say we're using fixed sizes here,
        so we "know" this is 500 bytes. We set our ByteBuffer to a limit of
        500 and issue a read.
      - Read returns <500 bytes, often just 1 byte, even though a 500 byte packet came in.

    Most people _never_ see this because very few people actually use non-blocking I/O, and they do the moral equivalent of readFully(). Of the few people who do use non-blocking I/O - they just follow what read() returns and act accordingly.
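    To make the consequence concrete: under non-blocking I/O a fixed-size frame has to be accumulated across possibly many short reads, one per READ event. A minimal sketch with invented names (this is not EmberIO's buffer management):

    ```java
    import java.io.EOFException;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.ReadableByteChannel;

    public class FrameReader {
        private final ByteBuffer frame;

        public FrameReader(int frameSize) {
            this.frame = ByteBuffer.allocate(frameSize);
        }

        // Feed the accumulator from a (possibly non-blocking) channel.
        // read() may legally return far fewer bytes than actually arrived,
        // so we loop, and bail out with null on a zero-length read to wait
        // for the next READ event. Returns the completed, flipped frame.
        public ByteBuffer poll(ReadableByteChannel ch) throws IOException {
            while (frame.hasRemaining()) {
                int n = ch.read(frame);
                if (n < 0) throw new EOFException("peer closed mid-frame");
                if (n == 0) return null; // short read: come back on the next READ event
            }
            frame.flip();
            return frame;
        }
    }
    ```

    The blocking-mode "moral equivalent of readFully()" hides exactly this loop, which is why most people never notice the 1-byte first read.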
    Depending on if they disable Nagle or not, either the client will then send all its data in little wee chunks, being extremely inefficient from a network perspective, or will have further data delayed by the Nagle algorithm, resulting in unnecessary latencies. While sometimes there is no better option than the client writing a byte at a time due to the nature of what the protocol is used for, normally there are a variety of options to avoid it.

    Watch the network traffic with tcpdump or some such, and I expect that is what you will see. While you are not always able to control the client, if you can then fixing this client problem will help the client, the network, and the server be more efficient. If this isn't the case, I would definitely be interested to hear more details.
    All clients are written by me, don't write a character at a time, and do not disable Nagle.

    I understand fully the theory you're talking about - but have you actually statistically measured the lengths that read() returns? I have. I don't care what the theory says it should be, I care what real JVMs actually do on common operating systems.

        -Mike
  37. one byte at a time reads[ Go to top ]

    All clients are written by me, don't write a character at a time, and do not disable Nagle. I understand fully the theory you're talking about - but have you actually statistically measured the lengths that read() returns? I have. I don't care what the theory says it should be, I care what real JVMs actually do on common operating systems. -Mike
    Interesting. No, I have not measured it yet.

    Do you have an example of one or more combinations of JVMs and operating systems (with kernel versions if it is massive-changes-in-minor-revisions Linux) that you have seen this on? When you see it, is it fairly repeatable or sporadic?

    And, I guess, have you seen any platforms or JVMs where this reliably doesn't happen?

    This is not something I have heard about before (although I haven't yet had a need to dive into any NIO features, but that may change very soon), and based on my previous experiences with nonblocking IO on various platforms and languages it would seem to be unexpected and undesirable, and would be good to track down, wherever it lies... any help you can provide in terms of giving a few more details on platforms where you have seen it would be great.

    I agree that what theory says isn't as important for implementing things today as what practice shows, but if the two don't agree it is worth figuring out why, and I may be able to work that into some tasks I have to do anyway.

    Thanks!
  38. one byte at a time reads[ Go to top ]

    Interesting. No, I have not measured it yet.

    Do you have an example of one or more combinations of JVMs and operating systems (with kernel versions if it is massive-changes-in-minor-revisions Linux) that you have seen this on? When you see it, is it fairly repeatable or sporadic?
    Always measure, never assume :-)

    I've seen this on Windows XP w/ Java 1.4.2 and on HP-UX 11 with various 1.4 versions. Never tried it on Linux. As I mentioned somewhere (answer to a comment on the blog?) I first saw this at the C-level on SunOS 4.1.3.
    This is not something I have heard about before (although I haven't yet had a need to dive into any NIO features, but that may change very soon), and based on my previous experiences with nonblocking IO on various platforms and languages it would seem to be unexpected and undesirable, and would be good to track down, wherever it lies... any help you can provide in terms of giving a few more details on platforms where you have seen it would be great.
    Well, like I said almost no one I know of has actually bothered to investigate what gets returned from a blocking read call under load situations. They either use a debugger and dink around, or just never check it. I first observed this behavior, as I mentioned, on SunOS - around a decade ago. The performance I was seeing wasn't matching what I expected, so I investigated.

    As an aside (and this isn't directed at you Marc), I've noted two weird trends in the past 3 or 4 years around this sort of subject:

      - Few people do meaningful performance tests
      - Those who do, don't question the numbers from every angle, or they obsess only over one number.

    Way too many developers of comms software do zero true load testing, and of the remaining few who do, most target one thing - most often MBits/second or msgs/second - and never bother to look at any other variable. So they'll hit 90% of the theoretical bandwidth - and stop. Or hit their target msgs/second and stop. And more often than not they'll never notice that while their big messages satisfyingly hit 90% utilization, 5K messages use like 2% utilization. Or they hit a throughput number and are shocked when someone comes along and says "BTW, at that throughput you're averaging 200 milliseconds latency - wouldn't you say that's rather high old chap?".

    This isn't just limited to the open source realm. Where I used to work people would pat themselves on the back for pumping 12,000 msgs a second through their systems. And our boss almost had a conniption when I did some simple tests in an afternoon and showed that latency was in the 300-400 millisecond mark (for a system where 50 milliseconds latency should've been seen as high).
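    The throughput-versus-latency point is cheap to act on: record a send and a receive timestamp per message and report both msgs/second and a high latency percentile, not just one number. A hypothetical sketch (all names invented for illustration):

    ```java
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Records per-message send/receive timestamps and reports both
    // throughput and latency, the two numbers that must be read together.
    public class LoadStats {
        private final List<Long> latenciesNanos = new ArrayList<>();
        private long firstSend = Long.MAX_VALUE;
        private long lastRecv = Long.MIN_VALUE;

        public void record(long sendNanos, long recvNanos) {
            latenciesNanos.add(recvNanos - sendNanos);
            firstSend = Math.min(firstSend, sendNanos);
            lastRecv = Math.max(lastRecv, recvNanos);
        }

        // Messages per second over the whole run.
        public double msgsPerSecond() {
            return latenciesNanos.size() / ((lastRecv - firstSend) / 1e9);
        }

        // Latency at percentile p (0 < p <= 100), nearest-rank method.
        public long latencyPercentileNanos(double p) {
            List<Long> sorted = new ArrayList<>(latenciesNanos);
            Collections.sort(sorted);
            int idx = (int) Math.ceil(p / 100.0 * sorted.size()) - 1;
            return sorted.get(Math.max(idx, 0));
        }
    }
    ```

    An afternoon with something like this is exactly the kind of check that turns "12,000 msgs a second" into "12,000 msgs a second at 300-400 milliseconds latency".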

    To pick on a few people in open source, though - the Jetty folks tried out NIO with approximately an afternoon's worth of work with a pretty NIO module of a couple hundred lines, found they didn't like the throughput or latency _and gave up_. And wrote a rather definitive sounding article based on those few hours of playing around effectively trashing NIO. Maybe as EmberIO results come rolling in they'll reconsider at some point.

     JGroups, aka JavaGroups, has been around for over three years and the team there still has never published any meaningful performance tests of their own (no, I don't consider the tests here http://www.jgroups.org/javagroupsnew/docs/performance.html to be meaningful!). I understand the bitterness people have over benchmarks, but just offering no benchmarks is far worse. There is no _good_ reason for them to not perform and publish such tests - but there are many bad ones (no, I won't speculate which bad ones apply - too little information).

    It'd be nice if open source projects spent just a tad more time on the baseline technology and realistic testing, and a bit less on things like writing code to coddle users who don't know the right IP address to use on a multi-NIC machine.

        -Mike
  39. ActiveMQ and EmberIO[ Go to top ]

    http://cvs.activemq.codehaus.org/activemq/src/java/org/codehaus/activemq/transport/ember/