Java Best Practices – Char to Byte and Byte to Char conversions

  1. Continuing our series of articles on proposed practices for working with the Java programming language, we are going to talk about String performance tuning. In particular, we will focus on how to handle character-to-byte and byte-to-character conversions efficiently when the default encoding (UTF-16) is used.

    This article concludes with a performance comparison between two proposed custom approaches and two classic ones (the "String.getBytes()" method and the NIO ByteBuffer) for converting characters to bytes and vice versa.

    Read more at Java Code Geeks: Java Best Practices – Char to Byte and Byte to Char conversions

  2. FWIW

    Justin -

    OK, a couple points (since we're counting clock cycles):

    When all characters to be converted are ASCII characters, a proposed conversion method is the one shown below...

    public static byte[] stringToBytesASCII(String str) {
        // Copy the String's characters into a new char[] (a full copy).
        char[] buffer = str.toCharArray();
        byte[] b = new byte[buffer.length];
        // Truncate each char to its low 8 bits; correct for ASCII input only.
        for (int i = 0; i < b.length; i++) {
            b[i] = (byte) buffer[i];
        }
        return b;
    }

    It's much better to call:

    str.getBytes(0, str.length(), b, 0);

    It will avoid an allocation and a copy, among other things.
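
    For reference, a minimal sketch of this suggestion (the method name is mine, not from the thread). This getBytes overload has been deprecated since JDK 1.1 because it truncates each char to its low 8 bits, but for ASCII-only input that truncation is exactly the intent:

    public static byte[] stringToBytesASCII2(String str) {
        byte[] b = new byte[str.length()];
        // Write the low byte of each char directly into b, with no
        // intermediate char[] allocation or copy.
        str.getBytes(0, str.length(), b, 0);
        return b;
    }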

    In another example, accessing the same byte from a byte[] twice performs poorly, since the JVM isn't allowed to read the element once and reuse the value for the second access.
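
    To make that concrete, here is a hypothetical fragment (not from the article) that uppercases ASCII letters in place, written both ways:

    // Each b[i] below is a separate array load; the JIT cannot in general
    // assume the element is unchanged between accesses.
    static void upperCaseAsciiTwoLoads(byte[] b) {
        for (int i = 0; i < b.length; i++) {
            if (b[i] >= 'a' && b[i] <= 'z') { // first and second load
                b[i] = (byte) (b[i] - 32);    // third load
            }
        }
    }

    // Caching the element in a local variable allows a single load:
    static void upperCaseAsciiOneLoad(byte[] b) {
        for (int i = 0; i < b.length; i++) {
            byte cur = b[i];
            if (cur >= 'a' && cur <= 'z') {
                b[i] = (byte) (cur - 32);
            }
        }
    }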

    Using toCharArray() is also a poor performer, because you don't know the size of the array, and some JVMs will perform poorly when making larger allocations (they won't use the slab). To avoid this, keep a small (e.g. 256-element) buffer and use getChars() for each chunk, as sketched below. I've profiled it with various power-of-two sizes, and at 256 it's never slower than toCharArray(), and it doesn't get measurably faster at larger sizes.
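
    A sketch of that chunked approach, assuming the same ASCII-only conversion as the article's example (the method and constant names are mine):

    private static final int CHUNK = 256;

    public static byte[] stringToBytesChunked(String str) {
        int len = str.length();
        byte[] b = new byte[len];
        // One small, reusable buffer instead of a full-size char[] copy.
        char[] buffer = new char[Math.min(len, CHUNK)];
        for (int offset = 0; offset < len; offset += CHUNK) {
            int end = Math.min(offset + CHUNK, len);
            str.getChars(offset, end, buffer, 0); // copy one chunk of chars
            for (int i = offset; i < end; i++) {
                b[i] = (byte) buffer[i - offset]; // ASCII truncation, as before
            }
        }
        return b;
    }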

    Peace,

    Cameron Purdy | Oracle Coherence

    http://coherence.oracle.com/

  3. FWIW

    Hello Cameron,

    Thanx a lot! Your hints are more than welcome and valuable to me! I will try them out and update the article with any performance gains I find!

    Thanx in advance!


    BRs

  4. Why manual conversion

    Hello,

    I must say that I find manual decoding of bytes into characters generally a very bad idea. There are very efficient CharsetDecoders and CharsetEncoders that do the job very well.

    The presented methods, such as String#getBytes(Charset), always create a new CharsetEncoder or CharsetDecoder under the hood. This is, of course, very inefficient.

    If my application does a lot of byte-char conversion, why not keep a pool of CharsetDecoders and CharsetEncoders that allows for thread-safe and efficient conversion?
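
    A minimal sketch of that pooling idea, shown for the encoding direction and assuming UTF-8 (a decoder pool would be symmetric; the class and field names are illustrative). Since CharsetEncoder is stateful and not thread-safe, a ThreadLocal is the simplest per-thread pool:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;

    public class PooledEncoding {
        private static final Charset CHARSET = Charset.forName("UTF-8");

        // One reusable encoder per thread instead of one per call.
        private static final ThreadLocal<CharsetEncoder> ENCODERS =
            new ThreadLocal<CharsetEncoder>() {
                @Override
                protected CharsetEncoder initialValue() {
                    return CHARSET.newEncoder();
                }
            };

        public static ByteBuffer encode(String s) throws CharacterCodingException {
            // The convenience encode(CharBuffer) resets the encoder first,
            // so the per-thread instance can be reused safely.
            return ENCODERS.get().encode(CharBuffer.wrap(s));
        }
    }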

    Sorry, but this is reinventing the wheel.

    Kind regards,

    Karol

  5. Why manual conversion

    Hello Karol,

    If you think that having a pool of CharsetDecoders will perform better than our proposed approach, then we will be more than interested to see your performance results. Let me point out, though, that our proposed solution is "encoding" agnostic and there is no decoding or character conversion performed. With the above in mind, I strongly disagree that you will achieve superior performance results compared to our approach by using pooled decoders.

    BRs

  6. Re

    Hello Justin,

    The code in the example will perform better than the CharsetDecoder for UTF-16; that is correct.

    The problem is that it doesn't really cover all aspects of UTF-16. It ignores byte order marks, for example, and only works for one specific endianness of the input. If you modify the code into a generic UTF-16 decoder or encoder, you will end up with performance similar to that of the existing CharsetEncoder or CharsetDecoder.
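
    To make the endianness point concrete, here is a small illustration using only standard JDK charsets (Java 7's StandardCharsets is assumed):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class Utf16Endianness {
        public static void main(String[] args) {
            // The same character has three different byte layouts:
            System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16BE))); // [0, 65]
            System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16LE))); // [65, 0]
            System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16)));   // [-2, -1, 0, 65] (BOM + big-endian)
            // A hand-rolled "(byte) c" cast bakes in one of these layouts
            // and ignores the byte order mark entirely.
        }
    }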

    But yes, if restricting the input to a subset of UTF-16 is OK, the code will perform better and might be a valid choice for a performance-critical application.

    On the other hand, I believe that the performance gain will be outweighed by the limitations of the code and by the fact that you will most likely need to support other character sets in the future, such as UTF-8 or some of the single-byte encodings. I'm not a fan of proprietary solutions unless there is no other way.

    The good news is, it's up to the developers to decide :-)

    Kind regards,

    Karol

  7. wrong

    This is wrong on so many levels...

    For one (important) thing: 

    If no “charsetName” is provided, the operation encodes the String into a sequence of bytes using the platform's default character set (UTF-16).

    The "default character set" is not UTF-16. Not in general, and not typically. It "The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system." (ref)

    The author seems to be confusing the charset that Java uses internally (in memory) to represent strings (which is UTF-16, with some caveats) with the platform default charset (which is used when converting between strings and bytes). This last operation is typically needed when "going to the outside world" (writing to or reading from the filesystem or network, serializing/deserializing, etc.), and in that situation UTF-16 is much less common than UTF-8 or some ISO-8859-1 variant.
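
    A quick demonstration of the distinction, using only standard JDK calls (Java 7's StandardCharsets is assumed; the output depends on the machine):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class DefaultCharsetDemo {
        public static void main(String[] args) {
            // Environment-dependent: often UTF-8 on Linux, windows-1252 on Windows.
            System.out.println(Charset.defaultCharset());
            byte[] platform = "é".getBytes();                       // varies by machine: 1 or 2 bytes
            byte[] explicit = "é".getBytes(StandardCharsets.UTF_8); // always 2 bytes: 0xC3 0xA9
            System.out.println(platform.length + " vs " + explicit.length);
        }
    }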

    Apart from this, the article does not strike me as a good example of "Java Best Practices" at all.

  8. wrong

    Herman,

    Please read the comments on the post. The designated character encoding is just a hint about the platform's internal representation of Strings. As a matter of fact, I have commented exactly the same thing as you post here. The provided approaches are encoding "agnostic" and thus fast. I must admit that mentioning the internal encoding (UTF-16) confused some readers, but I believe my comments clarified things a bit.

    BRs

  9. If the goal is to speed up the transmission of strings over the wire, how would the overhead of the extra bytes affect the overall performance? What you gain in conversion may be more than offset by the extra bytes, especially if the string is long enough to cause extra packets to be transmitted.
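
    For a rough sense of that trade-off, a back-of-the-envelope comparison (the payload size is an arbitrary assumption, chosen to be near a typical Ethernet MTU):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class WireSizeDemo {
        public static void main(String[] args) {
            // Build a 1500-character ASCII payload.
            char[] filler = new char[1500];
            Arrays.fill(filler, 'x');
            String payload = new String(filler);

            int utf8Bytes  = payload.getBytes(StandardCharsets.UTF_8).length;    // 1500
            int utf16Bytes = payload.getBytes(StandardCharsets.UTF_16BE).length; // 3000
            System.out.println(utf8Bytes + " vs " + utf16Bytes);
            // For ASCII text, UTF-16 doubles the payload: time saved in
            // conversion can be lost to the extra packet(s) on the wire.
        }
    }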