Discussions

News: Theory and Practice of Ropes for Java String Manipulations

  1. Systems that manipulate large quantities of string data are poorly served by the Java language's default String and StringBuilder classes. A rope data structure can be a better alternative. This article introduces Ropes for Java, a rope implementation for the Java platform; explores performance issues; and provides pointers for effective use of the library. For more Java resources, and to specifically learn more about Java Testing and Debugging, take a look at this developerWorks Space. For an enterprise Java developer, one of the most interesting considerations for Ropes is that iteration over a flat Rope (a Rope with a depth of 1) is much faster than pulling data character by character from a String. (If you have no need to do this, then Ropes may not mean much of anything to you. It's also worth noting that Java's NIO structure has mmapping capabilities to make iteration over buffers extremely fast, and that Ropes are especially appropriate for editors, not enterprise Java applications - but ignorance isn't bliss, it's ignorance.) An important extract for Java EE developers:
    Often, enterprise Java applications contain code similar to the following:String x = "";x is subsequently sent as part of an HTML page to the client's browser. Does it make sense to use ropes to compute the value of x, rather than a StringBuilder, which is what the compiler generates by default? The answer is no, for a variety of reasons. To begin with, the amount of data being concatenated is, presumably, small, so using a rope is unlikely to improve performance, although it could improve robustness and scalability. (Consider how both solutions would behave if getName unexpectedly returned a 50 MB string.) But for the sake of argument, imagine that many more chunks of data are being concatenated. Given that Rope's append performance is generally better than StringBuffer's, would a rope make sense then? Again, no. Whenever input data is being combined to produce formatted output, the cleanest and most efficient approach is to use a template engine such as StringTemplate or FreeMarker. Not only does this approach cleanly separate presentation markup from code, but templates are compiled once (often to JVM bytecode) and then can be reused, giving them excellent performance characteristics. A second benefit of using templates exposes a fundamental flaw common to output-building routines like those in the code above, including one written using ropes. This benefit is that templates can be incrementally evaluated, and output can be written to a Writer as soon as it is produced, rather than unnecessarily first being accumulated in memory. In a Java EE application, where the Writer is actually a buffered connection to the client's browser, this approach to output rendering uses constant memory, O(1), versus the other solutions' O(n) memory usage. This is a huge improvement to application scalability and robustness, even though it's not apparent for smaller inputs or at lower application loads.

    Threaded Messages (14)

  2. It's a very cool approach, and clearly has some good efficiency advantages. But it has some drawbacks too, which he mentioned in a side bar to his article, but not prominently. Namely: ropes can cause major memory leaks. For example: Rope r = new Rope(); r = r.substring(0, 1); The result is a 1-character rope ... but which is carrying around a massive amount of characters along with it!
  3. Ropes for Java[ Go to top ]

    The other problem with this is that R4J is a GPL licensed project. I know a lot of projects that might look at this but can't because of the GPL license. For example, every single Apache project is unable to use this.
  4. Re: Ropes for Java[ Go to top ]

    +1 GPL should be forbidden for libraries (just kidding, freedom is freedom)
  5. It's a very cool approach, and clearly has some good efficiency advantages. But it has some drawbacks too, which he mentioned in a side bar to his article, but not prominently. Namely: ropes can cause major memory leaks.

    For example:


    Rope r = new Rope();
    r = r.substring(0, 1);


    The result is a 1-character rope ... but which is carrying around a massive amount of characters along with it!
    The same applies to standard Java Strings, I believe. For example:
    String s = new String();
    s = s.substring(0, 1);
  6. I didnt understand how the below code is causing memory leaks. can you tell me... Rope r = new Rope(); r = r.substring(0, 1);
  7. Re: How it causes memory leak[ Go to top ]

    I didnt understand how the below code is causing memory leaks. can you tell me...

    Rope r = new Rope();
    r = r.substring(0, 1);
    The new string instance holds a reference to the structure backing the rope. If that structure is large, you will be using much more memory than a 1 char string normally would and it will not be released until after the 1 char string is unreachable. Technically this isn't a memory leak but in Java-land that's what we call this. This is true in older versions of Java (including 1.4) for String.subString() and StringBuffer.toString() also but seems to have been done away with in recent versions.
  8. Re: How it causes memory leak[ Go to top ]

    This is true in older versions of Java (including 1.4) for String.subString() and StringBuffer.toString() also but seems to have been done away with in recent versions.
    Correction: String.toString() still shares the underlying array in 1.6. StringBuffer.toString() no longer does.
  9. Memory Leak[ Go to top ]

    The same thing that causes a "memory leak" in this case can be thought of as an optimization. By re-using the array, and simply using different count and offset values, you're saving memory when dealing with substrings. FYI: The trim() method has the same properties as the substring method in terms of the re-use of the array, as it's implementation actually uses the substring method. If it's that much of an issue, you can instead do: s = new String(s.substring(0,1)); This copy constructor actually creates a new array if the source array has unused space in it, ensuring that the new string is backed by an array of exactly the right size. Is there something similar in a Rope object?
  10. A small point[ Go to top ]

    Very interesting read. I have one nitpicky question though: Isn't the maximum size of a String 2^16 characters? That's a lot less than 50 megabytes. Or do I have it wrong?
  11. Re: A small point[ Go to top ]

    The maximum size of a String in Java is roughly 2GB, theoretically, if your JVM could manage that.
  12. Re: A small point[ Go to top ]

    Isn't the maximum size of a String 2^16 characters?
    ...
    Or do I have it wrong?
    It's Integer.MAX_VALUE (= 2^31 -1), the maximum length of a char[] (in theory).
  13. hello?! let's clear up a few things[ Go to top ]

    A memory leak in Java land is the same as a memory leak in any other land and s = new String(); s.substring(0,1); will not produce a memory leak, but a very long string object waiting for garbage collection. What happens if you use Rope instead of String... i don't know. Also: the maximum length of a string is not 2^31 because Java saves Strings in modified UTF-16 where one character in supplemental Unicode planes can take up to 6 bytes but not contain a zero-byte (which is great for C compatibility). So the maximum length of a String depends on two things: "what characters do you use" and "what architecture does your VM run on", because the memory limits of a 64-bit VM are very different (naturally) from a 32 bit VM. The maximum length of a String may very well be 2^31 BYTES. Just for completeness: String.length() will not give you accurate results though, because for backwards-compatibility this function returns the count of UCS-2 characters in your string, so it will NOT give accurate results if your String contains characters that are only representable by 3+ bytes.
  14. Re: hello?! let's clear up a few things[ Go to top ]

    A memory leak in Java land is the same as a memory leak in any other land
    A memory leak is caused by memory that is unreachable but not deallocated. In Java this is the exact situation that makes memory available for collection by the garbage collector. The situation described here, where memory is referenced but no longer needed is not the same thing. The term I've seen used for this is a 'space leak'. But since memory leaks don't technically exist (or a always do, depending on your perspective) calling this a memory leak doesn't create any real confusion.
    and

    s = new String();
    s.substring(0,1);

    will not produce a memory leak, but a very long string object waiting for garbage collection.
    The original String object can be collected but the large char[] backing it (the vast majority of the Strings memory) will not be collectable until the second String is unreachable (assuming no other object refer to the array.)
  15. Re: hello?! let's clear up a few things[ Go to top ]

    Also: the maximum length of a string is not 2^31 because Java saves Strings in modified UTF-16 where one character in supplemental Unicode planes can take up to 6 bytes but not contain a zero-byte (which is great for C compatibility).
    No String can be longer than (2^31)-1, so that makes (2^31)-1 the maximum length of a String in Java.
    So the maximum length of a String depends on two things:

    "what characters do you use" and
    "what architecture does your VM run on", because the memory limits of a 64-bit VM are very different (naturally) from a 32 bit VM.

    The maximum length of a String may very well be 2^31 BYTES.
    The limit is (2^31)-1 because arrays are addressed and sized with positive integers. String is not only backed by a single array but all methods that define length (even the unicode code point versions) return an int.