Discussions

News: High-perfor­mance pattern matching algo­rithms in Java

  1. Johann Burkard posted "High-perfor­mance pattern matching algo­rithms in Java," a string search library (version 1.2) that implements a set of string search algorithms in Java. It looks mostly academic, but might be interesting. He says that the algorithms are easily "five to ten times faster" than the naive implementation in java.lang.String.

    I'm not really sure if this makes sense, though. If your search pattern is really large, or your string being searched is really large, then you have a specialized case, in which some of the alternative algorithms might be faster, but the version in java.lang.String isn't exactly "naive" - it's the Knuth Morris Patt algorithm, which is generally very optimal, especially for common string lengths (where you're talking about a few hundred characters at most). Some of these algorithms have startup times that help longer runtimes, but the simpler algorithms would be done by the time the more complex algorithms get started.

    When Mr. Burkard says this is a "naive" algorithm, he makes me fret that he doesn't quite understand what "naive" means, or that he doesn't know what Knuth Morris Pratt is about.

    A naive search, to me, walks through each character in the target string, searching for the first letter in the search string; if it finds one, it then checks to see if the next character in the target string matches the next character in the search string. KMP isn't a naive search, and it optimizes the search very well for the general case.

    That said, who knows? This library might be really useful for someone. I know that way back in the day I thought that Java's library source being available meant that maybe stuff like this would have a best-of-breed philosophy attached to it, so that a simpler search might be replaced by a better one if people worked something out.

    Obviously, that's happened a few times, but mostly by insiders; ordinary guys like me would never have code used by the JDK. I wouldn't even know where to start.

    Anyway: string search 1.2.

  2. I'm not really sure if this makes sense, though. If your search pattern is really large, or your string being searched is really large, then you have a specialized case, in which some of the alternative algorithms might be faster, but the version in java.lang.String isn't exactly "naive" - it's the Knuth Morris Patt algorithm, which is generally very optimal, especially for common string lengths (where you're talking about a few hundred characters at most). Some of these algorithms have startup times that help longer runtimes, but the simpler algorithms would be done by the time the more complex algorithms get started.

     

    Well, this is not really the case. In particular : 

    The Java language lacks fast String searching algorithms. String “indexOf(...)” and “lastIndexOf(...)” operations perform a naive search for the provided pattern against a source text. The naive search is based on the “brute force” pattern first exact string matching algorithm. The “brute force” algorithm consists in checking, at all positions in the text, whether an occurrence of the pattern starts there or not. Then, after each attempt, it shifts the pattern by exactly one position to the right. Nevertheless several other algorithms exist that outperform by far the “brute force” algorithm in both speed and efficiency.

    That's the actual case for OpenJDK 1.6.0. Furthermore you can find an in depth analysis on Spring performance tunning technics along with the most complete collection of Exact String Matching algorithms in Java (and a relevant performance comparison test) at the following article :

    Java Code Geeks: Java Best Practices - String performance and Exact String Matching

     

    Best Regards

    Justin