Faster Java Regex from java.net article

Discussions

News: Faster Java Regex from java.net article

  1. Faster Java Regex from java.net article (7 messages)

    In "A Faster Java Regex Package," Tom White discusses Anders Møller's dk.brics.automaton package, which claims that it is significantly faster (see Java Regular expression library benchmarks) than all other Java regex libraries, at the cost of some flexibility compared to the normal Perl-like regex libraries.
    The bottom line is that the JDK regex implementation supports a rich set of regular expressions, whereas dk.brics.automaton supports only the basic features. If you are not concerned about speed - for example, you are just doing a few matches on strings of modest length - then use the JDK implementation. However, if regular expressions are a bottleneck in your program and you don't need advanced regex features, then dk.brics.automaton might be a great fit.

    Given the issues of the JRE's size, the question should be asked: does the JRE need a full Perl-like regular expression language, or is a simpler expression language sufficient?

    Threaded Messages (7)

  2. Replace?[ Go to top ]

    Does this library support replacement? I can't see from the JavaDocs or the article that it does.
  3. I use it[ Go to top ]

    I think a lot of code would break if the regular expression package lost features in exchange for optimization. So the discussion is purely hypothetical, it will never be changed. Features will be added, bugs will be fixed but there is no way features will be removed.

    Optimization without losing features is of course possible but is also likely to affect pattern compilation. The tradeoff has probably already been made in favour of not doing that. Maybe some more can be done but I don't expect to see major improvements in this area.

    If you need the performance and not the features, alternative implementations of the reg expression functionality exist or you can write some string manipulation code yourself. The same reasoning applies to many of the standard packages. They are good enough most of the time but alternatives exist. The reality is that in most java applications, regular expressions are not performance critical. Dramatically improving regular expression evaluation will have little or no impact on the performance of most applications that use them.

    If regular expression code is critical in your application (as in you have profiled your code and identified methods from this package as the ones to blame for your performance issues), you have plenty of ways to optimize. In my experience, few Java developers have a good understanding of the real bottlenecks in their software.

    String manipulation and how the JVM optimizes it remains a poorly understood subject. Even much of the advice on optimizing string manipulation is in fact not correct (e.g. using + instead of StringBuffer.append, the JVM optimizes many uses of + itself).
  4. Why couldn't the full, general JDK package detect the simple cases and optimize for them? Patterns have a compiler (Pattern.compile()).


    Pat Niemeyer
  5. Why couldn't the full, general JDK package detect the simple cases and optimize for them? Patterns have a compiler (Pattern.compile()).

    Sounds like a good RFE. It probably already exists.
  6. By the way, referred benchmarks are claiming that dk.brics.automaton is about 30% faster then com.karneim.util.collection.regex (aka jrexx), but my own evaluation shows that jrexx is about 10 times (if not more) faster if you take into account time needed to precompile regex...
  7. The question of whether or not the JRE needs rich perl like regex support is difficult to answer. I have worked in a lot of different verticles and enviroments and I frequently use some of the more "advanced" features of Java's current library. I am understandbly happy, therefore, with the breadth of Java's regex support.

    I have no idea what percentage of Java code uses these advanced regex features. This is really the only way to answer the question.
  8. I haven't looked at the details, but I did write Apache's regexp library back in 1996. The tradeoff in general here is that a DFA implementation is considerably faster, BUT it doesn't keep enough state to do reluctant closures or to capture groups in certain situations. To do these things you must have a stack, which means either a pushdown automata or recursion. Java absolutely made the right choice here and I'm quite happy with the JDK's regexp support. So happy, in fact, that I would not recommend using anything else unless you are seeing a big bottleneck and don't need any of its features. I would also (sadly!) recommend against using my old Apache regexp package unless you are very space-constrained and using a lot of expressions (in which case, it may be just what you need because expressions still do Perl-like things with just a couple of classes, but those expressions can be precompiled down to byte arrays). I would recommend taking a peek at the MetaPattern class in Wicket's (http://wicket.sf.net) util package. This package, when understood and used well can help you reduce the amount of parsing code you write to solve certain problems while actually increasing the readability of your patterns.