Create regular expressions in XML with the Regexml open source library


  1. Regular expressions are great at parsing portions of text out of a string or determining whether text matches a specific pattern. However, this power comes at a cost. Regular expressions can be very complex to write, hard to document, and difficult to understand. The Regexml project provides a simple way to define and document complex regular expressions in XML. For example, the following Regexml expression defines a simple zip (postal) code:

    <regexml xmlns="http://schemas.regexml.org/expressions">
        <expression id="zipcode">
            <start/>
            <match equals="\d" min="5" capture="true"/> <!-- 5 digit zip code -->
            <group min="0">
                <match equals="-"/>
                <match equals="\d" min="4" capture="true"/> <!-- optional "plus 4" -->
            </group>
            <end/>
        </expression>
    </regexml>

     

    After consuming this XML, the Regexml library creates and caches a standard java.util.regex.Pattern object that can be used to parse data out of text or determine if the text matches a pattern. The capture attribute in the XML above indicates the portions of the text that should be parsed out and made available to the client application. The equivalent regular expression looks like this:

    ^(\d{5})(?:-(\d{4}))?$

    Though the traditional regular expression is far shorter, its brevity and cryptic symbols make it harder to read and understand. Of course, for simple expressions like this one, a traditional regular expression specified in the code may be most appropriate. However, as expressions become more complex, the ability to document them in-line, employ whitespace to show hierarchy, and use expressive attributes rather than symbols can simplify maintenance and debugging. For more information about the open source Regexml project, see the overview and comprehensive introduction at:

     http://www.regexml.org/
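
    As a rough illustration of what the capture attributes correspond to, the equivalent regular expression quoted above can be exercised directly with java.util.regex. This sketch does not use the Regexml API at all; the class name ZipCodeDemo and the parse helper are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ZipCodeDemo {
    // The equivalent expression from the post, escaped for a Java string literal.
    static final Pattern ZIP = Pattern.compile("^(\\d{5})(?:-(\\d{4}))?$");

    // Hypothetical helper: returns {zip, plus4}, where plus4 is null when the
    // optional part is absent, or returns null when the text does not match.
    static String[] parse(String text) {
        Matcher m = ZIP.matcher(text);
        if (!m.matches()) {
            return null;
        }
        // Group 1 and group 2 line up with the two capture="true" matches.
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("12345-6789");
        System.out.println(parts[0] + " / " + parts[1]);
        System.out.println(parse("1234") == null);
    }
}
```

    The non-capturing group (?:...) plays the role of the optional group element, so only the two digit runs come back as captures.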

    Threaded Messages (18)

  2. Nice idea but why use XML for this?  XML based code is difficult to read and that seems to work against the goal of this tool.

  3. still cryptic

    Instead of using \d in

      match equals="\d"

    could you use something like "digits".  It's still somewhat cryptic with that \d.

  4. still cryptic

    Instead of using \d in

      match equals="\d"

    could you use something like "digits".  It's still somewhat cryptic with that \d.

    You know, I did consider that. The reason I stuck with the regular expression syntax in this case (though I'm very open to any ideas that would make it simpler) was primarily to support character classes. For example, what if rather than just digits, you wanted to specify that digits plus the letters a through f are valid? You would accomplish that like this in Regexml:

    <match equals="[\da-fA-F]"/>

    Of course, this works as well:

    <match equals="[0-9a-fA-F]"/>

    The "\d" is really just a convenient shortcut for "[0-9]" that those who have used regular expressions will be familiar with. So, I do like the idea of simplifying how character sets are specified but not at the expense of being able to define more complex rules.
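
    A quick way to check the shortcut claim is to compile both character classes in plain Java and compare them on a few inputs. This is only a sketch: the class name HexClassDemo is invented, and the trailing + quantifier is added here so that whole strings can be tested (it is not part of the Regexml snippets above).

```java
import java.util.regex.Pattern;

final class HexClassDemo {
    // Character class using the \d shorthand, as in the first match element.
    static final Pattern WITH_SHORTHAND = Pattern.compile("[\\da-fA-F]+");
    // The same class with the digits spelled out as 0-9.
    static final Pattern SPELLED_OUT = Pattern.compile("[0-9a-fA-F]+");

    // True when the whole text consists of hex digits.
    static boolean isHex(String text) {
        return SPELLED_OUT.matcher(text).matches();
    }

    // True when both patterns give the same answer for the text.
    static boolean agree(String text) {
        return WITH_SHORTHAND.matcher(text).matches()
                == SPELLED_OUT.matcher(text).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "c0ffee", "DEADBEEF", "xyz", "12g4" }) {
            System.out.println(s + " hex=" + isHex(s) + " agree=" + agree(s));
        }
    }
}
```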

  5. still cryptic

    I think what you may need is an XML way to define and reference character classes (and bundle pre-defined character classes with your schema).

  6. still cryptic

    I think what you may need is an XML way to define and reference character classes (and bundle pre-defined character classes with your schema).

    That could work. Providing an "XML way" to define character classes could make them easier to read. I was a bit concerned about the size of the expression exploding though. Keeping the character class definition compact reduced the amount of XML. In fact, right now you can use groups to create a form of character class. Consider this example:

    <regexml xmlns="http://schemas.regexml.org/expressions">
        <expression id="example">
            <group operator="or">
                <match equals="a"/>
                <match equals="d"/>
                <match equals="f"/>
            </group>
        </expression>
    </regexml>

    This example will match on the character a, d, or f. This is equivalent to this expression:

    <regexml xmlns="http://schemas.regexml.org/expressions">
        <expression id="example">
            <match equals="[adf]"/>
        </expression>
    </regexml>

    The second example is a lot shorter. Of course, the goal of this project is to create more readable expressions rather than shorter ones. However, in this case, I don't think that the "XML way" of defining a character class is worth the additional size. Perhaps another way of defining character classes in XML could be both more intuitive and not quite so drawn out.
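
    The equivalence of the two forms can be checked in plain Java. Note the assumption: the thread does not show what regular expression Regexml generates for an "or" group, so the non-capturing alternation below is a guess at a reasonable translation, and the class name OrGroupDemo is invented for the example.

```java
import java.util.regex.Pattern;

final class OrGroupDemo {
    // Assumed translation of <group operator="or">: a non-capturing alternation.
    static final Pattern ALTERNATION = Pattern.compile("(?:a|d|f)");
    // The compact character class from the second example.
    static final Pattern CHAR_CLASS = Pattern.compile("[adf]");

    // True when both patterns accept or reject the whole text alike.
    static boolean sameResult(String text) {
        return ALTERNATION.matcher(text).matches()
                == CHAR_CLASS.matcher(text).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "a", "d", "f", "b", "ad" }) {
            System.out.println(s + " matches=" + CHAR_CLASS.matcher(s).matches()
                    + " same=" + sameResult(s));
        }
    }
}
```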

  7. Nice idea but why use XML for this?  XML based code is difficult to read and that seems to work against the goal of this tool.

    The thought was, "what format would be more human readable than a long, complex regular expression string?" XML at least allows for the use of white space and inline comments as well as named elements/attributes in order to convey meaning. The simple example shown doesn't really do it justice. After building a complex expression in Regexml, I've been very surprised how unreadable the equivalent regular expression is (Regexml allows you to retrieve the equivalent regular expression if desired). XML schemas are another reason that I chose XML. By providing a schema, Regexml expressions are easy to create in any XML editor. The schema allows the editor to offer "code completion" type of assistance reminding the developer of the valid elements and attributes at any point in the expression.

  8. The thought was, "what format would be more human readable than a long, complex regular expression string?" XML at least allows for the use of white space and inline comments as well as named elements/attributes in order to convey meaning. The simple example shown doesn't really do it justice. After building a complex expression in Regexml, I've been very surprised how unreadable the equivalent regular expression is (Regexml allows you to retrieve the equivalent regular expression if desired). XML schemas are another reason that I chose XML. By providing a schema, Regexml expressions are easy to create in any XML editor. The schema allows the editor to offer "code completion" type of assistance reminding the developer of the valid elements and attributes at any point in the expression.

    In general, I think this is a great idea.  I'm wondering why this kind of thing hasn't been done before (assuming it hasn't.)

    For the readability aspect, you could have gone with something like JavaCC and created a DSL explicitly for this purpose.  Code completion is a little trickier but it can be done.  I just don't know what level of effort is required.

    Another approach that gives you the code completion is the commonly used Builder pattern.

    I don't really have a problem with XML from a persistence perspective but it's not well suited for code.  XML has a low signal to noise ratio.
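
    To make the Builder suggestion concrete, here is a minimal sketch of what such an API might look like for the zip code example. Everything in it is hypothetical: RegexBuilder and its methods are invented for illustration and are not part of Regexml or any library mentioned in this thread.

```java
import java.util.regex.Pattern;

// Hypothetical fluent builder that assembles a regular expression string.
final class RegexBuilder {
    private final StringBuilder regex = new StringBuilder();

    RegexBuilder start() { regex.append('^'); return this; }

    RegexBuilder end() { regex.append('$'); return this; }

    // Exactly `count` digits.
    RegexBuilder digits(int count) {
        regex.append("\\d{").append(count).append('}');
        return this;
    }

    // Literal text, quoted so that regex metacharacters are not interpreted.
    RegexBuilder literal(String text) {
        regex.append(Pattern.quote(text));
        return this;
    }

    // Wraps another builder's expression in a capturing group.
    RegexBuilder capture(RegexBuilder inner) {
        regex.append('(').append(inner.regex).append(')');
        return this;
    }

    // Wraps another builder's expression in an optional non-capturing group.
    RegexBuilder optional(RegexBuilder inner) {
        regex.append("(?:").append(inner.regex).append(")?");
        return this;
    }

    Pattern build() { return Pattern.compile(regex.toString()); }

    // Builds the zip code expression from the original post.
    static Pattern zipCode() {
        return new RegexBuilder()
                .start()
                .capture(new RegexBuilder().digits(5))
                .optional(new RegexBuilder()
                        .literal("-")
                        .capture(new RegexBuilder().digits(4)))
                .end()
                .build();
    }

    public static void main(String[] args) {
        System.out.println(zipCode().pattern());
        System.out.println(zipCode().matcher("12345-6789").matches());
    }
}
```

    An IDE offers completion on the method names here much as a schema-aware XML editor does for elements and attributes, though the expression then lives in code rather than in a separate file.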

     

  9. Representations

    Why not make the library agnostic of the representation?

    Support multiple representations in addition to XML like JSON, YAML, etc....

    It should be relatively easy to form an object model representing a regex, and then it simply becomes a matter of creating an appropriate parser for the representation.

  10. Representations

    Why not make the library agnostic of the representation?

    Support multiple representations in addition to XML like JSON, YAML, etc....

    It should be relatively easy to form an object model representing a regex, and then it simply becomes a matter of creating an appropriate parser for the representation.

    This project doesn't attempt to create a new object model to represent a regex because that's already been done by the various platforms (Pattern object in Java, Regex object in .NET, etc.) Rather, this library simply translates an expression from a more readable XML format into the more cryptic regular expression syntax which is then used to create the object model. So, basically this project is the implementation of a parser for the XML representation of a regex. As you suggest, parsers for other representations could be created as well.
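
    For illustration only, the representation-agnostic object model suggested above might be sketched like this. It is not how Regexml works (as noted, the library translates XML straight into regular expression syntax); the Node, Match, and Sequence types are hypothetical, and a parser for XML, JSON, or YAML would be responsible for building the tree.

```java
import java.util.Arrays;
import java.util.List;

final class RegexModelDemo {
    // Hypothetical node in a format-independent regex model.
    interface Node {
        String toRegex();
    }

    // Anchors for the start and end of the input.
    static final Node START = () -> "^";
    static final Node END = () -> "$";

    // Matches `chars` exactly `count` times, optionally as a capturing group.
    static final class Match implements Node {
        private final String chars;
        private final int count;
        private final boolean capture;

        Match(String chars, int count, boolean capture) {
            this.chars = chars;
            this.count = count;
            this.capture = capture;
        }

        @Override
        public String toRegex() {
            String body = count == 1 ? chars : chars + "{" + count + "}";
            return capture ? "(" + body + ")" : body;
        }
    }

    // A run of child nodes; when optional, wrapped as a non-capturing (?:...)? group.
    static final class Sequence implements Node {
        private final boolean optional;
        private final List<Node> children;

        Sequence(boolean optional, Node... children) {
            this.optional = optional;
            this.children = Arrays.asList(children);
        }

        @Override
        public String toRegex() {
            StringBuilder sb = new StringBuilder();
            for (Node child : children) {
                sb.append(child.toRegex());
            }
            return optional ? "(?:" + sb + ")?" : sb.toString();
        }
    }

    // The zip code expression from the original post, built as a tree.
    static Node zipCode() {
        return new Sequence(false,
                START,
                new Match("\\d", 5, true),
                new Sequence(true,
                        new Match("-", 1, false),
                        new Match("\\d", 4, true)),
                END);
    }

    public static void main(String[] args) {
        System.out.println(zipCode().toRegex());
    }
}
```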

  11. Representations

    This project doesn't attempt to create a new object model to represent a regex because that's already been done by the various platforms (Pattern object in Java, Regex object in .NET, etc.) Rather, this library simply translates an expression from a more readable XML format into the more cryptic regular expression syntax which is then used to create the object model. So, basically this project is the implementation of a parser for the XML representation of a regex. As you suggest, parsers for other representations could be created as well.

    This is clearly a very pragmatic approach, but if one of your goals is to make Regexml consistent across platforms, that's not going to be possible using this approach.

  12. regexp is a standard

    Yes, it's a pain in the @ss to get the right expression.  However, regexp syntax is a standard across programming languages.  So, if one is dealing w/ regexp, then your time is better spent on learning regexp than an XML-based regexp.  Just a thought!

  13. regexp is a standard

    Yes, it's a pain in the @ss to get the right expression.  However, regexp syntax is a standard across programming languages.  So, if one is dealing w/ regexp, then your time is better spent on learning regexp than an XML-based regexp.  Just a thought!

    Yep. That may be true. However, the idea here was to create a regular expression language that was so simple that it would hardly take any time to learn it (especially for those familiar with regexp).

  14. Yes, it's a pain in the @ss to get the right expression.  However, regexp syntax is a standard across programming languages.  So, if one is dealing w/ regexp, then your time is better spent on learning regexp than an XML-based regexp.  Just a thought!

    I agree – putting a complex regexp together can sometimes make you feel like you are trying to push a boulder off a cliff. You know that it will fall with phenomenal speed, IF you can succeed at getting it over the edge. For me though, I would prefer some kind of builder rather than having to learn another set of XML-based syntax tags, or simply put the effort into getting a good grip on regexp, which I can use over a wide set of applications.

  15. Whatever comments and improvements could be made, this library surely makes regexp more readable and, because of its use of XML, is open and fairly cross-platform. Hence I like it.

    Thanks for coding it!

  16. Whatever comments and improvements could be made, this library surely makes regexp more readable and, because of its use of XML, is open and fairly cross-platform. Hence I like it.

    Thanks for coding it!

    Thanks for the kind comments Tom. In fact, the goal of this project is not only to make regular expressions simpler to read and write but also to make them more platform independent. Since a Regexml port to any other platform (.NET is currently being considered) would use the same XML schema, expression files could be copied over and used without modification (eliminating the small differences that currently exist between regular expression implementations on different platforms).

  17. Whatever comments and improvements could be made, this library surely makes regexp more readable and, because of its use of XML, is open and fairly cross-platform. Hence I like it.

    Thanks for coding it!

    Thanks for the kind comments Tom. In fact, the goal of this project is not only to make regular expressions simpler to read and write but also to make them more platform independent. Since a Regexml port to any other platform (.NET is currently being considered) would use the same XML schema, expression files could be copied over and used without modification (eliminating the small differences that currently exist between regular expression implementations on different platforms).

    Even though I think you are completely unsympathetic to my concerns about using XML, I think it's worth pointing out that any DSL is just as cross-platform as an XML-based DSL.  I do understand why XML is a common choice, though.  You don't have to build a parser.  You might want to consider, however, that the creator of Ant later expressed regret and apologized for basing it on XML.

  18. Even though I think you are completely unsympathetic to my concerns about using XML, I think it's worth pointing out that any DSL is just as cross-platform as an XML-based DSL.  I do understand why XML is a common choice, though.  You don't have to build a parser.  You might want to consider, however, that the creator of Ant later expressed regret and apologized for basing it on XML.

    Thanks for your comments, James. I understand your concerns with XML. I am also aware of James Davidson's comments about how he'd opt for a more script-like language (like Gant) rather than XML if he were to build Ant again. A big drawback of using XML for Ant was the lack of scripting logic such as loops and conditional constructs. However, I think that's more of a concern when constructing a build tool than when defining a regular expression. And you're absolutely right, XML was chosen because of what came with it for "free" (editors, parsers, schemas, universal familiarity, etc.) even though a DSL could have made the syntax even more intuitive.

  19. This sounds like fun!