Sex, Eger!

or

Reverse Regular Expressions

  1. What is a Reversed Regular Expression?
  2. Applying "Sexeger"
    1. Better Efficiency in Reversal
    2. Case Study: C Comments
  3. Uses of Sexeger
    1. Look-behind Assertions
    2. Inserting Commas in Numbers
  4. Where Do We Go From Here?

Update! sheriff_p from #EFnet::perl will be giving a Lightning Talk on this paper in my stead at YAPC::Europe!

What is a Reversed Regular Expression?

If we're able to look at the specific nodes of a regular expression, then we should be able to create a regex that works from the BACK of a string to the FRONT. What do I mean? ($last_numbers) = $string =~ /(\d+)(?!\D*\d)/; That regex finds the last occurrence of digits in $string. Now, it's not all that bad, but it could be made a great deal simpler by:
  1. reversing the input string
  2. reversing the nodes of the regex (optimize it, too)
  3. reversing the match
"How in the world is that simpler?" you ask. "Just watch!" I reply. $last_numbers = scalar reverse( # reverse the match reverse($string) =~ # reverse the input string /(\d+)/ # reverse the regex ); Now, the regex (\d+)(?!\D*\d) can also be written as (\d+)(?=\D*$). If digits aren't followed by another digit, then they are followed by only non-digits, if anything. In the process of reversal, the look-ahead can be eliminated, since the regex would be \D*(\d+) which has a redundant leading "node", \D*. The regex's meaning is not changed by its removal.

We already have re.pm which can pick a regex apart during compilation and during the actual matching stage. This has the ability to pick apart individual nodes of a regular expression.
Note: the term "node" refers to a combination of regex atoms and quantifiers which can match a given substring backwards or forwards. In the regex \d+[abc]def(foo|bar), the nodes are:

This regex, reversed, would be (rab|oof)fed[abc]\d+


I will call reversed regular expressions "sexeger" (the reverse of the string "regexes"). As Damian Conway suggested, "sexeger" follows the rule of "sheep" or "fish" -- it is both in the singular and plural form.

For the mean time, there is no way to automatically generate a sexeger. You have to do it by hand. But the benefits of it are amazing. Maybe they need to be seen to be believed. If so, here is an example.

Applying "Sexeger"

You have a simple programming language, called KC ("kinda crummy"). It has a VERY simple grammar: The string definition is a double-quoted strings with escapes allowed in it, like "foo \"bar\" \\ blat". All programs written in KC can be matched by the following regex: $CODE =~ m{ \A (?: [^<>"]+ # non-string, non-comment | "[^"\\]*(?:\\.[^"\\]*)*" # string | <[^<>]*> # comment )* \z }x; We can write a regex to match an entire file (shown above). Oh, before I go any further, here's the sample program: <this is a sample program> int x = 10; <what a silly grammar> str y = "cool \" beans"; if (len(y) GREATER_THAN x) { <can't use gt and lt symbols... heehee> <empty comment coming up> print y, " is longer than ", x, " characters"; <> <and more stuff to come soon> chop(y,x); print "I sliced 'y' down to ", x, " characters for you"; } Let's say we wanted to extract the LAST comment from this program as well as be sure it was coded properly. Here's my regex. Note: I used the "cut" regex operator (?>...) which matches without allowing a backtrack, to speed up the non-string, non-comment part. ($last_comment) = $CODE =~ m{ \A (?: <[^<>]*> | "[^"\\]*(?:\\.[^"\\]*)*" | (?>[^"<>]*) )* (<[^<>]*>) (?: "[^"\\]*(?:\\.[^"\\]*)*" | (?>[^"<>]*) )* \z }x; I'll also implement a reversed version. The reversed terms for the grammar are: $last = scalar reverse(reverse($CODE) =~ m{ \A (?: "(?:[^"\\]*.\\)*[^"\\]*" | (?>[^"<>]*) )* (>[^<>]*<) (?: "(?:[^"\\]*.\\)*[^"\\]*" | (?>[^"<>]*) | >[^<>]*< )* \z }x); If you look long enough (it shouldn't take too long) you'll see that EACH node of the terms has been reversed properly. Now, I ran a benchmark and it showed that the reverse method was actually faster than the forward version, EVEN with having to reverse the input string AND the value of $1. When I implemented a cached version, where the reverse of $CODE was already stored, it was even faster (obviously). Here are the wonderful results:

Running for at least 15 seconds
Method Hz # iters
forwards894.14/s13591
backwards1173.39/s17859
back+cache1293.76/s19691


This is good. The backwards version runs 31% faster, and the cached version runs 45% faster!

If we already know the KC code is well-formed (that is, it will match the entire-code regex), then we can eliminate a lot of the code in the regexes. First, to match the last comment, forwards: ($last_comment) = $CODE =~ m{ (<[^<>]*>) (?: "[^"\\]*(?:\\.[^"\\]*)*" | (?>[^"<>]*) )* \z }x; And now backwards: $last = scalar reverse(reverse($CODE) =~ m{ \A (?: "(?:[^"\\]*.\\)*[^"\\]*" | (?>[^"<>]*) )* (>[^<>]*<) }x); This benchmark gave me the following data:

Running for at least 15 seconds
Method Hz # iters
forwards1059.53/s15893
backwards2928.95/s44520
back+cache3514.13/s52712


Another tremendous difference. The backwards version is nearly 200% faster, and the cached version is almost 250% faster!

Better Efficiency in Reversal

A problem arises with the reversed code to match quoted expressions: m{ " (?: [^"\\]* .\\ )* [^"\\]* " }x; The problem is specifically that when Perl fails to match .\\, it will backtrack through ALL that was matched by [^"\\]* until it matches. If it doesn't match, then, and only then, does it go outside of the (?:...) subpattern. This is less than optimal.

In my opinion, Perl shouldn't waste that much time, since, for something to be matched by [^"\\] means it can't possibly match B. This is a gripe of mine.

To get around this, we employ the "cut" operator again. The final regex is m{ A (?: (?>[^AB]*) # match non-A's and non-B's WITHOUT BACKTRACKING (?: [AB] # match A or B next... | # ...or... (?<=.) # ...if there was a character before ) B # match B )* # 0 or more times [^AB]* # you know the rest A }x The following is a benchmark of the old reversed regex versus this new one. First, for the entire code search:

Running for at least 15 seconds
Method Hz # iters
back+cache1330.60/s19959
new back+cache1249.13/s18737


Aww. It seems that this new method is slightly slower, probably due to the fact that it has to use a look-behind, which might be expensive. And for the last comment search:

Running for at least 15 seconds
Method Hz # iters
back+cache1264.07/s18691
new back+cache3830.87/s57961


Whoa! 200% speed increase here. The look-behind is not so awful, then, if used sparingly. It's only used here until a comment is found. In the other half of this benchmark, it was used throughout the test contents.

Case Study: C Comments

Here's a very dumbed-down C syntax regex in Perl: m{ ^ (?: " [^"]* " | /\*[^*]*\*+(?:[^/*][^*]*\*+)*/ | /[^*] | [^/"]+ )* ( /\*[^*]*\*+(?:[^/*][^*]*\*+)*/ ) (?: " [^"]* " | /[^*] | [^/"]+ )* $ }x The really messy part is there for matching comments (see "Mastering Regular Expressions", by Jeff Friedl, or Perl FAQ 6). The other parts it matches are VERY simple double-quoted strings, forward slashes that aren't the beginning of a comment, and non-slash and non-quote characters. The problem with this regular expression is that the first part of it matches the entire C program, and then Perl has to backtrack to a comment. This can be slow and tedious (sadly). The equivalent sexeger has been tested to match at over 100 times the speed. This speed increase depends on the length and complexity of the C file being matched, but it is faster in general due to less need to backtrack.

Here is the benchmark and results.

Uses of Sexeger

Here's the fun part -- you get to see how these things really are useful.

Look-behind Assertions

as suggested by Anthony Guselnikov

If you've ever used look-ahead assertions, you know that you can say things like $foo =~ /\d+(?=[a-z]+\?)/ to match digits, that are followed by 1 or more lowercase letters and a question mark, without ACTUALLY matching the digits and question mark. Well, you can't do that with look-behind; you can only look-behind for constant-width assertions. (This may actually be linked to the sexeger idea.) To get around this, use a sexeger! # from the wishful $foo =~ s/(?<=AGE:\d+)\n/ /; # to the possible ($foo = reverse $foo) =~ s/\n(?=\d+:EGA)/ /; $foo = reverse $foo; Someone on PerlMonks said that they'd like a /r modifier to regexes, to do the string and match reversals, but let the USER do the actual regex reversal. It would be cool, I think. The only thing I notice is that this modifier would not be applicable to (?:); it would have to be a regex-wide modifier, since it cannot be determined ahead of time how long a chunk of the string would have to be reversed for the sexeger.

This SPECIFIC example of a look-behind work-around is probably less than optimal, but it sets the stage.

Inserting Commas in Numbers

An example of sexeger "before its time" can be found section 5 of the Perl FAQ. Andrew Johnson submitted a commify() function that reverses its input, does the substitutions, and then reverses the new string: sub commify { my $input = shift; $input = reverse $input; $input =~ s<(\d\d\d)(?=\d)(?!\d*\.)><$1,>g; return reverse $input; # should be scalar, really... } This is an efficient application of sexeger!

Where Do We Go From Here?

These still need to be explored, and this document is a work in progress on my web site (http://www.pobox.com/~japhy/). There are definite places where this technique offers a major speed increase, and places where it is just too much trouble to be worth it. Sexeger is a feature to be used on a per-case basis, and isn't an end-all to regex efficiency problems (not by a long shot).

Try it out, lose a couple of hours of sleep over it (I did), but have fun.