Sex, Eger!
or
Reverse Regular Expressions
- What is a Reversed Regular Expression?
- Applying "Sexeger"
- Better Efficiency in Reversal
- Case Study: C Comments
- Uses of Sexeger
- Look-behind Assertions
- Inserting Commas in Numbers
- Where Do We Go From Here?
Update! sheriff_p from #EFnet::perl will be giving
a Lightning Talk on this paper in my stead at
YAPC::Europe!
If we're able to look at the specific nodes of a regular
expression, then we should be able to create a regex that
works from the BACK of a string to the FRONT. What do I
mean?
($last_numbers) = $string =~ /(\d+)(?!\D*\d)/;
That regex finds the last occurrence of digits in $string.
Now, it's not all that bad, but it could be made a great
deal simpler by:
- reversing the input string
- reversing the nodes of the regex (optimize it, too)
- reversing the match
"How in the world is that simpler?" you ask. "Just watch!"
I reply.
$last_numbers = scalar reverse( # reverse the match
reverse($string) =~ # reverse the input string
/(\d+)/ # reverse the regex
);
Now, the regex (\d+)(?!\D*\d) can also be written
as (\d+)(?=\D*$). If digits aren't followed by
another digit, then they are followed by only non-digits,
if anything. In the process of reversal, the look-ahead
can be eliminated, since the regex would be \D*(\d+)
which has a redundant leading "node", \D*. The
regex's meaning is not changed by its removal.
We already have re.pm which can pick a regex apart
during compilation and during the actual matching stage.
This has the ability to pick apart individual nodes of a
regular expression.
Note: the term "node" refers to a combination of regex atoms
and quantifiers which can match a given substring backwards
or forwards. In the regex \d+[abc]def(foo|bar),
the nodes are:
- \d+
- [abc]
- d
- e
- f
- f
- oo
- b
- a
- r
This regex, reversed, would be (rab|oof)fed[abc]\d+
I will call reversed regular expressions "sexeger" (the reverse of the
string "regexes"). As Damian Conway suggested, "sexeger" follows the rule
of "sheep" or "fish" -- it is both in the singular and plural form.
For the mean time, there is no way to automatically generate a sexeger.
You have to do it by hand. But the benefits of it are amazing. Maybe
they need to be seen to be believed. If so, here is an example.
You have a simple programming language, called KC ("kinda
crummy"). It has a VERY simple grammar:
- comments are matched by <[^<>]*>
- strings are matched by "[^"\\]*(?:\\.[^"\\]*)*"
- everything else is matched by [^<>"]+
The string definition is a double-quoted strings with escapes
allowed in it, like "foo \"bar\" \\ blat". All programs
written in KC can be matched by the following regex:
$CODE =~ m{
\A
(?:
[^<>"]+ # non-string, non-comment
|
"[^"\\]*(?:\\.[^"\\]*)*" # string
|
<[^<>]*> # comment
)*
\z
}x;
We can write a regex to match an entire file (shown above).
Oh, before I go any further, here's the sample program:
int x = 10;
str y = "cool \" beans";
if (len(y) GREATER_THAN x) {
print y, " is longer than ", x, " characters";
<>
chop(y,x);
print "I sliced 'y' down to ", x, " characters for you";
}
Let's say we wanted to extract the LAST comment from this
program as well as be sure it was coded properly. Here's
my regex. Note: I used the "cut" regex operator
(?>...) which matches without allowing a backtrack,
to speed up the non-string, non-comment part.
($last_comment) = $CODE =~ m{
\A
(?: <[^<>]*> | "[^"\\]*(?:\\.[^"\\]*)*" | (?>[^"<>]*) )*
(<[^<>]*>)
(?: "[^"\\]*(?:\\.[^"\\]*)*" | (?>[^"<>]*) )*
\z
}x;
I'll also implement a reversed version. The reversed terms
for the grammar are:
- comments are matched by >[^<>]*<
- strings are matched by "(?:[^"\\]*.\\)*[^"\\]*"
- everything else is matched by [^<>"]+
$last = scalar reverse(reverse($CODE) =~ m{
\A
(?: "(?:[^"\\]*.\\)*[^"\\]*" | (?>[^"<>]*) )*
(>[^<>]*<)
(?: "(?:[^"\\]*.\\)*[^"\\]*" | (?>[^"<>]*) | >[^<>]*< )*
\z
}x);
If you look long enough (it shouldn't take too long) you'll
see that EACH node of the terms has been reversed properly.
Now, I ran a benchmark
and it showed that the reverse
method was actually faster than the forward version, EVEN
with having to reverse the input string AND the value of
$1. When I implemented a cached version, where the
reverse of $CODE was already stored, it was even
faster (obviously). Here are the wonderful results:
| Running for at least 15 seconds |
| Method |
Hz |
# iters |
| forwards | 894.14/s | 13591 |
| backwards | 1173.39/s | 17859 |
| back+cache | 1293.76/s | 19691 |
This is good. The backwards version runs 31% faster, and the
cached version runs 45% faster!
If we already know the KC code is well-formed (that is, it
will match the entire-code regex), then we can eliminate a
lot of the code in the regexes. First, to match the last
comment, forwards:
($last_comment) = $CODE =~ m{
(<[^<>]*>)
(?: "[^"\\]*(?:\\.[^"\\]*)*" | (?>[^"<>]*) )*
\z
}x;
And now backwards:
$last = scalar reverse(reverse($CODE) =~ m{
\A
(?: "(?:[^"\\]*.\\)*[^"\\]*" | (?>[^"<>]*) )*
(>[^<>]*<)
}x);
This benchmark gave me the following data:
| Running for at least 15 seconds |
| Method |
Hz |
# iters |
| forwards | 1059.53/s | 15893 |
| backwards | 2928.95/s | 44520 |
| back+cache | 3514.13/s | 52712 |
Another tremendous difference. The backwards version is nearly
200% faster, and the cached version is almost 250% faster!
A problem arises with the reversed code to match quoted expressions:
m{
"
(?: [^"\\]* .\\ )*
[^"\\]*
"
}x;
The problem is specifically that when Perl fails to match .\\, it
will backtrack through ALL that was matched by [^"\\]* until it
matches. If it doesn't match, then, and only then, does it go outside of
the (?:...) subpattern. This is less than optimal.
In my opinion, Perl shouldn't waste that much time, since, for something to
be matched by [^"\\] means it can't possibly match B. This
is a gripe of mine.
To get around this, we employ the "cut" operator again. The final regex is
m{
A
(?:
(?>[^AB]*) # match non-A's and non-B's WITHOUT BACKTRACKING
(?:
[AB] # match A or B next...
| # ...or...
(?<=.) # ...if there was a character before
)
B # match B
)* # 0 or more times
[^AB]* # you know the rest
A
}x
The following is a benchmark of the old reversed
regex versus this new one. First, for the entire code search:
| Running for at least 15 seconds |
| Method |
Hz |
# iters |
| back+cache | 1330.60/s | 19959 |
| new back+cache | 1249.13/s | 18737 |
Aww. It seems that this new method is slightly slower, probably due to the
fact that it has to use a look-behind, which might be expensive. And for the
last comment search:
| Running for at least 15 seconds |
| Method |
Hz |
# iters |
| back+cache | 1264.07/s | 18691 |
| new back+cache | 3830.87/s | 57961 |
Whoa! 200% speed increase here. The look-behind is not so awful, then, if
used sparingly. It's only used here until a comment is found. In the other
half of this benchmark, it was used throughout the test contents.
Here's a very dumbed-down C syntax regex in Perl:
m{
^
(?: " [^"]* " | /\*[^*]*\*+(?:[^/*][^*]*\*+)*/ | /[^*] | [^/"]+ )*
( /\*[^*]*\*+(?:[^/*][^*]*\*+)*/ )
(?: " [^"]* " | /[^*] | [^/"]+ )*
$
}x
The really messy part is there for matching comments (see "Mastering Regular
Expressions", by Jeff Friedl, or Perl FAQ 6). The other parts it matches are
VERY simple double-quoted strings, forward slashes that aren't the beginning
of a comment, and non-slash and non-quote characters. The problem with this
regular expression is that the first part of it matches the entire C
program, and then Perl has to backtrack to a comment. This can be slow and
tedious (sadly). The equivalent sexeger has been tested to match at over 100
times the speed. This speed increase depends on the length and complexity of
the C file being matched, but it is faster in general due to less need to
backtrack.
Here is the benchmark and results.
Here's the fun part -- you get to see how these things really are useful.
as suggested by Anthony Guselnikov
If you've ever used look-ahead assertions, you know that you can say things
like $foo =~ /\d+(?=[a-z]+\?)/ to match digits, that are followed
by 1 or more lowercase letters and a question mark, without ACTUALLY matching
the digits and question mark. Well, you can't do that with look-behind; you
can only look-behind for constant-width assertions. (This may actually be
linked to the sexeger idea.) To get around this, use a sexeger!
# from the wishful
$foo =~ s/(?<=AGE:\d+)\n/ /;
# to the possible
($foo = reverse $foo) =~ s/\n(?=\d+:EGA)/ /;
$foo = reverse $foo;
Someone on PerlMonks said that they'd
like a /r modifier to regexes, to do the string and match reversals, but let
the USER do the actual regex reversal. It would be cool, I think. The only
thing I notice is that this modifier would not be applicable to (?:);
it would have to be a regex-wide modifier, since it cannot be determined
ahead of time how long a chunk of the string would have to be reversed for the
sexeger.
This SPECIFIC example of a look-behind work-around is probably less than
optimal, but it sets the stage.
An example of sexeger "before its time" can be found section 5 of the Perl
FAQ. Andrew Johnson submitted a commify() function that reverses
its input, does the substitutions, and then reverses the new string:
sub commify {
my $input = shift;
$input = reverse $input;
$input =~ s<(\d\d\d)(?=\d)(?!\d*\.)><$1,>g;
return reverse $input; # should be scalar, really...
}
This is an efficient application of sexeger!
These still need to be explored, and this document is a work in progress on
my web site
(http://www.pobox.com/~japhy/).
There are definite places where this technique offers a major speed increase,
and places where it is just too much trouble to be worth it. Sexeger is a
feature to be used on a per-case basis, and isn't an end-all to regex
efficiency problems (not by a long shot).
Try it out, lose a couple of hours of sleep over it (I did), but have fun.