=head1 Regex Arcana =head2 Draft 2 July 1, 4:42 PM =head2 Introduction Perl's regular expressions (hereafter known as "regexes") have come a long way since version 1.0. They're like the eccentric uncle who spends a few years sojourning somewhere exotic and comes back older, wiser, and more worldly. Or perhaps, in Perl's case, more out-of-this-worldly. Perl's regexes aren't what computer science purists would call "regular" -- that idea was thrown out the window when back-references became something you could include in the pattern itself, as shown in this simple dynamic regex: # find any any doubled words my $text = "For score and and seven years ago"; while ($text =~ /\b(\w+)\W+\1\b/g) { print "'$1' repeats itself"; # 'and' repeats itself } This regex is self-referential: the I C<\1> tells the regex engine to match whatever has been captured into $1. This means the regex doesn't know for certain what it will match until run-time. And this is only the tip of the iceberg. =begin comment But Perl has not lost sight of other regex features, such as support for Unicode. During the first few months of the year, I spent a good deal of time bolstering Perl's compatibility with the Unicode Technical Standard for regexes (F). While I was neck-deep in the source, I also added some Unicode features that I felt made Perl's Unicode class syntax more useful. =end comment I'm going to share X two pieces of regex arcana with you -- features of the Perl regex engine that aren't used often, but offer those who can master them great power. =head3 Code Evaluation The code evaluation assertion C<(?{ CODE })> was introduced in Perl 5.005, but it was rough around the edges. In Perl 5.6, it was fine-tuned, and more adjustments were made by 5.8. The code evaluation assertion allows a regex to execute an arbitrary block of Perl code during the course of its pattern matching. This device becomes particularly interesting when one realizes that the backtracking stack is much like a variable scoping stack. =head3 Delayed Execution The delayed execution assertion C<(??{ CODE })> is the heart of dynamic regexes. Using them is like walking across a bridge as you're building it. Their true beauty is revealed when you create a I object (via C) with a delayed execution assertion in it... and inside that assertion is the self-same I object. Then you have created a recursive regex, one that is capable of matching nested data structures or acting like a proper grammar. =begin comment =head3 Unicode Classes If you've used Unicode at all in Perl, you've probably dealt with the C<\p> or C<\P> Unicode Class syntax in your regexes. Well, in Perl 5.8.4, there's more support for Unicode classes, but that's just the start. You can also now build your own hierarchical Unicode classes, and "inherit" from them with ease. =end comment =head2 Getting Up to Speed This article is going to use some regex variables you might not be familiar with, so let me introduce them: =over 4 =item $+ the most recently opened capture group =item $^N the most recently closed capture group =item @- the beginning offsets of $& and capture groups =item @+ the ending offsets of $& and capture groups =back Here is a simple example demonstrating these variables: "perl" =~ /(.(..).)/; # $1 = "perl" # $2 = "er" # $+ = "er" (in this case, $2) # $^N = "perl" (in this case, $1) # @- = (0, 0, 1) # @+ = (4, 4, 3) After a successful pattern match against $str, the pair of values $-[I] and $+[I] hold the offsets in $str of the Ith capture group. $-[0] and $+[0] are the offsets for $&. substr($str, 0, $-[0]); # produces $` substr($str, $-[0], $+[0] - $-[0]); # produces $& substr($str, $+[0]); # produces $` substr($str, $-[1], $+[1] - $-[1]); # produces $1 substr($str, $-[2], $+[2] - $-[2]); # produces $2 (etc.) You are probably aware that $& and its friends are bad news. They cause reductions in the speed of your programs, so use them only when debugging. If you use them once in your code, you might as well use them as many times as you can in that code, because you've already suffered the hit. =head2 Using C<(?{ ... })> Our first incantation is I. It will execute the code inside when it is encountered, and then continue with the regex; they always succeed, unless their contents interrupt the execution of your program. Let's look at some applications of this assertion. =head3 Inspecting a Regex's State It has been said that "you can't observe the behavior of a system without affecting the system's behavior". Not so, in Perl 5.8. The following regex uses code assertions to show how a particular optimization in the engine works. # the /x modifier allows embedded whitespace # to help improve readability and clarity "salad" =~ m{ .* (?{ print "[$&] " }) a .* (?{ print "($&) " }) a }x; # output: # [sal] [s] (sal) We often see backtracking as a character-at-a-time procedure, but it's really much more efficient than that. In this case, something like C<.*a> makes the engine jump from one 'a' to the next when matching or backtracking. The code assertion doesn't get in the way of this, even though it's in between the C<.*> and C. (Sadly, in Perl 5.6, it did get in the way. Try that code in 5.6, and you'll see it stops the optimization from occurring.) You have access to the regex variables inside a code assertion; my example above uses $&, although it's general practice to avoid that variable. You can also access $1 and other digit variables, $^N, $+, and the arrays @- and @+. One caveat, though, is that you can't access a capture group until it has been completed; that is, the closing parenthesis for $1 must be passed before you can see the contents of $1 in a code assertion, otherwise you'll be seeing its previous value. This code won't print anything inside the quotes, because $1 hasn't been closed before it's printed: "perl" =~ m{ ( (?: . (?{ print "'$1'\n" }) )+ ) }x; In order to inspect the digit variables as they're being created, you need to be cunning: create a variable inside a code evaluation immediately prior to the capture group that stores the current location in the string. Then, use C on $_, which is a copy of the string you're matching against inside a code evaluation: "perl" =~ m{ (?{ $p = $-[0] }) ( (?: . (?{ print substr($_, $p, $+[0] - $p), "\n" }) )* ) }x; That $_ trick is undocumented as far as I can tell, which means it might not stay that way in the future. (But considering Perl 6 regexes will be radically different, we should have fun while we can.) In this way, code evaluation assertions are good debugging tools, because you can use them without interfering with the execution of the regex, and you can get helpful diagnostic information, like where you are in the string, what you've captured, etc. =head3 Capturing Repetitions I've often been asked why a regex like C only returns the last word character captured: "japhy" =~ /(\w)+/ and print "<$1>"; # 'y' The reason is because the regex is continually storing what C<\w> matches to $1, and each time, it's overwriting the old contents. It won't create $2, $3, etc., and it doesn't create a @1 array. So the question remains: how can I keep track of repeated capture groups? The answer works on the principle that the matching and backtracking in a regex is very similar to a variable scoping stack, if you use C on your variables inside code evaluations. C works differently in a regex; it gives a variable a value that remains until that point is backtracked past. Let's say we wanted to count how many characters we matched before the last 'r' in a string. Here's code that doesn't use C: # not local()ized "perl" =~ m{ (?{ $x = 0 }) (?: \w (?{ ++$x }) )* r }x and print "x = $x\n"; It prints 4 as the value of $x, even though there are only 2. This time, we'll use C on a temporary variable, and assign its value to $x only once the regex has succeeded: # local()ized "perl" =~ m{ (?{ local $_x = 0 }) (?: \w (?{ local $_x = $_x + 1 }) )* r (?{ $x = $_x }) }x and print "x = $x\n"; This time the value is reported correctly as 2. The first code prints 4 because when the backtracking occurs, $x's value isn't rewound or reverted to its previous one. The second code prints 2 for the opposite reason: because of the use of C, when backtracking occurs, the scoped variable $_x reverts to its previous value. Think of it as your regex being a recursive function. Although it requires a bit more coding on your part, this technique can be extended to let you keep track of capturing groups that have quantifiers on them: "age=22 name=jeff state=NJ" =~ m{ (?{ local (@_1, @_2) }) (?: ([^=]+) = (\S+) \s* (?{ local @_1 = (@_1, $1); local @_2 = (@_2, $2); }) )* (?{ (@1 = @_1), (@2 = @_2) }) }x; The code above produces two arrays, @1 and @2, whose contents are "age", "name", "state", and "22", "jeff", "NJ". If you're wondering why I didn't just write C, it's because I have to make a new locally scoped array, and C wouldn't do that. You I use C every time you need a variable to revert to its previous value when it's backtracked past. =head3 Using $^R to Simplify Things There's another regex variable, $^R, which stores the value of the last C<(?{ ... })> assertion. Perl is nice and makes $^R scope properly, so if your regex backtracks, $^R will revert to its previous value. This means we can write a much simpler looking regex: "perl" =~ m{ (?{ 0 }) # sets $^R to 0 (?: \w (?{ $^R + 1 }) # sets $^R to $^R + 1 )* r (?{ $x = $^R }) # sets $x to $^R }x and print "x = $x\n"; Here we don't need to worry about localizing our variables, because $^R is properly dealt with by Perl. You need to be careful about how you modify it, though. We could not have written C<(?{ $^R++ })> or C<(?{ ++$^R })> in our regex, because both of those I to $^R. You usually don't want to assign to $^R I; just make the last value in your code assertion be the value you want $^R to have. Perl builds up a stack of the values returned by code evaluations, and rolls that stack back when backtracking occurs. Setting $^R explicitly interrupts that stacking procedure. We can use $^R with our example that creates @1 and @2 as well. We have to be careful in how our code assertions work: "age=22 name=jeff state=NJ" =~ m{ # @1 @2 (?{ [ [], [] ] }) (?: ([^=]+) = (\S+) \s* (?{ [ [ @{$^R->[0]}, $1 ], [ @{$^R->[1]}, $2 ], ] }) )* (?{ @1 = @{ shift @{$^R} }; @2 = @{ shift @{$^R} }; }) }x; You cannot rely on what value $^R will have outside of your regex, so you should always get the data from it at the very end of your regex, once you are sure it has succeeded. =head3 If-Then Assertions You can also use code assertions as the condition to an if-then assertion. If you're not familiar with them, they look like C<(?(I)I|I)>. If the I part evaluates to a true value, then the I pattern is matched; otherwise the I pattern is matched. If the selected pattern fails to match, the if-then assertion fails altogether. The I part can be a look-ahead, a look-behind, a number referring to a capture group, or a code assertion. If it is a code assertion, $^R does I get its value changed. Here's a regex that matches $_ if it is an integer less than 100. my $less_than_100 = qr{ ^ ( -?\d+ ) $ # match the entire number (?(?{ $1 >= 100 }) # if $1 >= 100... (?!) # then fail ) # otherwise it succeeds }x; if ($n =~ $less_than_100) { # $n is less than 100 } =head2 Using C<(??{ ... })> Our second gimmick is the I. It acts very much like a code evaluation assertion, but the return value is returned to the regex as something to be matched. Consider our first example, which finds words that are repeated; here's a way to write it using a delayed execution assertion: while ($text =~ /\b(\w+)\W+(??{ $1 })\b/g) { print "'$1' is repeated\n"; } Because $1 only contains word characters, I didn't need to worry about any regex-specific characters being interpolated incorrectly. However, if we change from "words" to "non-whitespace chunks", we'll need to be safer: while ($text =~ m{ (? in this regex, it's because I'm not looking for I boundaries, so I can't count on it working properly. But that's a very silly use of a very powerful feature. It's not very economically sound to use your car to drive one city block, and this assertion is an SUV. So let's go off-roading. =head3 Self-Constructing Regexes One of the first uses I found for this assertion was to match a chunk of unique characters in a string. Take, for example, "the quick brown fox jumped over the lazy dog". If I want to grab the first chunk of unique characters, I use an assertion that builds a character class based on the characters it has already matched: my $unique = qr{ . # match any character (??{ # evaluate and match... "[^\Q$&\E]" # a self-modifying char class })* # zero or more times }sx; my ($uniq_str) = $string =~ /($unique)/; It's important to realize that we're not matching the I character class over and over again. The delayed execution assertion is being re-executed I time. We can see what the character class looks like if we want: my $unique = qr{ . (??{ print my $cc = "[^\Q$&\E]"; $cc; })* }sx; my ($uniq_str) = $string =~ /($unique)/; On "the quick brown...", this gives the output: [^t] [^th] [^the] [^the\ ] [^the\ q] [^the\ qu] [^the\ qui] [^the\ quic] [^the\ quick] It stops when it reaches the second occurrence of a space, because the character class fails to match something it hasn't seen before. If we wanted to ignore spaces, we could simply add the ability to match any whitespace characters before we try matching a new character: qr{ . (?: \s+ | (??{ "[^\Q$&\E]" }) )* }xs This time we would match "the quick brown f". =head3 Dynamic Quantifiers You can use this assertion to make a quantifier based on text you've matched. If you have a string like "2aaaa 6a 3aaa 5 1a", and you want to match a number if it is followed by exactly that many a's, you capture the number and then make use a delayed execution assertion which uses that number as its quantifier: @matches = $data =~ m{ \b (\d+) (??{ "a{$1}" }) (?! a ) }gx; You could be more efficient and use C<"a" x $1> instead of C<"a{$1}">, but the example can be expanded to match a lower and upper limit: # match X,Y, then match between X and Y a's $data = "1,3aa 2,5aaaaaaa 2,7aaaa"; @matches = $data =~ m{ \b (\d+) , (\d+) (??{ "a{$1,$2}" }) (?! a ) }xg; Of course, you'd have to make sure $1 isn't greater than $2, but that's left as an exercise to the reader. Let's look back to our "is X less than 100?" example. We can re-write this with C<(??{ ... })> instead: if ($num =~ m{ ^ (-?\d+) $ (??{ $1 >= 100 and '(?!)' }) }x) { # $1 is less than 100 } Here, if $1 is equal to or greater than 100, we insert C<(?!)> into the regex, an empty negative look-ahead, which always fails. If $1 is less than 100, than our C<< >= >> operation returns "" as its false value, and the regex succeeds. =head3 Recursive Regexes More often than not, you'll see these assertions used to create recursive regexes, ones that can match arbitrarily deeply-nested parentheses, etc. To get this to work, you create a F object, and then put that object inside a delayed execution assertion: # if you're using a lexical, it MUST be # declared before it is assigned to my $px; $px = qr{ (?: (?> [^\\()]+ | \\. ) | \( (??{ $px }) \) )* }xs; Now we can make sure a string is properly ()-balanced: my $good = 'a + (b / (c - 2)) * (d ^ (e+f))'; my $bad1 = 'a + (b / (c - 2) * (d ^ (e+f))'; my $bad2 = 'a + (b / (c - 2)) * (d) ^ (e+f))'; print "not " if $good !~ /^$px$/; print "ok\n"; print "not " if $bad1 !~ /^$px$/; print "ok\n"; print "not " if $bad2 !~ /^$px$/; print "ok\n"; The first is reported ok, the second and third are reported as not ok (which is what we would expect). Using this recursive mechanism, we can create a regex that matches a I. A palindrome is a string that reads the same forwards and backwards, like "toot", or "madam, I'm adam". Punctuation and case are ignored. my $pdrome; $pdrome = qr{ (\w) \W* (??{ $pdrome })? \W* \1 }xi; This matches "toot", but not "rotor". Can you see why? There is a boundary case: if the string has an odd number of letters, then the turning point letter won't be repeated. Let's add support for that: my $pdrome; $pdrome = qr{ (\w) \W* (??{ $pdrome })? \W* \1 | \w }xi; Now we can match "Sit on a potato pan, Otis". =begin comment =head2 Using Unicode =head3 New Class Syntax for C<\p> and C<\P> =head3 Creating Your Own Classes =end comment =head1 The Future: Perl 6 So those are the two work-horses of Perl's eccentric regex implementation. In Perl 6 (F), these features won't disappear. In fact, the code evaluation assertion will probably become more and more prevalent. Here's a quick taste of how Perl 6 will make these assertions easier to write and simpler to understand. =head2 Code Evaluation In Perl 6, regex metacharacters will change meaning to make things clearer. Code evaluation assertions are now just enclosed by curly braces. Our regex earlier that matches a number that is less than 100 could be written in Perl 6 like this: / ^ (\d+) $ { $1 < 100 or fail } / The first thing you should notice is that, with all that whitespace, there is no C modifier on my regex. That's because it will be the default. Next, you can see the code evaluation is C<{ code }>, which looks like a closure (because it is). Finally, even though this is merely a code evaluation, it can affect the regex by calling C. =head2 Delayed Execution But the syntax C<{ COND or fail }> has a shorthand in Perl 6, by way of the new look of the delayed execution assertion, C<< <...> >>. You'll be able to use C<< <$rx> >> to get the same affect of C<(??{ $rx })>. A special case of this assertion is C<< <(...)> >>, which regards its internals as an expression that causes the match to fail at its current point if it evaluates to false: / ^ (\d+) $ <( $1 < 100 )> / =head2 Backtracking Control Finally, a note on efficiency. Our regex here is currently less efficient than it should be. After a number like "300" fails, Perl backtracks the C<\d+> to match "30", which fails because "30" is followed by "0", not the end of the string. But Perl still does this needless backtracking. In Perl 5, we would use C<< (?> (\d+) ) >>, the "cut" assertion, which prevents internal backtracking. In Perl 6, you'll be able to say it more succinctly with a colon (or two, or three!). In our case, we'll want two colons, which says that this set of alternatives is all to fail if we have to backtrack past the C<::>. Since we our regex is has no alternatives in it, it means this match will fail if the C<::> is backtracked past. / ^ (\d+) $ :: <( $1 < 100 )> / We can get even sleeker -- if we use C<< >> instead of C<::>, the entire I will fail, not just the match at the current location. This will eliminate our need for anchors: / (\d+) <( $1 < 100 )> / =head1 Food for Thought With your new knowledge, you should try tackling these two tasks. First, construct a regex that determines the I sequence of unique characters in a string. Now, decipher the goal of this regex. It uses both kinds of code assertions, the if-then assertion, I the $^R variable. my %data = $str =~ m{ (?{ [ "", "", "" ] }) ( [^\s=]+ ) \s* = \s* (?(?=["']) (?{ my $c = substr($_, $+[0], 1); [ $c, "[^\Q$c\E]*", $c ] }) | (?{ [ "", '\S+', "" ] }) ) (??{ $^R->[0] }) ((??{ $^R->[1] })) (??{ $^R->[2] }) }xg; To see its dissection, come to F. =head1 Author Jeff C Pinyan has been so busy improving Perl's Unicode support, writing F, working in Princeton, NJ, and spending his spare time with his family and his fiancEe, he totally forgot about solving world hunger. If you need to chastise him, you can email him at F.