Regular Expressions, A Favorite Parsetime

Data from Hell

To the data-processing programmers among us, the terms ``CSV'', ``delimiter'', ``munge'', and ``massage'' are common and well-known. But the truly fortunate of these data-processors know the term ``regular expression'', which I lovingly (if convenience is grounds for love) abbreviate as ``regex''. A regex is a programming language unto itself -- it has input (some string to execute upon), the source (the physical regex), and output (stating whether the regex matched, and perhaps what text was matched).

Many languages offer regex support... some better than others. But no matter the support, regexes offer us a hyper-reductionalist approach to matching text that conforms to some pattern. That's what the language of regexes is for -- they apply a pattern to a piece of data and see what results. Regexes are a big function: ``does this data conform to this rule?'' The benefit of regexes is that they are small and concise; anyone who has tried parsing text files with C knows the effort involved: countless lines of code, special cases, loops, switch statements, backtracking to correct errors, keeping track of pointers to strings... it's a headache. Worse, it's a migraine.

Regexes are a headache, but they're a much less severe kind. Regexes are a headache because they look like gibberish, but they get the job done with much less effort. Consider this seemingly simple task:

Parse a string whose fields are separated by commas. However, commas can appear in quoted strings where they are not seen as field separators. A quote can be embedded in a quoted string by doubling it.

That's one possible definition of a list of comma-separated values, or CSV for short. (It's important not to confuse CSV, which is a common data storage method, with CVS, the drugstore down the street. You don't need me to tell you hard it is to match CVS with a regex.) The task of parsing CSV with a language like C is not simple, because of the special cases. If you use a regex (well, two, actually) you can get the job done quickly and easily. The most important feature of regexes is that once you know how to write them, you can write them in a mere fraction of the time it would take to create the huge parsing engine that they stand for.

  # Perl code for parsing CSV (as per the above definition)
  $string = 'this,is,"good ""Perl"" code",for,"parsing CSV"';
  @fields = $string =~ /(?:^|,)("(?:[^"]+|"")*"|[^,]*)/g;
  for (@fields) {
    if (s/^"//) { s/"$//; s/""/"/g; }
  }
  # @fields =
  #   ('this', 'is', 'good "Perl" code', 'for', 'parsing CSV')

If the code above might seem confusing, it's only because you haven't been initiated into the Order of the Regex. The Rites and Rituals to follow will help you enter into this mystical society. If you wish to parse your data in an easier way, and spread understanding of the Great Spencerian Language, then by all means, read on.

Say it with me now: /^([^aeiou])[aeiou]\1\1$/

A regex is a program. A really tiny program. Its ``functions'' are one to three characters in length. Add to that the provision that alphanumeric characters are (more or less) off-limits, and you have ``line noise in a can''. The task of understanding a regex requires you to know what symbols mean what, and where they mean it.

By the way, the regex in the header for this section matches strings like ``sass'', ``5e55'', ``lull'', and ``#o##''.

Pre-basic basics

In Perl, a regex is commonly bounded by the / characters: m/pattern/ denotes a pattern match. The m is optional, and I'll be omitting it. To match a pattern against a string, you use the =~ operator:

  if ($text =~ /pattern/) {
    # the text was matched by the pattern
  }

To invert the test (meaning, to see if the text is not matched by the pattern) you can use the !~ operator.

If you are evaluating a regex on $_, you can leave off the $_ =~ part of the expression:

  if ($_ =~ /pattern/) { ... }
  # is synonymous to
  if (/pattern/) { ... }

Normal characters

The simplest symbols in a regex are plain-text characters; the alphanumerics and the underscore. These match themselves, plain and simple. The regex /jeff/ will try to find the string ``jeff'' in whatever input it is given; the regex engine looks for a 'j' followed by an 'e' followed by an 'f' followed by another 'f'. If that can be done, the pattern has matched successfully; if it cannot be done, the pattern has failed.

  print "Enter your name: ";
  chomp($name = <STDIN>);
  if ($name =~ /J/) {
    print "Your name has a capital J in it!\n";
  }

That simple program will print an obvious statement when the user enters text like ``Jeff'' or ``James'' or ``Abdul Jamal''; no message will be given for text like ``David'' or ``Becky'' or ``jar jar binks''. (No self-respecting program would ever warrant ``jar jar binks'' with a response.) As you can tell, regexes are case sensitive -- the pattern /J/ matches a capital ``J'' and not a lowercase ``j''.

Metacharacters

Now we will see the mystic runes that mean more than they seem to be; they are called metacharacters. There are generally 13 of them:

  | [ . * + ? { } \ ^ $ ( )

The \ metacharacter is the easiest to explain. Use it to match any of the actual metacharacters; \* matches a literal ``*'', and \\ matches a literal ``\''.

Decisions, decisions, decisions

It'd be a real shame to have to use more than one regex to see if the name entered has a ``J'' or a ``j'' in it.

  # find a uppercase or lowercase "j"
  if ($name =~ /J/ or $name =~ /j/) { ... }

It's important to have a means of specifying options for a regex to take in the course of matching text. That is why alternation exists, by way of the | character (vertical-bar, pipe, whatever you want to call it). | is read as ``or'' to a regex. /J|j/ means ``match a 'J' or a 'j'''. You can use more than one alternation in a regex:

  # find a uppercase or lowercase "j"
  if ($name =~ /J|j/) { ... }
  if ($name =~ /Jeff|James|Abdul Jamal/) {
    print "Excellent choice for a name, sir.\n";
  }

It's important to see how ``weak'' the | metacharacter is. The regex /Jeff|James/ does not mean ``match Jeffames or JefJames'' -- the | isn't a choice between the ``f'' or the ``J''; rather, it is a choice between everything else, a choice between the ``Jeff'' or the ``James''.

If you do want to have a choice of one of a set of characters, there is another option: character classes. These are sets of characters found between [ and ]. As an example, the character class [aeiou] matches any of those five characters -- lowercase vowels. Ranges of characters can be abbreviated with a hyphen, so [a-z] is shorthand for ``all the lowercase consonants''. A class like [A-Za-z0-9] matches any alphabetic character or digit. Matching a literal ``['', ``]'', or ``-'' in a character class usually involves backslashing it -- putting a backslash (``\'') in front of it. The class [\[\]] matches a ``['' or a ``]''.

  # find a uppercase or lowercase "j"
  if ($name =~ /[Jj]/) { ... }
  # matches "Jeffames" or "JefJames"
  if ($name =~ /Jef[fJ]ames/) { ... }

You can invert the sense of a character class by placing ^ as the first character in it. This means ``not any of ...''; the class [^aeiou] would match any character (not just any letter, mind you) except ``a'', ``e'', ``i'', ``o'', or ``u''.

  # matches, Jaff, Jiff, J!ff, JJff, etc.
  if ($name =~ /J[^e]ff/) {
    print "What a bizarre name you have.\n";
  }

Matching anything (without writing everything)

Every now and then, you'll want your regex to match a character -- any character, really. Well, any character except newline. Anyway, Perl provides the . metacharacter as a shortcut for the character class [^\n]. It will match any non-newline character.

  # "J", something, "f", "f"
  if ($name =~ /J.ff/) { ... }

Repeating yourself (without repeating yourself)

It would be an even greater shame if you couldn't write a regex to match four letters in a row without having to say /[a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z]/. That is why quantifiers exist -- they allow you to express repetition of a symbol in your regex. /[a-zA-Z]{4}/ matches four letters in a row.

There are four quantifiers: *, which matches zero or more; +, which matches one or more; ?, which matches zero or one; and {n,m}, which is used for more specific ranges. Quantifiers proceed the item the quantify -- /a*/ matches zero or more ``a''s. They bind very tightly (as compared to | which is very weak); /ab*c/ matches an ``a'' followed by zero or more ``b''s followed by a ``c''.

  1. The * quantifier often causes confusion to beginners -- it can successfully match by matching zero times. That means that "Jeff" =~ /a*/ returns true; ``Jeff'' contains at least zero consecutive ``a''s (there are zero ``a''s at the front of the string, right before the ``J''). Even more surprising, though, is that "Barry" =~ /a*/ returns true, but only because there are zero ``a''s at the front of that string! Since /a*/ has the ability to match zero times, it'll do it as soon as possible, which means as early in the string as possible. The regex /a*r+/, on the other hand, matches successfully at the ``arr'' in ``Barry''. But beware, /a*r*/ would match at the beginning of both strings -- this is the trickery of the * quantifier.

  2. The + quantifier is probably less of a nuisance, because it requires at least one occurrence. "Jeff" =~ /a+/ fails, because there is not at least one consecutive ``a'' in ``Jeff''. "Barry" =~ /a+/ succeeds as one would expect.

  3. The ? quantifier allows something to be matched, but optionally. To match both ``boat'' and ``bat'', for example, you can use the regex /bo?at/.

  4. The {n,m} quantifier is certainly the most versatile -- it comes in three forms. First, {n} matches exactly n times; /jef{2}/ matches ``jeff''. Second, {n,} matches at least n times; /jef{2,}/ matches ``jeff'', ``jefff'', ``jeffffff'', etc. Third, {n,m} matches at least n times, and no more than m times; /jef{2,4}/ matches ``jeff'', ``jefff'', and ``jefff''.

An interesting thing to note about the { and } metacharacters is that they're only metacharacters when they themselves match a certain pattern! The regex /{a}/ is not a syntax error -- it matches the text ``{a}''. In the same way, /x{1,2,3}/ matches ``x{1,2,3}''. Perhaps { and } are really metametacharacters.

Grouping things together

Quantifiers are nifty, and so is alternation, but we have a bit of a problem. How can we express ``match 'Jeffrey' or 'Jeff' or 'James''' without writing it all out? Sure, /Jeffrey|Jeff|James/ works, but we really want to say that the ``rey'' part is optional. We can't just put a ? after the ``y'', because then only the ``y'' is optional; and we can't put a ? after the ``r'', ``e'', and ``y'', because then they are only separately optional. We don't want to match ``Jeffry'' or ``Jeffe''.

For this reason, grouping parentheses allow us to partition off a set of symbols as a sub-pattern, to be treated as one big symbol. This allows us to place a quantifier on a group of symbols: /Jeff(rey)?|James/. We could even take this a step further and do /J(eff(rey)?|ames)/.

Weighing anchor

In our early example about looking for the letter ``J'' in a string, we were matching the ``J'' anywhere in the string -- n'importe où, as the French would say (if they say that sort of thing). But what if we wanted to ensure the string started with the letter ``J''? Or ended with the letter ``f''? For that, we would need to use anchors.

Just like a real anchor does for a ship, regex anchors ensure the location of a where a regex can match in a string. The two anchors you'll see most often are ^ and $, which match at the beginning and end of a string (respectively). The regex /^J/ will match any string that starts with ``J'', such as ``Jeff'' or ``James'', but not ``Abdul Jamal''. Likewise, /f$/ matches any string that ends with ``f'', such as ``Jeff'' or ``Biff'', but not ``Jeffrey'' or ``Buffy''.

Both anchors can be used together -- I can use /^Jeff$/ to make sure the string entered was ``Jeff''.

Caveat regexor! The $ anchor means a bit more than just ``end of string''; it also matches right before a newline at the end of a string. That means that both ``Jeff'' and ``Jeff\n'' are matched by /^Jeff$/, but not ``Jeff\n\n''.

More shortcuts

Perl's regex syntax offers convenient shortcuts for common character classes. Digits, matched by [0-9], can also be matched by \d. Word characters, which include alphanumerics, digits, and the underscore (``_''), can be matched by \w. This has a hidden convenience, in that characters in your local alphabet (such as ù in France) are matched by \w, while they are not matched by [a-z]. All forms of whitespace, \n (newline), \t (tab), \r (carriage return), \f (form feed), and a literal space, are matched by the \s shortcut.

  # see if a phone number looks good (no area code, though)
  if ($phone_number =~ /^\d{3}-\d{4}$/) { ... }
  # a phone number with area code!
  if ($phone_number =~ /^\d{3}-\d{3}-\d{4}$/) { ... }
  # a phone number with area code (another way)
  if ($phone_number =~ /^(\d{3}-){2}\d{4}$/) { ... }

These shortcuts can all be inverted (as one would invert a character class with a ^) by capitalizing the letter; \D matches non-digits, and so on.

Back-references and return values

The parentheses used to group sections of a regex do more than just that, they also capture the text matched by the sub-pattern. This text is stored in variables $1, $2, $3, etc., where each variable $N represents the Nth set of parentheses.

  if ($phone_number =~ /^(\d{3})-(\d{3})-(\d{4})$/) {
    print "Area code: $1\n";
    print "Exchange: $2\n";
    print "The rest: $3\n";
  }

Each successfuly regex sets these variables; a failed regex doesn't change their values.

  "Jeff"  =~ /\w(\w)/;  # sets $1 to "e"
  "japhy" =~ /\d(\d)/;  # $1 stays "e"

In addition to the $1 variable, the contents of what a parenthesized group has matched is available inside the regex as \1, \2, etc. This allows you to match things like doubled letters:

  if ("beet" =~ /(\w)\1/) {
    print "The letter '$1' is doubled\n";  # 'e'
  }

This lets you build the regex as you go along (something a regular expression shouldn't have the ability to do). You can match words that have been repeated (by mistake, perhaps):

  if ("paris in the the spring" =~ /(\w+)\W+\1/) {
    print "The word '$1' was doubled.\n";  # 'the'
  }

The return value of a regex depends on the context the regex was evaluated in: scalar context will return a true (1) or false ('') value, depending on whether the regex succeeded or not; list context will return the values of $1, $2, etc. variables. This is particularly useful in changing these lines:

  $phone_number =~ /^(\d{3})-(\d{3})-(\d{4})$/;
  ($area, $exch, $rest) = ($1, $2, $3);

into:

  ($area, $exch, $rest) =
    $phone_number =~ /^(\d{3})-(\d{3})-(\d{4})$/;

This has a major advantage: if the regex does fail, the three variables will be set to undef each, since a regex that fails returns an empty list, in list context. Using the two-statement version, those three variables might get older values of $1, $2, and $3.

In addition to the simple (...), Perl also offers the funnier-looking (?:...) parentheses. These act like (...), but without the ``capturing'' effect.

  if ("Jeffrey" =~ /(?:\w\w)+(\w)/) {
    print "Last 'odd' letter: $1\n";
  }

Greed

The quantifiers in Perl act greedily -- they want to match as much as they can. At the same time, Perl wants to match as early in the string as possible. This is why "Jeff" =~ /e*/ matches the zero ``e''s at the beginning of the string, rather than the actual ``e'' in the string. However, a regex like "Jeff" =~ /J*/ will match the one ``J'' at the beginning of the string.

It is important to know that ``greed'' does not mean that the engine will look for the longest possible match (as can be seen by the above example). If you try to extract the longest string of ``a''s from a string with $str =~ /(a+)/, you'll find that Perl returns only the first selection that matches, not necessarily the longest. (To do this task requires using the /g modifier, and modifiers aren't covered by this article.)

Quantifiers can be made non-greedy by placing a ? after the quantifier. A non-greedy quantifier attempts to match as little as possible before matching more. Here is a comparison between greed and non-greed:

  "dandelion" =~ /(.*)n/;   # $1 is 'dandelio'
  "dandelion" =~ /(.*?)n/;  # $1 is 'da'

A common misconception about greed is its interaction with the $ anchor.

  "dandelion" =~ /d(.*?)$/;  # $1 is 'andelion'

Many beginners expect $1 to be ``elion'', because ``elion'' is shorter than ``andelion''. But greed doesn't change where the engine starts looking, it changes how much the engine tries matching. * by default tries to match an infinite number of times, as does +; {3,6} tries matching 6 times. When these are non-greedy, * matches 0 times at first, + matches 1 time at first, and {3,6} matches 3 times at first.

  "abacus" =~ /(\w{3,6})/;   # $1 is 'abacus'
  "abacus" =~ /(\w{3,6})?/;  # $1 is 'aba'

Extra features

Perl's regex engine boasts the most features of any regex flavor. There are dozens of assertions, such as a ``look-ahead'' ((?=pattern)) that allows you to match part of a regex without actually advancing in the string, and arbitrary code evaluation ((?{ code })) that lets you execute any block of code at a given point in a regex. These are beyond the scope of this article, but the Perl documentation has plenty of details in perlre.

A Complete Regex

We now have pretty much every tool necessary to write a real-application regex. The topic at hand is parsing mail headers, in order to filter out spam. The first thing we need to understand is the format of an email. There are two sections, the headers and the body, and they are separated by a blank line (a line consisting solely of a newline). The header lines look like ``Header: value''. We're interested in storing all these headers in a data structure (a hash is most suitable for this) and then checking particular headers for key phrases, like ``FREE'' or ``!!!!'' or ``STRONG BUY'' (I've been getting a rash of suggested stock market purchases lately).

(This code does not support duplicated headers (such as ``Received:'') or headers that span multiple lines.)

  # assuming the message is coming in on standard input
  my %headers;
  while (<STDIN>) {
    # /^$/ matches an empty line
    # which signals the end of headers
    last if /^$/;
    # extract the type of header and its value
    # or else go to the next line
    my ($type, $value) = /([^:]+): +(.*)/
      or next;
    $headers{$type} = $value;
  }

Now that our hash is formed, we can check certain fields for key phrases:

  if ($headers{From} =~ /\@hotmail\.com/) {
    # handle mail from hotmail.com
  }
  if ($headers{Subject} =~ /FREE|!{2,}|STRONG BUY/) {
    # "FREE", "STRONG BUY", or 2 or more !'s is a red flag
  }

That's a bare-bones email-header parser. There are far more robust solutions out there, but this is an example of a practical application of regexes.

The History and the Magic

Well, ok, the heading is a lie. Regexes aren't magic -- they're scientific at heart. From the first principles they were founded on, to the theory behind regex engines, math and science are the building blocks of the regex. Regex theory was developed several decades ago, but one man, Henry Spencer, put the theory into a distributed package, allowing programmers everywhere to use a standard set of regex syntax.

Spencer's Gifts

The name ``Henry Spencer'' should be as revered for users of regexes as the name ``Larry Wall'' is for Perl programmers, and ``Brian Kernighan'' and ``Dennis Ritchie'' are for C programmers. Although Spencer was the creator of the first distributed regex library, the theory behind regexes was formalized in the 1950s by Stephen Kleene. Kleene first used the term ``regular expressions'', and developed formal definitions of finite state machines (which are the heart of regex engines).

Regexes became a commonplace tool in the Unix environment early in Unix's lifetime. Early text editors provided them as a means of finding text and making changes to text without having to search through the text manually. Utilities such as grep, sed, and awk appeared in the 1970s, and have remained in distributions. Their names are as baffling as the syntax of regexes themselves. grep was derived from a command from ed, g/regular expression/p, which means ``global regular expression print'', and displays the lines of a file that match a given regex. sed is an acronym for ``stream editor''. awk got its name from the initials of the last names of its authors, Alfred Aho, Peter Weinberger, and Brian Kernighan.

It wasn't until 1986 that Spencer released his C library that things started making sense -- you see, each of these utilities and mini-languages had their own idea and support of regexes. grep and awk, although they both used regexes, had different syntaxes. Spencer's regex package allowed people to use a standard syntax for regexes.

Perl used its own regex library until version 2, when it began to use a derivative of Spencer's creation. Since then, that derivative has grown in size and power, and Perl's regex engine is capable of so much more than was originally anticipated. Many other languages now use PCREs, which are ``Perl-compatible regular expressions'' -- a subset of Perl's syntax.

Different Regex Engines

As if having different syntaces for regexes wasn't enough, there are also two major classifications of the engine behind a regex: DFA, a deterministic finite automaton, and NFA, a non-deterministic finite automaton.

NFA

An NFA engine takes a regex and tries to find text that matches the correct ``node'' of the regex. An example is matching /an(na|ts|nual)/ against the text ``the annual income''. The first node the engine sees is /a/, and it matches the ``a'' in ``annual''. The next node is the /n/ and the ``n'' is matched. The next node is /(na|ts|nual)/ which means ``match 'na' or 'ts' or 'nual''', in that order. An NFA engine would try these options in that order.

So the engine tries match /n/ and then /a/, but fails because ``annu'' does not match /anna/. So the engine tries the next option, /ts/. This one fails immediately because there is no ``t''. So then the engine tries /nual/, and matches. In this manner, the engine moves from one node in the regex to another, so NFAs are ``regex-directed'' (a term used by Jeffrey Friedl that I find suitable).

DFA

A DFA engine is ``text-directed''. As the regex matches, the engine keeps track of all possible ``paths''. In our example above, the engine would be watching all three branches of the /(na|ts|nual)/ at the same time. When the second ``n'' matched, the state of the engine would show /an(na|ts|nual)/; the boldness indicates what parts of the regex have matched so far. Thus, the engine knows that it could be preparing to match an ``a'' or a ``u'' next -- and depending on what it matches, one of those ``paths'' will be stopped.

To a DFA, the regex /ab+c/ and the regex /[a](bb*c|b+c)/ are the same, because the text determines where the ``pointer'' in the regex is located.

Alternation

The biggest noticeable difference between the two engines (other than the blinding speed of a DFA) is the way alternation is dealt with.

An NFA will examine options of an alternation left-to-right; a DFA doesn't really care, since it ends up being in all matching options of an alternation at the same time. A DFA is really in more than one place at one time. What's more is that a DFA chooses the longest successful match it found earliest in the string, whereas an NFA chooses the first successful match. This is an important difference.

Matching /jeff|jeffrey/ against the text ``jeffrey'' in an NFA will only match ``jeff'' -- that option is successful, and the regex ends. A DFA would match ``jeffrey'', since it was the longer of the two. /jeffrey|jeff/ would match ``jeffrey'' in both an NFA and a DFA.

What about Perl?

Perl uses a hybrid NFA. It's not really ``regular'', as far as regular expressions are concerned. This is because there are a handful of ways of crafting the regex as it matches. Perl is definitely not a DFA, though. Perl is part NFA, and part bizarre -- things like back-references and dynamically-built regexes are not technically allowed, but Perl supports them.

The Future

What does the future hold for regexes? Well, the Unicode specifications allow for you to embed set logic into character classes. Why is this important? How is it useful? Consider the character class of all letters except vowels; in Perl, you'd have to write [b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z]. That's annoying to look at, and more annoying to write. Come Perl 6, you'll be able to use the syntax [a-zA-Z^^[aeiouAEIOU]]. That might look noisy, but it's a bit easier to explain: ``the characters from 'a' to 'z' and 'A' to 'Z', except 'a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', and 'U'''. Consider too, all digits except 5: /[0-46-9]/, which can be [\d^^5].

You'll be able to calculate the intersections of character classes too. If you have two variables representing classes, and you want to match any character that they share, you can do [$this&&$that]. The fact that this can be done inside a regex (rather than constructing a separate character class manually from the two) makes you do less work, and leaves the work to Perl.

Perl has started supporting Unicode in regexes, and POSIX character classes too.

Perl 6 is going to have a slightly different regex syntax (as will be revealed in Larry Wall's 5th Apocolypse). The engine is being rewritten at the moment, and will probably have more features. One place of optimization is matching at the end of a string.

Sexeger

``Sexeger'' is ``regexes'' backwards, and it's also the name I've given my regex reversal project. There's an article on perl.com about it (just search for ``sexeger'') and I have some research on my web site about it too. The idea is that certain regexes, like /\s+$/, aren't optimized by Perl like we might expect them to be. /\s+$/ does not look for whitespace at the end of the string; instead, it looks for whitespace, and then checks to see if it's at the end of the string.

My project is to parse a regex into a structure, reverse the structure, and then recreate the regex. The regex will then match against the reversed string, and it should be much faster. The regex /\s+$/ reverses to /^\n?\s+/, which my parser will compress to /^\s+/ (since \n belongs to \s).

A useful application of this technique is matching the last occurrence of a pattern in a string. To match the last set of digits in a string, you need to use a regex like /(\d+)\D*$/. That has the down-side of trying to match at every block of digits in the string. If we reverse the string and reverse the regex to /^\D*(\d+)/, then the first match will be the one we want, and all that is left to do is reverse $1.

The root of this project is the Regexp::Parser module, which will be the parent module of Regexp::Reverse and a handful of other useful utilities.

The Distant Future

Regexes will always exist in some form, whether it be cryptic (/(?\t+)/>) or less cryptic (/<grab:<tab>+>/). The need for matching text will never disappear.

Resources

http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?regular+expression
A little history on regexes from the Free On-Line Dictionary Of Computing.

``Mastering Regular Expressions'', Jeffrey Friedl, O'Reilly, 1997
A lot of history on regexes, and information on DFA vs. NFA, from an incredible book by a regex master.

perlre, from the Perl standard documentation
Explanation of Perl's regex syntax.

Author

Jeff ``japhy'' Pinyan has been programming Perl for five years. He's an active member of the DALnet #perl channel, the http://www.perlmonks.org/ community, and the Perl5-Porters. He's working on a regex book for Manning, to be completed later this year.