/^([^aeiou])[aeiou]\1\1$/
To the data-processing programmers among us, the terms ``CSV'', ``delimiter'', ``munge'', and ``massage'' are common and well-known. But the truly fortunate of these data-processors know the term ``regular expression'', which I lovingly (if convenience is grounds for love) abbreviate as ``regex''. A regex is a programming language unto itself -- it has input (some string to execute upon), the source (the physical regex), and output (stating whether the regex matched, and perhaps what text was matched).
Many languages offer regex support... some better than others. But no matter
the support, regexes offer us a hyper-reductionalist approach to matching text
that conforms to some pattern. That's what the language of regexes is for
-- they apply a pattern to a piece of data and see what results. Regexes are
a big function: ``does this data conform to this rule?'' The benefit of regexes
is that they are small and concise; anyone who has tried parsing text files
with C knows the effort involved: countless lines of code, special cases,
loops, switch statements, backtracking to correct errors, keeping track of
pointers to strings... it's a headache. Worse, it's a migraine.
Regexes are a headache, but they're a much less severe kind. Regexes are a headache because they look like gibberish, but they get the job done with much less effort. Consider this seemingly simple task:
Parse a string whose fields are separated by commas. However, commas can appear in quoted strings where they are not seen as field separators. A quote can be embedded in a quoted string by doubling it.
That's one possible definition of a list of comma-separated values, or CSV for short. (It's important not to confuse CSV, which is a common data storage method, with CVS, the drugstore down the street. You don't need me to tell you hard it is to match CVS with a regex.) The task of parsing CSV with a language like C is not simple, because of the special cases. If you use a regex (well, two, actually) you can get the job done quickly and easily. The most important feature of regexes is that once you know how to write them, you can write them in a mere fraction of the time it would take to create the huge parsing engine that they stand for.
# Perl code for parsing CSV (as per the above definition)
$string = 'this,is,"good ""Perl"" code",for,"parsing CSV"';
@fields = $string =~ /(?:^|,)("(?:[^"]+|"")*"|[^,]*)/g;
for (@fields) {
if (s/^"//) { s/"$//; s/""/"/g; }
}
# @fields =
# ('this', 'is', 'good "Perl" code', 'for', 'parsing CSV')
If the code above might seem confusing, it's only because you haven't been initiated into the Order of the Regex. The Rites and Rituals to follow will help you enter into this mystical society. If you wish to parse your data in an easier way, and spread understanding of the Great Spencerian Language, then by all means, read on.
/^([^aeiou])[aeiou]\1\1$/A regex is a program. A really tiny program. Its ``functions'' are one to three characters in length. Add to that the provision that alphanumeric characters are (more or less) off-limits, and you have ``line noise in a can''. The task of understanding a regex requires you to know what symbols mean what, and where they mean it.
By the way, the regex in the header for this section matches strings like ``sass'', ``5e55'', ``lull'', and ``#o##''.
In Perl, a regex is commonly bounded by the / characters: m/pattern/
denotes a pattern match. The m is optional, and I'll be omitting it. To
match a pattern against a string, you use the =~ operator:
if ($text =~ /pattern/) {
# the text was matched by the pattern
}
To invert the test (meaning, to see if the text is not matched by the pattern)
you can use the !~ operator.
If you are evaluating a regex on $_, you can leave off the $_ =~ part of
the expression:
if ($_ =~ /pattern/) { ... }
# is synonymous to
if (/pattern/) { ... }
The simplest symbols in a regex are plain-text characters; the alphanumerics
and the underscore. These match themselves, plain and simple. The regex
/jeff/ will try to find the string ``jeff'' in whatever input it is given;
the regex engine looks for a 'j' followed by an 'e' followed by an 'f' followed
by another 'f'. If that can be done, the pattern has matched successfully; if
it cannot be done, the pattern has failed.
print "Enter your name: "; chomp($name = <STDIN>);
if ($name =~ /J/) {
print "Your name has a capital J in it!\n";
}
That simple program will print an obvious statement when the user enters text
like ``Jeff'' or ``James'' or ``Abdul Jamal''; no message will be given for text like
``David'' or ``Becky'' or ``jar jar binks''. (No self-respecting program would ever
warrant ``jar jar binks'' with a response.) As you can tell, regexes are case
sensitive -- the pattern /J/ matches a capital ``J'' and not a lowercase
``j''.
Now we will see the mystic runes that mean more than they seem to be; they are called metacharacters. There are generally 13 of them:
| [ . * + ? { } \ ^ $ ( )
The \ metacharacter is the easiest to explain. Use it to match any of the
actual metacharacters; \* matches a literal ``*'', and \\ matches a
literal ``\''.
It'd be a real shame to have to use more than one regex to see if the name entered has a ``J'' or a ``j'' in it.
# find a uppercase or lowercase "j"
if ($name =~ /J/ or $name =~ /j/) { ... }
It's important to have a means of specifying options for a regex to take in the
course of matching text. That is why alternation exists, by way of the |
character (vertical-bar, pipe, whatever you want to call it). | is read as
``or'' to a regex. /J|j/ means ``match a 'J' or a 'j'''. You can use more
than one alternation in a regex:
# find a uppercase or lowercase "j"
if ($name =~ /J|j/) { ... }
if ($name =~ /Jeff|James|Abdul Jamal/) {
print "Excellent choice for a name, sir.\n";
}
It's important to see how ``weak'' the | metacharacter is. The regex
/Jeff|James/ does not mean ``match Jeffames or JefJames'' -- the | isn't a
choice between the ``f'' or the ``J''; rather, it is a choice between everything
else, a choice between the ``Jeff'' or the ``James''.
If you do want to have a choice of one of a set of characters, there is another
option: character classes. These are sets of characters found between [
and ]. As an example, the character class [aeiou] matches any of those
five characters -- lowercase vowels. Ranges of characters can be abbreviated
with a hyphen, so [a-z] is shorthand for ``all the lowercase consonants''. A
class like [A-Za-z0-9] matches any alphabetic character or digit. Matching
a literal ``['', ``]'', or ``-'' in a character class usually involves backslashing
it -- putting a backslash (``\'') in front of it. The class [\[\]] matches a
``['' or a ``]''.
# find a uppercase or lowercase "j"
if ($name =~ /[Jj]/) { ... }
# matches "Jeffames" or "JefJames"
if ($name =~ /Jef[fJ]ames/) { ... }
You can invert the sense of a character class by placing ^ as the first
character in it. This means ``not any of ...''; the class [^aeiou] would
match any character (not just any letter, mind you) except ``a'', ``e'', ``i'', ``o'',
or ``u''.
# matches, Jaff, Jiff, J!ff, JJff, etc.
if ($name =~ /J[^e]ff/) {
print "What a bizarre name you have.\n";
}
Every now and then, you'll want your regex to match a character -- any
character, really. Well, any character except newline. Anyway, Perl provides
the . metacharacter as a shortcut for the character class [^\n]. It will
match any non-newline character.
# "J", something, "f", "f"
if ($name =~ /J.ff/) { ... }
It would be an even greater shame if you couldn't write a regex to match four
letters in a row without having to say /[a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z]/.
That is why quantifiers exist -- they allow you to express repetition of a
symbol in your regex. /[a-zA-Z]{4}/ matches four letters in a row.
There are four quantifiers: *, which matches zero or more; +, which
matches one or more; ?, which matches zero or one; and {n,m}, which is
used for more specific ranges. Quantifiers proceed the item the quantify --
/a*/ matches zero or more ``a''s. They bind very tightly (as compared to
| which is very weak); /ab*c/ matches an ``a'' followed by zero or more
``b''s followed by a ``c''.
* quantifier often causes confusion to beginners -- it can successfully
match by matching zero times. That means that "Jeff" =~ /a*/ returns
true; ``Jeff'' contains at least zero consecutive ``a''s (there are zero ``a''s at
the front of the string, right before the ``J''). Even more surprising, though,
is that "Barry" =~ /a*/ returns true, but only because there are zero ``a''s
at the front of that string! Since /a*/ has the ability to match zero
times, it'll do it as soon as possible, which means as early in the string as
possible. The regex /a*r+/, on the other hand, matches successfully at the
``arr'' in ``Barry''. But beware, /a*r*/ would match at the beginning of both
strings -- this is the trickery of the * quantifier.
+ quantifier is probably less of a nuisance, because it requires at least
one occurrence. "Jeff" =~ /a+/ fails, because there is not at least one
consecutive ``a'' in ``Jeff''. "Barry" =~ /a+/ succeeds as one would expect.
? quantifier allows something to be matched, but optionally. To match
both ``boat'' and ``bat'', for example, you can use the regex /bo?at/.
{n,m} quantifier is certainly the most versatile -- it comes in three
forms. First, {n} matches exactly n times; /jef{2}/ matches ``jeff''.
Second, {n,} matches at least n times; /jef{2,}/ matches ``jeff'',
``jefff'', ``jeffffff'', etc. Third, {n,m} matches at least n times, and no
more than m times; /jef{2,4}/ matches ``jeff'', ``jefff'', and ``jefff''.
An interesting thing to note about the { and } metacharacters is that
they're only metacharacters when they themselves match a certain pattern!
The regex /{a}/ is not a syntax error -- it matches the text ``{a}''. In the
same way, /x{1,2,3}/ matches ``x{1,2,3}''. Perhaps { and } are really
metametacharacters.
Quantifiers are nifty, and so is alternation, but we have a bit of a problem.
How can we express ``match 'Jeffrey' or 'Jeff' or 'James''' without writing it
all out? Sure, /Jeffrey|Jeff|James/ works, but we really want to say that
the ``rey'' part is optional. We can't just put a ? after the ``y'', because
then only the ``y'' is optional; and we can't put a ? after the ``r'', ``e'', and
``y'', because then they are only separately optional. We don't want to match
``Jeffry'' or ``Jeffe''.
For this reason, grouping parentheses allow us to partition off a set of
symbols as a sub-pattern, to be treated as one big symbol. This allows us to
place a quantifier on a group of symbols: /Jeff(rey)?|James/. We could
even take this a step further and do /J(eff(rey)?|ames)/.
In our early example about looking for the letter ``J'' in a string, we were matching the ``J'' anywhere in the string -- n'importe où, as the French would say (if they say that sort of thing). But what if we wanted to ensure the string started with the letter ``J''? Or ended with the letter ``f''? For that, we would need to use anchors.
Just like a real anchor does for a ship, regex anchors ensure the location of a
where a regex can match in a string. The two anchors you'll see most often are
^ and $, which match at the beginning and end of a string (respectively).
The regex /^J/ will match any string that starts with ``J'', such as ``Jeff''
or ``James'', but not ``Abdul Jamal''. Likewise, /f$/ matches any string that
ends with ``f'', such as ``Jeff'' or ``Biff'', but not ``Jeffrey'' or ``Buffy''.
Both anchors can be used together -- I can use /^Jeff$/ to make sure the
string entered was ``Jeff''.
Caveat regexor! The $ anchor means a bit more than just ``end of string'';
it also matches right before a newline at the end of a string. That means
that both ``Jeff'' and ``Jeff\n'' are matched by /^Jeff$/, but not ``Jeff\n\n''.
Perl's regex syntax offers convenient shortcuts for common character classes.
Digits, matched by [0-9], can also be matched by \d. Word characters,
which include alphanumerics, digits, and the underscore (``_''), can be matched
by \w. This has a hidden convenience, in that characters in your local
alphabet (such as ù in France) are matched by \w, while they are
not matched by [a-z]. All forms of whitespace, \n (newline), \t
(tab), \r (carriage return), \f (form feed), and a literal space, are
matched by the \s shortcut.
# see if a phone number looks good (no area code, though)
if ($phone_number =~ /^\d{3}-\d{4}$/) { ... }
# a phone number with area code!
if ($phone_number =~ /^\d{3}-\d{3}-\d{4}$/) { ... }
# a phone number with area code (another way)
if ($phone_number =~ /^(\d{3}-){2}\d{4}$/) { ... }
These shortcuts can all be inverted (as one would invert a character class with
a ^) by capitalizing the letter; \D matches non-digits, and so on.
The parentheses used to group sections of a regex do more than just that, they
also capture the text matched by the sub-pattern. This text is stored in
variables $1, $2, $3, etc., where each variable $N represents the
Nth set of parentheses.
if ($phone_number =~ /^(\d{3})-(\d{3})-(\d{4})$/) {
print "Area code: $1\n";
print "Exchange: $2\n";
print "The rest: $3\n";
}
Each successfuly regex sets these variables; a failed regex doesn't change their values.
"Jeff" =~ /\w(\w)/; # sets $1 to "e" "japhy" =~ /\d(\d)/; # $1 stays "e"
In addition to the $1 variable, the contents of what a parenthesized group
has matched is available inside the regex as \1, \2, etc. This allows
you to match things like doubled letters:
if ("beet" =~ /(\w)\1/) {
print "The letter '$1' is doubled\n"; # 'e'
}
This lets you build the regex as you go along (something a regular expression shouldn't have the ability to do). You can match words that have been repeated (by mistake, perhaps):
if ("paris in the the spring" =~ /(\w+)\W+\1/) {
print "The word '$1' was doubled.\n"; # 'the'
}
The return value of a regex depends on the context the regex was evaluated in:
scalar context will return a true (1) or false ('') value, depending on whether
the regex succeeded or not; list context will return the values of $1, $2,
etc. variables. This is particularly useful in changing these lines:
$phone_number =~ /^(\d{3})-(\d{3})-(\d{4})$/;
($area, $exch, $rest) = ($1, $2, $3);
into:
($area, $exch, $rest) =
$phone_number =~ /^(\d{3})-(\d{3})-(\d{4})$/;
This has a major advantage: if the regex does fail, the three variables
will be set to undef each, since a regex that fails returns an empty list,
in list context. Using the two-statement version, those three variables might
get older values of $1, $2, and $3.
In addition to the simple (...), Perl also offers the funnier-looking
(?:...) parentheses. These act like (...), but without the ``capturing''
effect.
if ("Jeffrey" =~ /(?:\w\w)+(\w)/) {
print "Last 'odd' letter: $1\n";
}
The quantifiers in Perl act greedily -- they want to match as much as they
can. At the same time, Perl wants to match as early in the string as possible.
This is why "Jeff" =~ /e*/ matches the zero ``e''s at the beginning of the
string, rather than the actual ``e'' in the string. However, a regex like
"Jeff" =~ /J*/ will match the one ``J'' at the beginning of the string.
It is important to know that ``greed'' does not mean that the engine will look
for the longest possible match (as can be seen by the above example). If you
try to extract the longest string of ``a''s from a string with $str =~ /(a+)/,
you'll find that Perl returns only the first selection that matches, not
necessarily the longest. (To do this task requires using the /g modifier,
and modifiers aren't covered by this article.)
Quantifiers can be made non-greedy by placing a ? after the quantifier. A
non-greedy quantifier attempts to match as little as possible before matching
more. Here is a comparison between greed and non-greed:
"dandelion" =~ /(.*)n/; # $1 is 'dandelio' "dandelion" =~ /(.*?)n/; # $1 is 'da'
A common misconception about greed is its interaction with the $ anchor.
"dandelion" =~ /d(.*?)$/; # $1 is 'andelion'
Many beginners expect $1 to be ``elion'', because ``elion'' is shorter than
``andelion''. But greed doesn't change where the engine starts looking, it
changes how much the engine tries matching. * by default tries to match
an infinite number of times, as does +; {3,6} tries matching 6 times.
When these are non-greedy, * matches 0 times at first, + matches 1 time
at first, and {3,6} matches 3 times at first.
"abacus" =~ /(\w{3,6})/; # $1 is 'abacus'
"abacus" =~ /(\w{3,6})?/; # $1 is 'aba'
Perl's regex engine boasts the most features of any regex flavor. There are
dozens of assertions, such as a ``look-ahead'' ((?=pattern)) that allows
you to match part of a regex without actually advancing in the string, and
arbitrary code evaluation ((?{ code })) that lets you execute any block of
code at a given point in a regex. These are beyond the scope of this article,
but the Perl documentation has plenty of details in perlre.
We now have pretty much every tool necessary to write a real-application regex. The topic at hand is parsing mail headers, in order to filter out spam. The first thing we need to understand is the format of an email. There are two sections, the headers and the body, and they are separated by a blank line (a line consisting solely of a newline). The header lines look like ``Header: value''. We're interested in storing all these headers in a data structure (a hash is most suitable for this) and then checking particular headers for key phrases, like ``FREE'' or ``!!!!'' or ``STRONG BUY'' (I've been getting a rash of suggested stock market purchases lately).
(This code does not support duplicated headers (such as ``Received:'') or headers that span multiple lines.)
# assuming the message is coming in on standard input
my %headers;
while (<STDIN>) {
# /^$/ matches an empty line
# which signals the end of headers
last if /^$/;
# extract the type of header and its value
# or else go to the next line
my ($type, $value) = /([^:]+): +(.*)/
or next;
$headers{$type} = $value;
}
Now that our hash is formed, we can check certain fields for key phrases:
if ($headers{From} =~ /\@hotmail\.com/) {
# handle mail from hotmail.com
}
if ($headers{Subject} =~ /FREE|!{2,}|STRONG BUY/) {
# "FREE", "STRONG BUY", or 2 or more !'s is a red flag
}
That's a bare-bones email-header parser. There are far more robust solutions out there, but this is an example of a practical application of regexes.
Well, ok, the heading is a lie. Regexes aren't magic -- they're scientific at heart. From the first principles they were founded on, to the theory behind regex engines, math and science are the building blocks of the regex. Regex theory was developed several decades ago, but one man, Henry Spencer, put the theory into a distributed package, allowing programmers everywhere to use a standard set of regex syntax.
The name ``Henry Spencer'' should be as revered for users of regexes as the name ``Larry Wall'' is for Perl programmers, and ``Brian Kernighan'' and ``Dennis Ritchie'' are for C programmers. Although Spencer was the creator of the first distributed regex library, the theory behind regexes was formalized in the 1950s by Stephen Kleene. Kleene first used the term ``regular expressions'', and developed formal definitions of finite state machines (which are the heart of regex engines).
Regexes became a commonplace tool in the Unix environment early in Unix's
lifetime. Early text editors provided them as a means of finding text and
making changes to text without having to search through the text manually.
Utilities such as grep, sed, and awk appeared in the 1970s, and have
remained in distributions. Their names are as baffling as the syntax of
regexes themselves. grep was derived from a command from ed,
g/regular expression/p, which means ``global regular expression print'', and
displays the lines of a file that match a given regex. sed is an acronym
for ``stream editor''. awk got its name from the initials of the last names
of its authors, Alfred Aho, Peter Weinberger, and Brian Kernighan.
It wasn't until 1986 that Spencer released his C library that things started making sense -- you see, each of these utilities and mini-languages had their own idea and support of regexes. grep and awk, although they both used regexes, had different syntaxes. Spencer's regex package allowed people to use a standard syntax for regexes.
Perl used its own regex library until version 2, when it began to use a derivative of Spencer's creation. Since then, that derivative has grown in size and power, and Perl's regex engine is capable of so much more than was originally anticipated. Many other languages now use PCREs, which are ``Perl-compatible regular expressions'' -- a subset of Perl's syntax.
As if having different syntaces for regexes wasn't enough, there are also two major classifications of the engine behind a regex: DFA, a deterministic finite automaton, and NFA, a non-deterministic finite automaton.
An NFA engine takes a regex and tries to find text that matches the correct
``node'' of the regex. An example is matching /an(na|ts|nual)/ against the
text ``the annual income''. The first node the engine sees is /a/, and it
matches the ``a'' in ``annual''. The next node is the /n/ and the ``n'' is
matched. The next node is /(na|ts|nual)/ which means ``match 'na' or 'ts'
or 'nual''', in that order. An NFA engine would try these options in that
order.
So the engine tries match /n/ and then /a/, but fails because ``annu'' does
not match /anna/. So the engine tries the next option, /ts/. This one
fails immediately because there is no ``t''. So then the engine tries /nual/,
and matches. In this manner, the engine moves from one node in the regex to
another, so NFAs are ``regex-directed'' (a term used by Jeffrey Friedl that I
find suitable).
A DFA engine is ``text-directed''. As the regex matches, the engine keeps track
of all possible ``paths''. In our example above, the engine would be watching
all three branches of the /(na|ts|nual)/ at the same time. When the second
``n'' matched, the state of the engine would show
/an(na|ts|nual)/; the boldness indicates what
parts of the regex have matched so far. Thus, the engine knows that it could
be preparing to match an ``a'' or a ``u'' next -- and depending on what it matches,
one of those ``paths'' will be stopped.
To a DFA, the regex /ab+c/ and the regex /[a](bb*c|b+c)/ are the same,
because the text determines where the ``pointer'' in the regex is located.
The biggest noticeable difference between the two engines (other than the blinding speed of a DFA) is the way alternation is dealt with.
An NFA will examine options of an alternation left-to-right; a DFA doesn't really care, since it ends up being in all matching options of an alternation at the same time. A DFA is really in more than one place at one time. What's more is that a DFA chooses the longest successful match it found earliest in the string, whereas an NFA chooses the first successful match. This is an important difference.
Matching /jeff|jeffrey/ against the text ``jeffrey'' in an NFA will only
match ``jeff'' -- that option is successful, and the regex ends. A DFA would
match ``jeffrey'', since it was the longer of the two. /jeffrey|jeff/ would
match ``jeffrey'' in both an NFA and a DFA.
Perl uses a hybrid NFA. It's not really ``regular'', as far as regular expressions are concerned. This is because there are a handful of ways of crafting the regex as it matches. Perl is definitely not a DFA, though. Perl is part NFA, and part bizarre -- things like back-references and dynamically-built regexes are not technically allowed, but Perl supports them.
What does the future hold for regexes? Well, the Unicode specifications allow
for you to embed set logic into character classes. Why is this important? How
is it useful? Consider the character class of all letters except vowels; in
Perl, you'd have to write [b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z]. That's annoying
to look at, and more annoying to write. Come Perl 6, you'll be able to use the
syntax [a-zA-Z^^[aeiouAEIOU]]. That might look noisy, but it's a bit easier
to explain: ``the characters from 'a' to 'z' and 'A' to 'Z', except 'a', 'e',
'i', 'o', 'u', 'A', 'E', 'I', 'O', and 'U'''. Consider too, all digits except
5: /[0-46-9]/, which can be [\d^^5].
You'll be able to calculate the intersections of character classes too. If
you have two variables representing classes, and you want to match any character
that they share, you can do [$this&&$that]. The fact that this can be done
inside a regex (rather than constructing a separate character class manually
from the two) makes you do less work, and leaves the work to Perl.
Perl has started supporting Unicode in regexes, and POSIX character classes too.
Perl 6 is going to have a slightly different regex syntax (as will be revealed in Larry Wall's 5th Apocolypse). The engine is being rewritten at the moment, and will probably have more features. One place of optimization is matching at the end of a string.
``Sexeger'' is ``regexes'' backwards, and it's also the name I've given my regex
reversal project. There's an article on perl.com about it (just search for
``sexeger'') and I have some research on my web site about it too. The idea is
that certain regexes, like /\s+$/, aren't optimized by Perl like we might
expect them to be. /\s+$/ does not look for whitespace at the end of the
string; instead, it looks for whitespace, and then checks to see if it's at
the end of the string.
My project is to parse a regex into a structure, reverse the structure, and then
recreate the regex. The regex will then match against the reversed string, and
it should be much faster. The regex /\s+$/ reverses to /^\n?\s+/, which
my parser will compress to /^\s+/ (since \n belongs to \s).
A useful application of this technique is matching the last occurrence of a
pattern in a string. To match the last set of digits in a string, you need to
use a regex like /(\d+)\D*$/. That has the down-side of trying to match at
every block of digits in the string. If we reverse the string and reverse the
regex to /^\D*(\d+)/, then the first match will be the one we want, and
all that is left to do is reverse $1.
The root of this project is the Regexp::Parser module, which will be the
parent module of Regexp::Reverse and a handful of other useful utilities.
Regexes will always exist in some form, whether it be cryptic (/(?\t+)/>) or
less cryptic (/<grab:<tab>+>/). The need for matching text
will never disappear.
Jeff ``japhy'' Pinyan has been programming Perl for five years. He's an active
member of the DALnet #perl channel, the http://www.perlmonks.org/ community,
and the Perl5-Porters. He's working on a regex book for Manning, to be
completed later this year.