PROBLEM

Expanding links in a simple tag system.


SYNOPSIS

Many sites, in an effort to not allow raw HTML to be used, use their own tag system. One such site, http://www.perlmonks.org/, allows some HTML syntax, but uses square brackets ("[" and "]") to simplify the creation of links.

Some links have a specific protocol assigned to them ("[id://100]"), while others do not ("[japhy]"). In addition there can be text attached to a link ("[id://100|this article]", "[japhy|me]"). These are supposed to be rendered as HTML <A> tags.

Input

  [japhy|I]'ve got a [id://500|good article] about the [unlink] function.

Output

  <a href="/index.pl?node=japhy">I</a>'ve got a
  <a href="/index.pl?node_id=500">good article</a> about
  the <a href="/index.pl?node=unlink">unlink</a> function.


EXPLANATION

Our goal is to match a tag, which is from an opening bracket to the first closing bracket. We don't need to worry about nesting, because that concept simply doesn't exist here.

The definition of a "protocol" for the tag is word characters followed by a colon and two slashes, much like the "http://" and "ftp://" you are probably familiar with. If there is no protocol for a tag, we use a default link (the "/index.pl?node=" link); there is a protocol, we'll use a hash to determine what link to use (the "id" protocol uses "/index.pl?node_id=").

First, we build a hash of protocol-to-link:

  my %HOW = (
    id => "/index.pl?node_id=",
    http => "http://";,
    https => "https://",
    google => "http://www.google.com/search?q=";,
    # ...
    DEF => "/index.pl?node_id=",
  );

For readability's sake, let's construct a few regexes:

  my $proto = qr/\w+/;       # word chars
  my $content = qr/[^]|]+/;  # non-], non-|
  my $text = qr/[^]]+/;      # non-]

Now, we can do one substitution or two; if we do two, we need to look for tags with protocols first:

  $chunk =~ s{
    \[
      ($proto) :// ($content)
      (?: \| ($text) )?
    \]
  }{
    defined($3) ?
      qq{<a href="$HOW{$1}$2">$3</a>} :
      qq{<a href="$HOW{$1}$2">$2</a>};
  }egx;

The second regex takes care of plain tags:

  $chunk =~ s{
    \[
      ($content)
      (?: \| ($text) )?
    \]
  }{
    defined($2) ?
      qq{<a href="$HOW{DEFAULT}$1">$2</a>} :
      qq{<a href="$HOW{DEFAULT}$1">$1</a>};
  }egx;

The single regex approach is not terrible-looking, though.

  $chunk =~ s{
    \[
      (?: ($proto) :// )? ($content)
      (?: \| ($text) )?
    \]
  }{
    my $p = defined($1) ? $1 : "DEFAULT";
    defined($3) ?
      qq{<a href="$HOW{$p}$2">$3</a>} :
      qq{<a href="$HOW{$p}$2">$2</a>};
  }egx;

The benefit of using variables containing patterns is clearly seen here. We have used aptly named variables in place of confusing patterns.


SEE ALSO

Other approaches.


CREDIT

jeffa at http://www.perlmonks.org/