PROBLEM

Parsing and analyzing a URI.


SYNOPSIS

You have a variable containing a URI (uniform resource identifier) and you want to take it apart and do something with it.

Input

  http://www.foo.com/alpha/bet.cgi?t=3#top

Output

  http
  www.foo.com
  /alpha/bet.cgi
  t=3
  top


EXPLANATION

RFCs 2396 and 2732 specify the format of a URI (both have since been superseded by RFC 3986, which keeps the same general syntax). We want to extract certain parts of a URI and examine them. A URI, in general, has the form "<scheme>://<authority><path>?<query>#<fragment>". Appendix B of RFC 2396 gives us a regex to use:

  m{
    ^
    (?: ( [^:/?#]+ ): )?   # scheme
    (?: // ( [^/?#]* ) )?  # authority
    ( [^?#]* )             # path
    (?: \? ( [^#]* ) )?    # query
    (?: \# (.*) )?         # fragment
  }x

In list context, a match against this regex returns its five captures. Any component that is absent from the URI comes back as undef. First, let's wrap the regex in a function for convenience:

  sub uri_parse {
    my ($uri) = @_;
    # In list context the match returns the five captures, in order.
    return $uri =~ m{
      ^
      (?: ( [^:/?#]+ ): )?   # scheme
      (?: // ( [^/?#]* ) )?  # authority
      ( [^?#]* )             # path
      (?: \? ( [^#]* ) )?    # query
      (?: \# (.*) )?         # fragment
    }x;
  }

This returns true or false in scalar context, and a list of five scalars in list context. Now we can work with a URI:

  my ($scheme, $auth, $path, $query, $frag) =
    uri_parse('http://www.foo.com/alpha/bet.cgi?t=3#top');
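
To make the list-context behaviour concrete, here is a minimal sketch that prints each component by name. The function is repeated so the block runs on its own; the mailto address is only an illustrative value, chosen because it has no authority, query, or fragment.

```perl
use strict;
use warnings;

# Same pattern as uri_parse above, repeated so this block is self-contained.
sub uri_parse {
  my ($uri) = @_;
  return $uri =~ m{
    ^
    (?: ( [^:/?#]+ ): )?   # scheme
    (?: // ( [^/?#]* ) )?  # authority
    ( [^?#]* )             # path
    (?: \? ( [^#]* ) )?    # query
    (?: \# (.*) )?         # fragment
  }x;
}

my @names = qw(scheme authority path query fragment);

for my $uri ('http://www.foo.com/alpha/bet.cgi?t=3#top',
             'mailto:someone@example.com') {   # illustrative address
  my @parts = uri_parse($uri);
  print "$uri\n";
  for my $i (0 .. $#names) {
    # Components that are absent from the URI come back as undef.
    my $value = defined $parts[$i] ? $parts[$i] : '(undef)';
    printf "  %-9s %s\n", $names[$i], $value;
  }
}
```

For the mailto URI, only the scheme and path are defined; the other three slots are undef, so test them with defined before using them.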

We also want a sister function to put the pieces back together.

  sub uri_create {
    my ($scheme, $auth, $path, $query, $frag) = @_;
    my $uri = '';   # start from an empty string so the result is always defined
    $uri .= "$scheme:" if defined $scheme;
    $uri .= "//$auth"  if defined $auth;
    $uri .= $path      if defined $path;
    $uri .= "?$query"  if defined $query;
    $uri .= "#$frag"   if defined $frag;
    return $uri;
  }

Now we can break a URI apart, change something, and put it back together. This is only safe for simple cases, though. Characters such as "?", "#", and spaces have to be escaped when they appear inside a component, and neither function does any escaping or unescaping. For URIs whose parts may contain such characters, we need a far more robust solution.
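
The following sketch shows both the simple round trip and the escaping problem. Both functions are repeated so the block runs on its own. The helper escape_component is hypothetical (it is not part of the recipe), and it assumes its input is plain ASCII or an already-encoded byte string.

```perl
use strict;
use warnings;

sub uri_parse {
  my ($uri) = @_;
  return $uri =~ m{
    ^
    (?: ( [^:/?#]+ ): )?   # scheme
    (?: // ( [^/?#]* ) )?  # authority
    ( [^?#]* )             # path
    (?: \? ( [^#]* ) )?    # query
    (?: \# (.*) )?         # fragment
  }x;
}

sub uri_create {
  my ($scheme, $auth, $path, $query, $frag) = @_;
  my $uri = '';
  $uri .= "$scheme:" if defined $scheme;
  $uri .= "//$auth"  if defined $auth;
  $uri .= $path      if defined $path;
  $uri .= "?$query"  if defined $query;
  $uri .= "#$frag"   if defined $frag;
  return $uri;
}

# Hypothetical helper: percent-encode everything except unreserved characters.
# Assumes ASCII or byte-string input.
sub escape_component {
  my ($text) = @_;
  $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
  return $text;
}

# Simple case: replace the query and rebuild.
my @parts = uri_parse('http://www.foo.com/alpha/bet.cgi?t=3#top');
$parts[3] = 't=4';
my $changed = uri_create(@parts);
print "$changed\n";                 # http://www.foo.com/alpha/bet.cgi?t=4#top

# Problem case: an unescaped '#' in the query value, and no fragment.
@parts[3, 4] = ('tag=c#', undef);
my @again = uri_parse(uri_create(@parts));
print "query=$again[3]\n";          # query=tag=c  (the '#' started a fragment)

# Escaping the value first lets it survive the round trip.
$parts[3] = 'tag=' . escape_component('c#');
my @fixed = uri_parse(uri_create(@parts));
print "query=$fixed[3]\n";          # query=tag=c%23
```

The helper only covers the encoding half of the problem. Decoding, per-component rules, and non-ASCII text are left out, which is why a dedicated module is the better choice for real work.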

Luckily, such a solution exists...


SEE ALSO

The URI module is superb for working with all sorts of URIs, and even making up your own. It is not part of the core Perl distribution, but it is available from CPAN and is a suggested addition to your Perl library. It handles several schemes, and has many useful methods for extracting very specific pieces of a URI. For example, some FTP URIs include username and password information that our function doesn't extract as separate elements (ftp://user:pass@ftp.server.com/pub/README).

The URI module uses regexes based on the RFCs mentioned above.
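
As a short sketch of what that looks like, the block below uses the URI module on the two URIs from this recipe. It assumes URI has been installed from CPAN. The "tag" parameter at the end is made up for illustration.

```perl
use strict;
use warnings;
use URI;   # from CPAN; not part of the core Perl distribution

# The same five components as uri_parse, via accessor methods.
my $http = URI->new('http://www.foo.com/alpha/bet.cgi?t=3#top');
print join("\n", $http->scheme, $http->authority, $http->path,
                 $http->query, $http->fragment), "\n";

# Scheme-specific accessors split the authority further.
my $ftp = URI->new('ftp://user:pass@ftp.server.com/pub/README');
print $ftp->user,     "\n";   # user
print $ftp->password, "\n";   # pass
print $ftp->host,     "\n";   # ftp.server.com

# query_form escapes keys and values when building the query string.
my $search = URI->new('http://www.foo.com/alpha/bet.cgi');
$search->query_form(tag => 'c#');   # 'tag' is an illustrative parameter
print "$search\n";
```

A URI object stringifies to the full URI, so it can be printed or interpolated wherever a plain string was used before.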