PROBLEM

You have a C or C++ file and you want to extract or remove the comments.


SYNOPSIS

C files can contain comments between the two markers "/*" and "*/"; C++ files use that format and the "//" comment marker, which goes to the end of the line. We wish to extract the comments, and perhaps remove them from the programs.

Input

  // the following is not a comment...
  printf("/* comment */\n");
  /* but this " is
   * and it spans two lines */

Output

  [ "C++", " the following is not a comment..." ],
  [ "C", " but this \" is\n * and it spans two lines "],


EXPLANATION

Comments are often a helpful addition to programs, explaining what a certain block of code does. We've seen that regexes themselves can have comments embedded to allow readers to understand even the most daunting examples. In this case, we have a C (or C++) program, and wish to extract the comments for some other use. One real-life case was posted on the comp.lang.perl.misc newsgroup, asking how to remove the comments while replacing them with blank lines as necessary, so that the file's line count stays the same.

The task cannot be accomplished merely with a regex such as

  push @comments, $code =~ m{
    /\* .*? \*/  # /*...*/ C-style
    |            # or
    // [^\n]*    # //... C++-style
  }sgx;

This is because, as our sample input shows, C has quoting constructs, and a quoted string can contain "/*" or "//" in it, without starting a comment. We might be tempted to try a regex such as:

  while ($code =~ m{
    " .*? " | ' .*? '  # quoted strings (ignored)
    |                  # or
    (
      /\* .*? \*/      # C-style
      |                # or
      // [^\n]*        # C++-style
    )
  }sgx) {
    push @comments, $1 if $1;
  }

But this falls prey to a common mistake made while parsing quoted strings. A quoted string usually has backslashing rules associated with it. The string "you \"won\" this time" has two quotes escaped with a backslash. However, our regex above doesn't pay attention to backslashes, and merely matches to the nearest quote.
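
Both failures are easy to reproduce. The following is a minimal sketch, assuming the sample input from the SYNOPSIS and a made-up C line held in the hypothetical variable $tricky; the array names @naive and @bogus are likewise only for illustration.

```perl
use strict;
use warnings;

# Sample input from the SYNOPSIS. The single-quoted heredoc keeps
# every backslash exactly as it would appear in the C file.
my $code = <<'END';
// the following is not a comment...
printf("/* comment */\n");
/* but this " is
 * and it spans two lines */
END

# Attempt 1: a pattern that knows nothing about string literals.
my @naive = $code =~ m{
  /\* .*? \*/   # C-style
  |             # or
  // [^\n]*     # C++-style
}sgx;
print scalar(@naive), " matches, including [$naive[1]]\n";

# Attempt 2: strings are skipped, but escaped quotes are not understood.
# $tricky is a hypothetical line of C source with an escaped quote.
my $tricky = 'puts("a \" /* not a comment */ b");';
my @bogus;
while ($tricky =~ m{
  " .*? " | ' .*? '  # quoted strings (ignored)
  |
  ( /\* .*? \*/ | // [^\n]* )
}sgx) {
  push @bogus, $1 if $1;
}
print "bogus comment: [$_]\n" for @bogus;
```

The first attempt reports three matches, one of them the "/* comment */" text inside the printf string. The second attempt ends the string at the escaped quote and then treats the rest of the string's contents as a comment.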

To match a quoted string with backslashes, we'll need a slightly more sophisticated regex.

  /"(?:[^"\\]+|\\.)*"/s
  # or
  /"[^"\\]*(?:\\.[^"\\]*)*"/s

The first matches either a run of characters that are neither quotes nor backslashes, or a backslash followed by any character, as many times as possible. The second uses the technique of "unrolling the loop", and runs a bit faster; it matches as many non-quotes and non-backslashes as it can, and then tries to match a backslashed character followed by more non-quotes and non-backslashes, as many times as possible.
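
As a quick check, both forms can be run against a line that contains escaped quotes. This is only a sketch: the variable names ($line, $alternation, $unrolled, @found) and the sample C line are assumptions made for the illustration.

```perl
use strict;
use warnings;

# C source text; Perl single quotes leave the \" sequences untouched.
my $line = 'puts("you \"won\" this time"); /* cheer */';

my $alternation = qr/"(?:[^"\\]+|\\.)*"/s;          # first form
my $unrolled    = qr/"[^"\\]*(?:\\.[^"\\]*)*"/s;    # unrolled form

my @found;
for my $re ($alternation, $unrolled) {
  # List-context match returns the captured string literal.
  my ($string) = $line =~ /($re)/;
  push @found, $string;
  print "$string\n";
}
```

Both patterns should capture the whole literal, escaped quotes included, rather than stopping at the first backslashed quote.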

Now we have a regex for matching a quoted (and escape-containing) string; we will use it for double- and single-quoted strings. Because character classes must be known at the regex's compile-time, we can't do

  /(["'])[^\1\\]*.../

because the engine interprets \1 in a character class as an octal escape. Here are the two regexes we'll use for strings:

  my $dbl = qr/"[^"\\]*(?:\\.[^"\\]*)*"/s;
  my $sgl = qr/'[^'\\]*(?:\\.[^'\\]*)*'/s;

And here are the comment regexes:

  my $C   = qr{/\*.*?\*/}s;
  my $CPP = qr{//.*};
  my $com = qr{$C|$CPP};

We can also include a regex to match everything else (in moderation):

  my $other = qr{.[^/"'\\]*}s;

This regex is useful, especially for large texts, for skipping over characters that aren't going to be matched by the other parts of the pattern. We'll group the quoting regexes and this one together as:

  my $keep = qr{$sgl|$dbl|$other};

Now we can do almost anything we want with a C or C++ file. The following regexes assume a C++ file, but if you want to use only C syntax, remove $CPP from $com.

  # remove comments
  $source =~ s/$com|($keep)/$1/go;
  # extract and remove comments
  $source =~ s{($com)|($keep)}{
    $1 ? push(@comments, $1) && "" : $2
  }ego;
  # extract comments
  $1 and push @comments, $1
    while $source =~ /($com)|$keep/go;
  # extract comments, too
  1 while $source =~ m{
    ($com) (?{ push @comments, $1 })
    | $keep
  }gox;
  # extract comments:  [ type, comment ]
  while ($source =~ /($com)|$keep/go) {
    my $c = $1 or next;
    push @comments, [ $c =~ s!^//!! ?
      ("C++", $c) :
      ("C", substr($c, 2, -2))
    ];
  }
  # remove comments, preserve newlines
  $source =~ s{($com)|($keep)}{
    if ($1) { "\n" x $1 =~ tr/\n// }
    else { $2 }
  }ego;
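
Putting the pieces together, here is a minimal self-contained sketch that runs the [ type, comment ] extraction and the newline-preserving removal on the sample input from the SYNOPSIS. It assumes the regexes defined above; $stripped is a hypothetical name, and the defined() tests are an addition that keeps the script quiet under "use warnings".

```perl
use strict;
use warnings;

my $dbl   = qr/"[^"\\]*(?:\\.[^"\\]*)*"/s;
my $sgl   = qr/'[^'\\]*(?:\\.[^'\\]*)*'/s;
my $C     = qr{/\*.*?\*/}s;
my $CPP   = qr{//.*};
my $com   = qr{$C|$CPP};
my $other = qr{.[^/"'\\]*}s;
my $keep  = qr{$sgl|$dbl|$other};

# Sample input from the SYNOPSIS.
my $source = <<'END';
// the following is not a comment...
printf("/* comment */\n");
/* but this " is
 * and it spans two lines */
END

# Extract comments as [ type, comment ] pairs.
my @comments;
while ($source =~ /($com)|$keep/g) {
  my $c = $1;
  next unless defined $c;            # a $keep chunk, not a comment
  push @comments, $c =~ s{^//}{}
    ? [ "C++", $c ]
    : [ "C", substr($c, 2, -2) ];    # drop the /* and */ markers
}

# Remove comments from a copy, keeping one newline per newline removed.
(my $stripped = $source) =~ s{($com)|($keep)}{
  defined $1 ? "\n" x ($1 =~ tr/\n//) : $2
}eg;

printf "%-3s [%s]\n", @$_ for @comments;
print $stripped;
```

On this input the loop should yield the two entries shown in the SYNOPSIS output, and $stripped should have the same number of lines as $source, with the printf line left intact.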


SEE ALSO

The \K escape (built into Perl since version 5.10, and available for older versions through Regexp::Keep) allows us to get rid of the "Use of uninitialized value" warnings that many of the regexes above give us, and could provide an increase in speed.