PROBLEM

Truncating a string to n words.


SYNOPSIS

You have a chunk of text (perhaps the description of a book) and you want to truncate it to a specified number of words and append "..." to the end. If the text has that many words or fewer, do not modify the string.

Input

"Object Oriented Perl", by Damian Conway, is a must-have for any serious Perl programmer interested in write object oriented code. It is thorough, humorous, and detailed.

This description is paradoxical; it is not ten words long.

This book rocks.

Output

"Object Oriented Perl", by Damian Conway, is a must-have for...

This description is paradoxical; it is not ten words long.

This book rocks.


EXPLANATION

The main task here is determining what defines a "word". Once you've done that, you can solve the problem in (at least) three ways: match a pattern n times with a regex; split() the string on the non-word segments into the needed number of fragments; or substitute the rest of the string with "...".

For purposes of example, we will assume any chunk of non-whitespace is a "word"; this will include punctuation, but this is a suitable trade-off. In the code to follow, wherever you see \S, you should use a character class (or perhaps just a macro) to match whatever you call a "word", and wherever you see \s, you should invert that character class (or macro).
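
As a rough sketch of that swap, suppose a "word" is a run of letters, digits, underscores, apostrophes and hyphens. The names $word_char and $gap_char are made up for this illustration; they are just precompiled character classes dropped in where \S and \s would otherwise go.

```perl
use strict;
use warnings;

# Hypothetical definition of a "word" character; adjust the class to taste.
my $word_char = qr/[\w'-]/;     # stands in for \S
my $gap_char  = qr/[^\w'-]/;    # the inverted class, stands in for \s

my $str = q{"Object Oriented Perl", by Damian Conway, is a must-have};
my $n   = 3;
my $m   = $n - 1;

# Same shape as the \S / \s regexes used in this recipe.
my ($first) = $str =~ /($word_char+(?:$gap_char+$word_char+){0,$m})/;
print "$first\n";    # Object Oriented Perl
```

Note that with this definition the quotation marks and commas count as gaps, so they are no longer glued to the neighbouring words.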

The split() approach is probably the most logical solution, so it is presented first.

  # assuming $str is the string
  # and $n is the number of words to present
  # (grep drops the undefs the slice yields when $str has fewer than $n words)
  my @words = grep defined, (split ' ', $str, $n+1)[0 .. $n-1];
  my $new_str = join ' ', @words;

  # or, all at once
  my $new_str = join ' ',
    grep defined, (split ' ', $str, $n+1)[0 .. $n-1];

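This approach is straightforward: we split the string on whitespace, and join together the (at most) $n elements we want. The limit of $n+1 keeps split() from dividing up text past the nth word, which we are going to discard anyway.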
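
To see the split() idea end to end, here is a small self-contained sketch. The sub name first_words is invented for this example. It discards the unsplit remainder with pop instead of a slice, and it filters out the empty trailing field that split() keeps when a positive limit is given and the string ends in whitespace.

```perl
use strict;
use warnings;

# first_words() is a made-up helper name for this sketch.
sub first_words {
    my ($str, $n) = @_;
    # At most $n+1 fields: $n words plus one field holding the rest.
    # grep removes the empty trailing field a positive limit can leave behind.
    my @words = grep { length } split ' ', $str, $n + 1;
    pop @words if @words > $n;    # discard the unsplit remainder
    return join ' ', @words;
}

print first_words('This book rocks.', 10), "\n";    # This book rocks.
print first_words('  a  b   c d e ', 3), "\n";      # a b c
```

Nothing is appended yet; this only extracts the leading words.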

The second method uses a regex to extract the appropriate number of words (and the intermediate characters):

  my $m = $n - 1;
  my ($new_str) = $str =~ /(\S+(?:\s*\S+){0,$m})/;

You'll notice I didn't use

  my ($new_str) = $str =~ /(\S+\s*){0,$n}/;

because a quantified capturing group keeps only the last iteration of its sub-pattern, so $1 would hold just the final word. Another approach that I avoided was

  my ($new_str) = $str =~ /((?:\S+\s*){0,$n})/;

because it includes the whitespace after the last word, which I personally would not want.
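
The difference between these captures is easy to check on a short string. This is only a demonstration sketch: the variable names are arbitrary, and the two rejected patterns are written with \S+ so that each iteration consumes a whole word.

```perl
use strict;
use warnings;

my $str = 'one two three four';
my $n   = 2;
my $m   = $n - 1;

# A quantified capturing group keeps only its final iteration.
my ($last) = $str =~ /(\S+\s*){0,$n}/;
# Wrapping the repetition captures everything, trailing space included.
my ($padded) = $str =~ /((?:\S+\s*){0,$n})/;
# The form used in this recipe stops right after the last word.
my ($clean) = $str =~ /(\S+(?:\s*\S+){0,$m})/;

print "[$last]\n";      # [two ]
print "[$padded]\n";    # [one two ]
print "[$clean]\n";     # [one two]
```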

With either of these two methods, once we have our string, we merely append "..." to it. That leads us to our third solution, using substitution. It looks a lot like the previous method, since the left-hand side of the s/// is going to match the words, and then some.

  my $m = $n - 1;
  (my $new_str = $str) =~ s/(\S+(?:\s*\S+){0,$m}).*/$1.../s;

Here, I match the number of words, and store that in $1, and then let .* gobble up the rest of the string; the /s switch is included to allow . to match newlines. This regex, if it succeeds, ends up matching the entire string (except any leading whitespace, but that's not of importance). We then replace the match with $1 followed by the three periods.

Here, we have a chance to fine-tune the solution a bit. Consider a string of 10 words, where we want to truncate to 10 words. All these solutions will (quite mistakenly) append "..." to the non-truncated string. We need to make sure we don't modify the string if there's no need to.
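
A quick sketch of the problem, using the ten-word sample description from the Input section with the unconditional substitution:

```perl
use strict;
use warnings;

my $str = 'This description is paradoxical; it is not ten words long.';
my $n   = 10;
my $m   = $n - 1;

# Unconditional version: .* happily matches nothing at all.
(my $new_str = $str) =~ s/(\S+(?:\s*\S+){0,$m}).*/$1.../s;

print "$new_str\n";
# This description is paradoxical; it is not ten words long....
```

Nothing was cut, yet the string grew by three periods.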

The split() method is the most awkward of the three to correct; we have to keep the extra field that split() produces, and check whether there are more words than we want to accept before slicing:

  # grep drops the empty trailing field split() keeps when $str ends in whitespace
  my @words = grep length, split ' ', $str, $n+1;
  my $new_str = (@words > $n) ? join(' ', @words[0 .. $n-1]) . '...' : $str;

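The other two solutions merely require "tweaking" to correct. The regexes need to check that another word follows the nth word, and they must match exactly $n words rather than up to $n; otherwise the regex engine can backtrack to a shorter match and truncate a string that should have been left alone:

  # with matching
  my $m = $n - 1;
  my $new_str = $str =~ /(\S+(?:\s+\S+){$m})\s+\S/ ? "$1..." : $str;
  # with substitution
  my $m = $n - 1;
  (my $new_str = $str) =~ s/(\S+(?:\s+\S+){$m})\s+\S.*/$1.../s;

In both regexes, \s* became \s+ (so a single word cannot be split in two), {0,$m} became {$m}, and \s+\S (whitespace plus the start of another word) is required after the nth word. A string with $n words or fewer simply fails to match, so it is left untouched and gets no "...".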
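
Pulling the pieces together, one possible wrapper looks like the sketch below. The name truncate_words is hypothetical, and returning the string unchanged for $n < 1 is my own assumption about what should happen there. The regex asks for exactly $n whitespace-separated words and then uses a lookahead to confirm that another word follows, so strings of $n words or fewer never match.

```perl
use strict;
use warnings;

# truncate_words() is a hypothetical helper name, not from any module.
sub truncate_words {
    my ($str, $n) = @_;
    return $str if $n < 1;    # assumption: nothing sensible to truncate to
    my $m = $n - 1;
    # Exactly $n words, then proof (via lookahead) that another word follows.
    return $str =~ /^\s*(\S+(?:\s+\S+){$m})(?=\s+\S)/ ? "$1..." : $str;
}

my @descriptions = (
    q{"Object Oriented Perl", by Damian Conway, is a must-have for any serious Perl programmer.},
    'This description is paradoxical; it is not ten words long.',
    'This book rocks.',
);

print truncate_words($_, 10), "\n" for @descriptions;
# "Object Oriented Perl", by Damian Conway, is a must-have for...
# This description is paradoxical; it is not ten words long.
# This book rocks.
```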


SEE ALSO

Other approaches.