Just Another Perl Article

By japhy for the West Yorkshire PUG

Getting a Handle on Files

"Open, Sesame!"

If you've used Perl for a week, you're probably familiar with the task of opening a file, either to read from or write to it. Here's a simple refresher course for you -- some of it involves Perl 5.6, which lets you do some nifty things with open(). There are three basic operations you use a filehandle for: reading, writing, and appending. You can also read and write (or read and append) to files, and you can read from or write to a program (from its output, or to its input).
# error-checking would, of course, be used

open FILE, "filename";      # read
open FILE, "< filename";    # read (explicit)
open FILE, "> filename";    # overwrite
open FILE, ">> filename";   # append
open FILE, "+< filename";   # read and write
open FILE, "+> filename";   # read and overwrite (clobber first)
open FILE, "+>> filename";  # read and append
open FILE, "program |";     # read from program
open FILE, "| program";     # write to program
For safety's sake, the explicit forms should always be used, and with a space between the mode and the filename. Here's an example of why:
chomp(my $filename = <STDIN>);
open FILE, $filename;
This allows the user to pass anything from "< /etc/passwd" to "rm -rf / |" to your open() call, neither of which you'd be too happy to permit. For the same reason, using open(F, ">$filename") isn't enough either -- the user could slip an extra > in on you and cause you to append, rather than overwrite.

Perl 5.6 allows an even greater extent of control: a multi-argument form of open():
# open FILEHANDLE, MODE, EXPR

open FILE, "<", $filename;  # read from $filename
If you want to pipe to a program, the MODE should be "|-"; if you want to pipe from a program, the MODE should be "-|". When calling programs, you can send a list of arguments after the program name:
# open FILEHANDLE, MODE, EXPR, LIST

open LS, "-|", "ls", "-R";
That invokes ls with the -R switch (for recursive listing), and returns the output to Perl.
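
Here is a minimal sketch of the multi-argument forms in action, assuming a Unix-like system and a reasonably modern Perl. The awkward file name and the child command are made up for illustration; $^X is just the path of the perl that is currently running, used here so the example doesn't depend on any other program being installed.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# A scratch directory that is removed when the program exits.
my $dir = tempdir(CLEANUP => 1);

# A hypothetical troublesome name: a bare two-argument open(FILE, $filename)
# would treat the trailing "|" as a request to run a command. With a
# separate MODE argument, it is simply part of the file name.
my $filename = "$dir/notes.txt |";

open my $out, ">", $filename or die "can't write $filename: $!";
print $out "hello\n";
close $out or die "can't close $filename: $!";

open my $in, "<", $filename or die "can't read $filename: $!";
my $line = <$in>;
close $in;

# List form of a pipe open: the program and each of its arguments are
# passed separately, so no shell gets a chance to reinterpret them.
open my $pipe, "-|", $^X, "-e", 'print "from child\n"'
  or die "can't run $^X: $!";
my $reply = <$pipe>;
close $pipe or die "child failed: $! (status $?)";
```

Because the mode travels in its own argument, nothing in $filename can change what kind of open() you get.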

Finally, Perl 5.6 allows you to use an undefined lexical (a my variable) in the place of the filehandle. This allows you to use filehandles as variables more easily -- using them in objects, passing them to functions, etc.
for my $f (@listing) {
  open my($fh), "<", $f or die "can't read $f: $!";
  push @files, $fh;
}
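
To show why lexical filehandles are handy, here is a small sketch that stores them in an array and hands them to a function. The file names, their contents, and the next_line() helper are all invented for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Create two small sample files (hypothetical names and contents).
my @listing;
for my $name ("a.txt", "b.txt") {
  my $path = "$dir/$name";
  open my $fh, ">", $path or die "can't write $path: $!";
  print $fh "first line of $name\nsecond line\n";
  close $fh or die "can't close $path: $!";
  push @listing, $path;
}

# Takes any filehandle and returns its next line, without the newline.
sub next_line {
  my ($fh) = @_;
  my $line = <$fh>;
  chomp $line if defined $line;
  return $line;
}

# Each pass through the loop gets a fresh lexical, so every handle
# stored in @files is distinct.
my @files;
for my $f (@listing) {
  open my $fh, "<", $f or die "can't read $f: $!";
  push @files, $fh;
}

my @first = map { next_line($_) } @files;
close $_ for @files;
```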

Obfuscorner

If you only send a filehandle to open(), Perl will look for a package variable (not a lexical) of the same name, and use the value of that variable as the filename to open. A simple use of this is to open the program itself; since $0 holds the name of the program, you can simply write:
open 0;  # like:  open 0, $0

Whose Line Is It, Anyway?

Files are not made up of lines. Files are made up of sequential bytes. A "line" is a made-up concept which only applies to text files (who cares how many "lines" there are in a JPEG?). The standard definition of a line is a sequence of zero or more bytes ending with a newline. Whether that is \n or \r\n or \n\r is up to your OS to decide. But who cares about "lines"? Perl is more interested in records.

A record is a sequence of bytes separated from other records by some other sequence of bytes. A "line" is merely a record with a separator \n (or whatever). What good are records, though, if Perl keeps reading lines? Well, just tell Perl not to read a line!
open FORTUNE, "< /usr/share/games/fortunes/art";
{
  local $/ = "\n%\n";
  @fortunes = <FORTUNE>;
}
close FORTUNE;
This code makes use of the $/ variable -- the "input record separator" -- to change how much each read of <FORTUNE> does. Instead of stopping at "\n", it stops at "\n%\n" (the separator of my computer's fortune files). This means that we can read multiple "lines" at once. In fact, Perl has two special values of $/ explicitly for that purpose: undef, which reads the rest of the file as a single record, and the empty string "", which reads a paragraph at a time (records separated by one or more blank lines). In addition to the record-separator use of $/, you can set it to a reference to a positive integer, which means that you will read at most that many bytes on each read:
while (read(FILE, $buf, 1024)) { ... }

# is like

{
  local $/ = \1024;
  while ($buf = <FILE>) { ... }
}
If you're wondering why I continually local()ize $/, it is to make sure that the changes to $/ are restricted to where we want them. We don't want future filehandle-reads to be using the changed value.

The $/ variable is also used by chomp() -- this function doesn't just remove a newline from the end of its arguments, it removes the value of $/ from the end of them (if it's there).
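
The record-reading ideas above can be tried without any real files. This sketch assumes a Perl that supports in-memory filehandles (a reference to a scalar in place of the file name); the sample data is made up to resemble a tiny fortune file.

```perl
use strict;
use warnings;

# Sample data standing in for a fortune file.
my $data = "one\ntwo\n%\nthree\n%\n";

open my $fortune, "<", \$data or die "can't open in-memory file: $!";
my @fortunes;
{
  local $/ = "\n%\n";
  @fortunes = <$fortune>;
  chomp @fortunes;      # strips "\n%\n", the current value of $/
}
close $fortune;

# Fixed-size records: a reference to an integer reads up to that many bytes.
my $bytes = "abcdefghij";
open my $fixed, "<", \$bytes or die "can't open in-memory file: $!";
my @chunks;
{
  local $/ = \4;
  @chunks = <$fixed>;
}
close $fixed;
```

After this runs, @fortunes holds "one\ntwo" and "three", and @chunks holds "abcd", "efgh", and the leftover "ij".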

Outputting Records

There are a couple of variables related to printing records as well: the $\ variable (the output record separator) and the $, variable (the output field separator). The mnemonics for these two are rather simple: $\ is what you set instead of adding "\n" to the end of each print(), and $, is what gets printed wherever there is a comma in your print() statement. The fact that $\ and $/ share a mirrored character is not a mistake either -- they are related in that each is the other's opposite.

How are they useful? They let you be obscenely lazy. Let's say you're playing with the /etc/passwd file:
open PASSWD, "< /etc/passwd"
  or die "can't read /etc/passwd: $!";
open MOD, "> /etc/weirdpasswd"
  or die "can't write to /etc/weirdpasswd: $!";

$\ = $/;   # ORS = IRS = "\n"
$, = ":";  # OFS = ":"

while (<PASSWD>) {
  chomp;  # removes $/ from $_
  my @f = split $,;  # splits $_ on occurrences of $,
  # fool around with @f
  print MOD @f;
}

close MOD;
close PASSWD;
If we hadn't set $\ and $, in this code, the output file would have been one long line of fields, with nothing in between each field, and no way to separate one record from the next. However, since we have set them, we automatically append $\ to each print() statement, and automatically insert $, in between each argument to print(). Here's the explicit code that doesn't use these two variables:
while (<PASSWD>) {
  chomp;
  my @f = split ':';
  # fool around with @f
  print MOD join(':', @f), "\n";
}
While that may end up being clearer than the other, it's only that way because you've not been exposed to the variables. I'm sure before you learned how to use $_, your code was a lot more verbose; but once you embrace that default variable, code like
for my $line (@lines) {
  chomp $line;
  my @fields = split /=/, $line;
  for my $f (@fields) { $f =~ s/->/: /; }
  # ...
}
became code like
for (@lines) {
  chomp;
  my @fields = split /=/;
  for (@fields) { s/->/: / }
  # ...
}
It's the same with these other variables.
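
If you'd like to watch $\ and $, do their work without touching /etc/passwd, here is a sketch that prints to an in-memory filehandle instead. The two sample records and the uc() tweak are made up, and both variables are local()ized so the change doesn't leak into the rest of a program.

```perl
use strict;
use warnings;

my $out = '';
open my $fh, ">", \$out or die "can't open in-memory file: $!";

{
  local $\ = "\n";   # appended after every print()
  local $, = ":";    # inserted between print()'s arguments

  for my $record ("root:x:0", "daemon:x:1") {
    my @f = split /:/, $record;
    $f[0] = uc $f[0];   # fool around with @f
    print $fh @f;
  }
}
close $fh;
```

When this finishes, $out contains "ROOT:x:0\nDAEMON:x:1\n", with no join() and no explicit "\n" anywhere in the loop.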

While We're Being Lazy...

There's no variable that symbolizes the default filehandle to print to -- if you print() with no filehandle mentioned, Perl assumes you mean to print to STDOUT.

Well, not necessarily. The default output handle can be changed. Its default value is STDOUT, but you can change that with the select() function:
print "to stdout\n";
my $oldfh = select MOD;
print "to mod\n";
select $oldfh;
print "to stdout\n";
Assuming you start out with STDOUT as your default output handle, the code runs as described. The select() function (in the single argument form) takes a filehandle, sets it as the default, and returns the previously select()ed filehandle.

You can call select() with no arguments, and it will merely return the current default filehandle (as an information source).
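
Here is a self-contained sketch of the same select() dance, with an in-memory filehandle standing in for MOD so you can inspect what was "printed". The variable names are just for illustration.

```perl
use strict;
use warnings;

my $captured = '';
open my $mod, ">", \$captured or die "can't open in-memory file: $!";

my $oldfh = select $mod;   # $mod becomes the default output handle
print "to mod\n";          # no handle named, so this goes to $mod
select $oldfh;             # put the previous default back
close $mod;
```

Only the print() between the two select() calls lands in $captured; anything printed afterwards goes wherever it went before.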

Huffering, Puffering, and Buffering

Another useful filehandle variable is $|, the autoflush variable. This variable is unique for each filehandle -- output to STDERR is flushed automatically, but output to STDOUT is not. This variable is a true boolean -- it either holds a true value (which gets stored as 1) or a false value (which gets stored as 0).

Buffering is the process of storing output until a certain condition is reached (such as a newline being encountered). When a buffer is flushed, its contents are emptied. Where do they go? Well, to the filehandle proper. A buffer is a temporary holding location between the process generating the output and the place the output will appear.

Like I said, each filehandle has its own buffer control. To set the autoflush variable for a given filehandle, you have to use select(), or the standard IO::Handle module's autoflush method.
# turn on autoflushing for OUT
{
  my $old = select OUT;
  $| = 1;
  select $old;
}

# another way, using IO::Handle
use IO::Handle;
autoflush OUT 1;
The IO::Handle module offers many helpful methods for filehandles (which are internally objects of the IO::Handle class). You might want to see what else it has to offer that you might want to use.
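
As a quick check that $| really is per-filehandle, this sketch uses the method-call spelling of autoflush on a lexical handle and peeks at $| before and after. The autoflush_state() helper and the scratch file are assumptions made for the example, not part of IO::Handle.

```perl
use strict;
use warnings;
use IO::Handle;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/out.txt";
open my $out, ">", $path or die "can't write $path: $!";

# Reports the $| setting for one handle by selecting it briefly.
sub autoflush_state {
  my ($fh) = @_;
  my $old = select $fh;
  my $state = $|;
  select $old;
  return $state;
}

my $before = autoflush_state($out);   # a freshly opened handle is buffered
$out->autoflush(1);
my $after  = autoflush_state($out);

print $out "flushed right away\n";
my $size = -s $path;   # already non-zero, although $out is still open
close $out or die "can't close $path: $!";
```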

You can make your own per-filehandle variables via the Tie::PerFH module, available on CPAN.

Obfuscorner

In the evil Perl spirit of "there's more than one way to do it", there's an obfuscated way to turn on autoflushing for a filehandle. It combines the three lines (save the old handle, set $|, restore the old handle) into one:
select((select(OUT), $|=1)[0]);
The dissection of this code is as follows:
  1. select(OUT) makes OUT the default handle and returns the previous handle
  2. $| = 1 sets autoflush to true, after the select(OUT) has been executed
  3. (select(OUT), $|=1)[0] is a list slice -- it takes the first element of the list (select(OUT), $|=1), which is the value returned by select(OUT) (the previous filehandle)
  4. select(...) makes that value the default filehandle -- and what is ...? it's the first element of the list (described above)
Delightfully icky!

Another trick is to take advantage of the fact that $| is always either 0 or 1. If it's 0, and you subtract 1, -1 is transformed into 1. Subtracting 1 again gives you 0 again. Thus, $|-- is a built-in flip-flop!
# alternate indenting and not indenting lines
for (@data) {
  print "  " x $|--;
  print "$_\n";
}
This doesn't work with $|++... can you see why?
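
If you want to see the flip-flop without printing anything, this small sketch records what $|-- returns on four consecutive uses. It local()izes $| first so the currently selected handle gets its original setting back afterwards.

```perl
use strict;
use warnings;

my @seen;
{
  local $| = 0;                  # start from a known state
  push @seen, $|-- for 1 .. 4;   # each use returns the old value, then flips it
}
```

Here @seen ends up as (0, 1, 0, 1), which is exactly the pattern that indents every other line in the loop above.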

The Magic of <>

The final mystery revealed is a lengthy one. We all know we can read input via <STDIN>. But what about the mysterious empty diamond operator, <>? What does it do, and how can we interact with its magic?

The empty diamond operator is related to @ARGV, $ARGV, the ARGV filehandle, the ARGVOUT filehandle, and $^I. You probably know one of these (@ARGV) already. The others will soon be made clear. First, here's a sample program:
#!/usr/bin/perl -w

# inplace.pl ext code [files]
# ex: inplace.pl .bak '$_ = "" if /^#/' *.pl

use strict;

$^I = shift;
my $code = shift;

while (<>) {
  eval $code;
  print;
}
All of these symbols are strict-safe. Knowing this, our code can be written rather explicitly. You're about to see why Perl is so nice to you.
#!/usr/bin/perl -w

use strict;

my $ext = shift;
my $code = shift;

@ARGV = '-' unless @ARGV;

FILE:
while (defined($ARGV = shift)) {
  my $backup;

  # if we're not working with STDIN...
  if ($ARGV ne '-') {
    # get backup filename
    if ($ext =~ /\*/) { ($backup = $ext) =~ s/\*/$ARGV/g }
    else { $backup = "$ARGV$ext" }

    # try renaming file
    rename $ARGV => $backup or
      warn("Can't rename $ARGV to $backup: $!, skipping file.") and
      next FILE;
  }

  # with STDIN, there's no real backup done
  else { $backup = '-' }

  open ARGV, "< $backup" or
    warn("Can't open $backup: $!") and
    next FILE;

  # if we're not dealing with STDIN,
  # but $backup is $ARGV, we're doing real
  # in-place editing, so we use a Unix trick:
  #   * open the file for reading
  #   * unlink it
  #   * open the file for writing
  # this is a miracle, but it fails in DOS :(

  if ($backup ne '-' and $backup eq $ARGV) {
    unlink $backup or
      warn("Can't remove $backup: $!, skipping file.") and
      next FILE;
  }

  open ARGVOUT, "> $ARGV" or
    warn("(panic) Can't write $ARGV: $!, skipping file.") and
    next FILE;

  while (<ARGV>) {
    eval $code;
    print ARGVOUT;
  }

  close ARGVOUT;
  # note: we don't close ARGV!
}
Aren't you glad Perl does all that hard work for you?

Now that you know about these symbols, you can use some of them to your advantage. Here's a bit of code that prints each line of input with the source and the line number in front of it. Notice, though, that since the code that Perl uses never closes ARGV, the $. variable never gets reset to 0. That means the line count keeps increasing:
while (<>) {
  print "$ARGV ($.): $_";
}
If we have two files, a.txt and b.txt whose contents are "abc\ndef\nghi\n" and "jkl\nmno\n" respectively, this program outputs:

a.txt (1): abc
a.txt (2): def
a.txt (3): ghi
b.txt (4): jkl
b.txt (5): mno
Now, what if we want the line number to be reset for each new file? We need to be able to detect the end of the file. We can do that with the eof() function! There are two ways we can use the function for detecting the end of each input:
while (<>) {
  print "$ARGV ($.): $_";
  close ARGV if eof;  # reset $.
}

# or

while (<>) {
  print "$ARGV ($.): $_";
  close ARGV if eof(ARGV);  # reset $.
}
If you don't use any parentheses, and don't send an argument, Perl will check the last filehandle read from. If you send an argument, it checks that filehandle. "But japhy! What about eof()?" you ask? Well, that's a very special case. If you want to know when you've reached the end of all the input, you can use eof():
while (<>) {
  print "$ARGV ($.): $_";
  print "==end==\n" if eof();  # after ALL data
}
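
To try the per-file reset without typing file names on the command line, here is a sketch that builds the two sample files in a scratch directory and points @ARGV at them. It collects the report in an array rather than printing it, and uses basename() only to keep the scratch directory out of the labels; both choices are just for this example.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# The same two sample files described above.
my @samples = (["a.txt", "abc\ndef\nghi\n"], ["b.txt", "jkl\nmno\n"]);
for my $sample (@samples) {
  my ($name, $text) = @$sample;
  open my $fh, ">", "$dir/$name" or die "can't write $dir/$name: $!";
  print $fh $text;
  close $fh or die "can't close $dir/$name: $!";
}

# <> reads whatever is named in @ARGV, so point it at the sample files.
@ARGV = map { "$dir/$_->[0]" } @samples;

my @report;
while (<>) {
  chomp;
  push @report, basename($ARGV) . " ($.): $_";
  close ARGV if eof;   # resets $. before the next file starts
}
```

With the close in place, b.txt's lines are numbered 1 and 2 instead of 4 and 5.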

Lazy Loops

In addition to the -i switch, Perl offers switches like -n and -p, which construct loops around the source of your code:
perl -ne 'print if /foo/' files
# becomes
perl -e 'while (<>) { print if /foo/ }' files

perl -pe 's/foo/bar/' files
# becomes
perl -e 'while (<>) { s/foo/bar/ } continue { print }' files
You can use -p with -i to write a simple one-liner file editor:
# keep backups
perl -pi.bak -e 's/PERL/Perl/g' files

# don't keep backups
perl -pi -e 's/PERL/Perl/g' files
Why do you think you have to say -pi -e, and can't use -pie?

Email comments to japhy@pobox.com