Sunday, May 22, 2016

Perl Retrospective

Perl was the first language I really did serious work in.  To this day, even though it’s dying, I still like it.  It’s different, which is at once both its major virtue, and its Achilles heel in the modern programming landscape.

Let’s talk about what makes Perl Perlish, starting with the good parts.

(Note that I use a lot of 5.10 features like say and maybe the // operator.  For the most part, assume a use v5.10; line before any code. I may not adhere to use strict; use warnings; in this post, but my production code does.)

Good: Statement Modifiers

Possibly the most interesting feature of Perl is its statement modifiers, in conjunction with implicit variables.  The current list of directories that will be searched for modules with use can be listed, one per line, in a single line of code:

say for @INC;

(say defaults to printing $_ and for ... as a modifier sets $_ to each element of the list in turn.)

Whereas in most other languages, you'd need to write a whole loop in “forward” form, and name the loop variable:

$inc = explode(PATH_SEPARATOR, ini_get("include_path"));
foreach ($inc as $dir) {
    echo $dir, PHP_EOL;
}

Note that it’s not all that common anymore to provide a println-style function that prints its arguments followed by a newline.  We have to add PHP_EOL ourselves.

Perl’s modifiers include all of the conditionals and loops that the language provides, including both ‘positive’ and ‘negative’ versions.

next unless $line =~ /fail/i;
++$hits if $line =~ /$MATCH_RE/;
copy($f, $dstdir) while defined($f = shift @ARGV) && ! $CANCELED;
sleep(1) until $GOT_SIGNAL;

Modifiers can make the code read more like human language.

die "object method" unless ref $_[0];
die "class method" if ref $_[0];

Guidelines on when to choose modifiers vs. regular statements are roughly, “Use modifiers if the action taken is more important than the condition, and the condition is simple enough.”

The other main advantage modifiers have in Perl in particular is that they are shorter.  Regular if statements and others must include braces, so you have your choice between:

die "object method" unless ref $_[0];

and

unless (ref $_[0]) {
    die "object method";
}

The latter is considered somewhat verbose for the simplicity of it, and would not be seen in idiomatic Perl.

Similar to the modifiers in the way it makes code read like human language, there are low-precedence boolean operators. &&, ||, and ! work roughly like C, while and, or, and not come in below every other operator.

$dbh = DBI->connect($dsn, $user, $pass, $options)
  or die "Database connection failure: $DBI::errstr";

In this case, or has lower precedence than =, so the error is thrown only if the variable gets a false value assigned to it.  (Well, technically this is a bad example, because using || would have the same effect: die would throw before any value could be assigned into $dbh.)

tl;dr for the section: modifiers and the low-precedence booleans enable code to be arranged in more natural-language-like ordering.  This can make simple code take up fewer lines than an if or similar statement, which requires braces.
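To see a case where || and or really do differ, here is a minimal sketch. The fetch helper is hypothetical, invented for illustration; it simply returns a false value to stand in for a failed call.

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper that always "fails" by returning a false value.
sub fetch { return 0 }

my @log;

# || binds tighter than =, so the fallback becomes part of the assigned value.
my $x = fetch() || 'fallback';

# or binds looser than =, so the assignment happens first, and the
# right-hand side runs only because the assigned value was false.
my $y = fetch() or push @log, 'fetch failed';

say "x=$x y=$y log=@log";
```

So || is the tool for picking a default value, while or is the tool for reacting to a failure after the assignment has already taken place.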

Good: Regular Expression Syntax

I’m coming around to the view that string manipulation isn’t something you want to do all the time, but having the regular expression operators as syntactic constructs is pretty nice too.  It continues to set Perl apart from most languages, other than Ruby.

There’s not much I really have to say about this, aside from pointing out some of the regular expression operators:

if ($x =~ m/foo/)    # if $x matches /foo/
$value =~ s/\s+\z//; # remove trailing whitespace
$rx = qr/bar/i;      # regex stored in variable
if ($x =~ /$rx/)     # using the stored regex

If $rx were to be printed, it would show up as a string like (?i-msx:bar) (or (?^i:bar) on Perl 5.14 and later), which is the original expression, wrapped in “local flags” so that /$rx/ is still a case-insensitive match of bar even though the outer pattern is case-sensitive.
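A small sketch of that behavior, using made-up sample strings: the flag stored in $rx applies only to the part of the pattern that came from $rx.

```perl
use v5.10;
use strict;
use warnings;

my $rx = qr/bar/i;   # case-insensitive, and the flag travels with the value

# The interpolated part keeps its own /i flag...
my $inner_ci = ('foo BAR' =~ /foo $rx/) ? 1 : 0;

# ...but the literal "foo" in the outer pattern is still case-sensitive.
my $outer_ci = ('FOO bar' =~ /foo $rx/) ? 1 : 0;

say "inner part ignores case: ", $inner_ci ? 'yes' : 'no';
say "outer part ignores case: ", $outer_ci ? 'yes' : 'no';
```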

Outside of that, my favorite approach is actually PHP’s preg_match() and related functions.  Getting a boolean matched/did not match in the return value and capturing groups via an output by-reference parameter is a pretty good approach.

I’m probably corrupted by my time writing constructs like:

if ($line =~ /f(o+)o/i && length($1) > 3) {
    say "long foo in line: $line";
}

Which is a straightforward translation to PHP:

if (preg_match('/f(o+)o/i', $line, $m) && strlen($m[1]) > 3) {
    echo 'long foo in line: ', $line, PHP_EOL;
}

It bothers me that it takes at least one extra line to do this in Python, because assignment is a statement, and can’t be part of an expression to if. Of course, Python also lacks output parameters.  It’s simple, but that simplicity restricts the design of interfaces.

Good: Extended Quote Syntaxes

Many languages distinguish between “normal” strings and “literal” ones to some degree.  PHP in particular inherits the distinction between double and single quoting:

"$thing" # the string value of the $thing variable
'$thing' # a string of exactly six characters

Perl offers this same distinction, but extended with a syntax that enables variable quote characters.  There are “quote operators” that act like the familiar single and double quotes, but enable the delimiting character to be chosen by the programmer.  Here are some examples, with their PHP equivalents:

qq{$thing} # "$thing"
q{$thing}  # '$thing'
qw(a b c)  # array('a', 'b', 'c')
qr#^/usr#i # "(?i-msx:^/usr)"
qx/date/   # shell_exec("date")

(preg_quote isn't really the equivalent of qr; corrected 2017-04-28.)

This also applies to the regular expression operators.  Very useful at times, for instance:

return -1 if $path =~ m#^/boot/grub#;
$element = qq{<div style="display:none;">O'Hare</div>};

Without being able to change the delimiter from the default /, all the slash characters in the path would have to be backslash-escaped in the first line. Likewise, double-quotes would have to be escaped in the second.

If you’ve wondered why preg_match() takes delimiters and flags in PHP, this is why.  It’s a feature imported from Perl.

The qw operator (for “quote words”) considers any run of non-whitespace as a single “word” and gets a lot of use on module imports, for either individual functions, or tags of functions:

use CGI qw(:standard);
use List::Util qw(min max);

It’s a curious limitation that Perl’s import system doesn’t allow things to be renamed on import, but it seems to work well enough in practice.

Good: Deep Unix Integration

Well, it’s not entirely good, because it made portability a bit harder.  Not everything ports to other platforms without restriction.  But the embrace of Unix made using it on Unix a smooth, seamless experience.

The language provides access to chmod, umask, or even utime (what you want when you’d run the touch command.) It has fork and exec, and system is there too.  So are backtick operators (equivalent to the qx operator) that run commands and capture their output.

It all makes putting together “Unix automation” tasks much more convenient, since the language already speaks that, and programmers aren’t stuck using only functionality common to all systems the language was designed for.
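As a rough sketch of what that looks like in practice, assuming a Unix-like system: the scratch file comes from the core File::Temp module, and the mode and timestamp values are arbitrary choices for the example.

```perl
use v5.10;
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file to operate on; removed automatically at exit.
my ($fh, $path) = tempfile(UNLINK => 1);
close $fh;

# chmod: owner read/write, group read.
chmod 0640, $path or die "chmod failed: $!";

# utime: set access and modification times to one hour ago,
# roughly what touch does with an explicit timestamp.
my $then = time() - 3600;
utime $then, $then, $path or die "utime failed: $!";

# stat: read the results back.
my @st    = stat $path;
my $mode  = $st[2] & 07777;   # permission bits only
my $mtime = $st[9];

printf "mode %04o, modified %s\n", $mode, scalar localtime $mtime;
```

None of chmod, utime, or stat needed an import; only the temporary-file helper did.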

It’s important to realize that these functions were also part of the global namespace, with no prefixes, and no imports required to use them.  There was no need to stop and fish in man pages to figure out whether the functions were part of stdlib.h or unistd.h or anything.

PHP’s file manipulation on Unix approaches the experience of Perl, but some of the other Unix-isms aren’t quite as perfect.  All of the functions like posix_kill depend on the posix extension being loaded, and of course, have the posix_ prefix.

(That’s one approach to portability: make system integration optional and provide functions like extension_loaded and function_exists to detect the presence of the feature.)

Weird: Indirect Object Syntax

What's the difference between these lines?

say STDERR "oh dear";
say STDERR, "oh dear";

The first prints to standard error. The second prints the standard error handle and the string to the currently selected output (STDOUT by default, naturally.)

The first is "indirect object syntax", where STDERR is the "object" being used in the call. IIRC, prior to 5.14, using IO::Handle made handles look even more like objects, by allowing STDERR->say("oh dear"); notation.

Anyway, the indirect object syntax is like a little party trick that comes out for print STDERR $x and new Foo::Bar... which, yes, invokes the "class method" Foo::Bar->new(). Any string used in object notation is treated as a package name and passed in place of a real object.

Mixed: Dynamic Scope and local

Perl has a bunch of global variables.

Perl also has a mechanism to change their current value within an executing block.  Witness:

sub dostuff {
    my ($fh) = @_;
    print scalar <$fh>;
}
dostuff(\*STDIN); # prints one line
{
    local $/;
    dostuff(\*STDIN); # prints rest of stream
}

In the second call, local has modified dostuff’s view of $/, the record separator; in this case, it’s undefined, so Perl won’t consider any character as a line break.  So the second invocation reads to end-of-file instead of end-of-line.

local shadows the global variable until control exits the lexical scope containing the local definition.  Creating a temporary, auto-restoring value for a global variable is a great tool to have for a language that has basic features such as “how a file reading operator works” controlled by global variables.
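Here is a self-contained sketch of the same idea that doesn't need anything on standard input. It reads from an in-memory file handle instead; read_record is a hypothetical name for illustration.

```perl
use v5.10;
use strict;
use warnings;

my $data = "one\ntwo\nthree\n";
open my $fh, '<', \$data or die "cannot open in-memory handle: $!";

# Reads one "record" using whatever $/ is in effect when it is called.
sub read_record {
    my ($h) = @_;
    return scalar <$h>;
}

my $first = read_record($fh);    # default $/ is "\n": one line

my $rest;
{
    local $/;                    # undef for this block and everything it calls
    $rest = read_record($fh);    # reads to end-of-file
}

# Outside the block, the previous value is back.
my $restored = (defined $/ && $/ eq "\n") ? 1 : 0;

print "first: $first";
print "rest:  $rest";
say   "separator restored: ", $restored ? 'yes' : 'no';
```

Note that read_record itself never mentions $/; the caller changed its behavior from the outside, which is both the power and the hazard of dynamic scope.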

But it’s a solution to a problem that Perl didn’t need to have.  It’s more of an accident of history and heritage that Perl has a “file reading operator” and global variables controlling it, than a fundamentally sound design.

(Basically, I’d consider an approach where file handles had their own copy of controlled state like “what is a newline” controlled through methods on the handle.  A lot like ioctl() but less cryptic.)

Mixed: Shortcuts

Perl has a lot of handy shortcuts.

my @h = qw(HUNDRED BILLION);
say @h;   # "HUNDREDBILLION"
say "@h"; # "HUNDRED BILLION"
{
    local $" = '~~'; # [corrected 2016-06-14]
    say "@h"; # "HUNDRED~~BILLION"
}

The second line prints each element of the array as-is, and the third line puts it in quotes to invoke the spacing shortcut.  The rest peels back the curtain a bit, to show that the shortcut is built on a magic global variable. The net effect is join($", @h).

But only in double-quotes.
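A short sketch that puts the interpolated form next to the explicit join, to show they produce the same string:

```perl
use v5.10;
use strict;
use warnings;

my @h = qw(HUNDRED BILLION);

my $spaced = "@h";                            # default separator: one space
my $custom = do { local $" = '~~'; "@h" };    # temporary separator
my $joined = join '~~', @h;                   # the explicit equivalent
my $after  = "@h";                            # separator is back to a space

say $spaced;
say $custom;
say $joined;
say $after;
```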

This kind of thing is handy, but also means that there's a lot to learn about Perl.

Overall, I’d say it also makes code harder to read, because one must be aware of all of the Perl features that a given block of code is using.  It’s not entirely obvious that there’s even a difference between @h and "@h".

On the other hand, it’s really convenient to have the shortcuts when they’re usable and effective.

Mixed: Prototypes

Perl has a lot of built-in functions that work in semi-magical ways, like how grep takes a block that returns whether the item matched some desired condition.

@files = grep { $_ !~ /^\.{1,2}$/ } readdir(DIR);

(The above filters ‘.’ and ‘..’ out of a directory listing, if you were wondering.)

“Prototypes” are Perl’s way to expose this sort of semi-magic to the common programmer:

sub doMany (&$) {
    my ($sub, $count) = @_;
    for my $c (1..$count) {
        $sub->($c);
    }
}
doMany { say "hi $_[0]" } 3;

This is great, but it immediately comes at a cost.  Without a prototype, a wrapper function can call doMany(@args) and as long as the first two items in the list are the correct type, doMany can run just fine.

Once the definition has a prototype, it will require callers to conform to the exact constraints of the prototype.  Unless they go to the trouble of calling it as &doMany(@args) (note the added ampersand.)

This is especially insidious when a prototype starts with $, because it will accept a list like @_ as an argument.  It’s just that the prototype will cause that list to be evaluated in scalar context, and the function will be passed only the number of arguments that were meant to be passed.  (More about context later.)
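A minimal sketch of that trap. The first_arg function and its ($;$) prototype are hypothetical, chosen only to show the effect:

```perl
use v5.10;
use strict;
use warnings;

# The leading $ in the prototype forces scalar context on the first argument.
sub first_arg ($;$) { return $_[0] }

my @args = ('a', 'b', 'c');

# The array is evaluated in scalar context, so the function sees its length.
my $with_proto = first_arg(@args);

# The ampersand form bypasses the prototype and passes the list as-is.
my $bypassed = &first_arg(@args);

say "with prototype: $with_proto";
say "bypassed:       $bypassed";
```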

Despite these downsides, the prototyping mechanics have been used to make incredibly useful things like Try::Tiny and basically half of the Dancer web framework.  Try::Tiny in particular allows for chaining prototyped functions it defines:

use Try::Tiny;
try {
    die "oh no!";
} catch {
    say STDERR "ignoring error: $_";
} finally {
    say "continuing on";
};

This uses the prototype system to define functions which appear to act like syntax extensions to the language.  This basically amounts to a killer feature of the system.

With one gotcha—note the trailing semicolon.  Since these are function calls under the hood, they need to be terminated like any other statement.

Mixed: Running Code at Compile Time

Perl has a set of magic blocks that have effects at different times.

  • BEGIN { … }: code in the block is run as soon as the closing curly brace is read, before the rest of the file is even parsed.
  • CHECK { … }: runs the code in the block after initial compilation is complete. CHECK blocks are run in reverse order of the interpreter encountering them.
  • UNITCHECK { … }: runs the code in the block after compilation of the current file is complete. Also in reverse order, but before CHECK blocks. Available since Perl 5.10.
  • INIT { … }: code in the block is run before the main program starts, after CHECK blocks. An INIT block causes a warning (and is skipped entirely) if it is encountered after the main program has started.
  • END { … }: code in the block is run after the main program has ended, in reverse order of the interpreter encountering them.
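The ordering above can be sketched with a script that records each phase as it runs. This assumes the code is the main program file; loaded late (for example through a string eval), the CHECK and INIT blocks would be skipped, as noted in the list.

```perl
use v5.10;
use strict;
use warnings;

our @order;

push @order, 'main';                   # ordinary runtime code

INIT  { push @order, 'INIT' }          # just before runtime begins
CHECK { push @order, 'CHECK 1' }       # after compilation, reverse order
CHECK { push @order, 'CHECK 2' }
BEGIN { push @order, 'BEGIN' }         # as soon as this block is parsed

say join ' -> ', @order;
```

Even though the BEGIN block is written last, it runs first, and the two CHECK blocks run in the reverse of their written order.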

BEGIN and CHECK/UNITCHECK happen at compilation time, which means they run when perl -c is used to “only” compile the code.  This means you can do amazing stuff, like fully parse your configuration file, check it for errors, and cache it as a data structure before runtime ever starts… but in that case, you’ll need a valid configuration file on any machine that wants to syntax-check the project.

I once made the mistake of connecting to the database at compilation time. Then, an entire database had to be available and populated for the syntax check.

Used appropriately, the idea of compile-time execution is great.  It’s the mechanic that underlies constants in Perl: they’re functions that return a constant value that’s known at compile time, which makes the parser replace them with the constant value instead of emitting a runtime function name lookup and call.

It’s also the mechanism that makes prototypes work.  use imports the module at compile time, which defines the function before the file that invoked use is parsed.  With the function defined, the compilation can use the loaded prototype to recognize code like try { … } finally { … }; without issuing syntax errors.

In other words, it’s right between “You can shoot yourself in the foot really easily with this” and “Your life can be much easier using this.”

Bad: Context

What's the difference between these two statements?

my ($one) = @stuff;
my $one = @stuff;

In the first case, $one is assigned the first element of the @stuff array.  In the second case, it is the length of that array.  The first provides “list context” to the evaluation of @stuff and thus returns its contents, while the second provides “scalar context” and returns a scalar, which, for arrays, is the number of elements.

Context, in conjunction with wantarray (which returns 1, 0, or undef for indicating whether the current function was invoked in list, scalar, and void context, respectively), can be a wickedly powerful feature when the situation calls for it.
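A small sketch of both halves of that: the two assignments, and a function that reports the context it was called in. The note_context name is hypothetical, made up for this example.

```perl
use v5.10;
use strict;
use warnings;

my @stuff = qw(a b c);
my ($first) = @stuff;    # list context: first element
my $count   = @stuff;    # scalar context: number of elements

# Records which context the caller supplied.
my $seen;
sub note_context {
    $seen = wantarray ? 'list' : defined(wantarray) ? 'scalar' : 'void';
    return;
}

my @l = note_context();  my $in_list   = $seen;
my $s = note_context();  my $in_scalar = $seen;
note_context();          my $in_void   = $seen;

say "first=$first count=$count";
say "$in_list, $in_scalar, $in_void";
```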

Unfortunately, that’s quite rare, and most of the time, the programmer is reminded that Perl has context by some stupid bug like taking the length of a list instead of getting items out of it.

Context bugs plagued the CGI module, since it was easy to call param() in list context, which meant that a list of all the values for a parameter would be returned instead of only the first one.  Consider this code:

my $hash = {
    one => param('one'),
    two => 2
};

A query string of one=1&one=six&one=6 would, when processed by that code, build a hash that unexpectedly contained a six => 6 item.
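To make that concrete without needing the CGI module, here is a sketch with a hypothetical stand-in for param() that behaves the same way for the query string above, along with the usual fix of forcing scalar context.

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical stand-in for CGI's param(), hard-coded to the values from
# one=1&one=six&one=6: all values in list context, the first in scalar.
sub param {
    my @values = (1, 'six', 6);
    return wantarray ? @values : $values[0];
}

# List context: the extra values spill into the hash as a key/value pair.
my %bad  = (one => param('one'), two => 2);

# Forcing scalar context keeps the hash the shape it was meant to have.
my %good = (one => scalar(param('one')), two => 2);

say join ',', map { "$_=$bad{$_}" }  sort keys %bad;
say join ',', map { "$_=$good{$_}" } sort keys %good;
```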

This tended to cause security issues for CGI-based apps so often that recent versions of the CGI module now warn when calling param in list context. The “all values” behavior is still available, without the warning, through the multi_param() function.

(CGI has been removed from core because it is basically not the way anyone should write a web app. It’s the PHP 3 of the Perl world, these days. But, I nonetheless had to deal with it.)

One of the other considerations of context is that code which calls code, like the Try::Tiny module’s whole point of existence, must preserve the context when invoking the blocks.  It has to use wantarray to check the context it’s executing in, then carefully provide that context to the called function so that it can pass the return value back appropriately.

Context is, as far as I know, unique to Perl.  Most of the time, list context is default, unless assigning to a scalar or using an operator like || that provides scalar context.  Or, of course, calling a function in void context.  Most of the time, the code you want to write does what you want, but sometimes, dropping some parentheses will be a major bug.

Context is something that’s pervasive, but mostly invisible, and as such, prone to being misunderstood when reading code that actually depends on the context.  It’s never obvious from a call site that a function being called will behave differently in different context.

As such, the one time I wrote a context-dependent function professionally, I put a lot of comments about what the function returns both at its definition and at each of the call sites.

tl;dr: expressions evaluate in a context and may return different results based on what that context is, for the same text evaluated.  This is hard. Let’s go shopping.

Bad: Sigils, Namespaces, References, and Context

The interaction of context and a few other things quickly turns confusing.

Check out this intentionally difficult code:

my ($h, @h) = ("TEN", "HUNDRED", "BILLION");
my %h = (one => 1, two => 2);
say "$h";              # "TEN"
say "$h{one}";         # "1"
say "@h";              # "HUNDRED BILLION"
say "@h{qw(one two)}"; # "1 2"
# same hash operations, but with a reference...
my $x = \%h;
say "$x->{one}";          # "1"
say "@{$x}{qw(one two)}"; # "1 2"

Quirk 1: the scalar, array, and hash namespaces are all separate.

Quirk 2: the sigil (the $ or other sign) used to access a variable depends on the context being provided for the access, not the variable’s “home” namespace.  So $h{one} accesses a single (scalar) value in %h, and @h{...} provides a list of values from %h.

Quirk 3: references are always scalar values, so hash references and other scalar variables won’t live in separate namespaces.  They’ll all end up in scalar-land as $-something.  When a different context is needed, the different context variable (with curly braces) comes out for the show.

I know all this. I’m downright fluent in it (admittedly, only after looking up that “slice of a hash ref” code on the web enough times.) I still mess it up sometimes. And trying to teach Perl to someone and watching him struggle with the “basics” of when to use arrows vs. when not to, and what sigils to use? It makes me think there’s a better way…

And that way doesn’t involve context being chosen by sigils.

I’m not sure what else to compare Perl to here.  Lisp-2 systems have the distinction between functions and variables (#'map vs. 'map) but there’s no other language that comes to mind with multiple namespaces like that.

Bad: Weak OO System

Perl doesn't specify much in the way of an object system. Any reference can be turned into an object with bless, which associates that reference with a package. Any functions defined within that package are then callable as methods.

By default, this includes any functions from modules that were imported, which is why namespace::clean and similar modules exist. They scrub those function names from the package's globals, moving them to lexicals instead so that the package can still call them. (For a very, very short explanation: lexical is private, scoped to the block/file that something appears in within the source code.)

There's no way to detect when a function has been called as a method, or vice versa, save for including runtime checks on whether "enough" arguments were provided, and whether the first argument is/isn't a reference. Objects must be references, because "class methods" like what is typically named new receive a string instead. The not-reference nature of that argument indicates that it is the class method, not a method that was called on an object.

The path of least resistance is to choose a hash reference and have all the properties be public. Like Python, there's a convention that a leading underscore means "please don't mess with this, it's a secret" but there's no enforcement.
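Putting the last few paragraphs together, a minimal sketch of a hash-based class with the runtime checks described above. The Point class is hypothetical, invented for this example.

```perl
use v5.10;
use strict;
use warnings;

package Point;   # hypothetical class, invented for this sketch

sub new {
    my ($class, %args) = @_;
    die "class method" if ref $class;        # was called on an object
    my $self = { x => $args{x} // 0, y => $args{y} // 0 };
    return bless $self, $class;              # associate the hash with Point
}

sub x {
    my ($self) = @_;
    die "object method" unless ref $self;    # was called on the class name
    return $self->{x};
}

package main;

my $p = Point->new(x => 3, y => 4);
say $p->x;       # method call: $p arrives as the first argument
say $p->{y};     # nothing stops outside code from reaching into the hash
say ref $p;      # the package the reference was blessed into
```

The checks only distinguish "reference" from "string"; they can't tell a method call from a plain function call that happens to pass the right kind of first argument.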

Taken all together, this lackadaisical approach has spawned a number of class systems as modules, and the darling has changed with the years. Moose is the current fashion, but it's also "heavy" (read: full-featured) and so there are a couple of "lighter" versions I've run across (Mouse and Moo), plus Any::Moose to try and use whichever one happens to be loaded/available. The dark side of "there's more than one way to do it" is that eventually, all the ways to do it will be piled into one code base. (Everyone was still figuring out "how to do events" as we went along, but the array of event systems we ended up with receives the same criticism from me.)

Bad: Lack of version information in documentation

(This isn't part of the language itself, but it's part of the experience of programming with Perl, so I included it.)

Perl assumes that, if you are reading the 5.10 documentation, you will be programming with 5.10 and do not need to know when things like UNITCHECK were added.  Back when I was using 5.6 and FreeBSD included 5.005 (the numbering system changed) in the base system, I would get occasional bug reports from users that things didn’t run on their platform.

PHP, on the other hand, is very clear about when things were added.  It’s entirely possible to take the current PHP 7.1 documentation and write for PHP 5.3.3, if you had to.

Approximately the one place in the system that mentions version numbers is in the documentation for use feature, because it spells out what features are available in which versions.  Sadly, those aren’t all the changes, and the page doesn’t go back prior to the debut of the feature module in 5.10.

On the other hand, there’s not much need to write code to Perls older than that anymore.  Still, writing for “5.16 and up” remains an annoying task where you can’t just look at 5.22’s documentation; you have to get 5.16 or use the online documentation for that version, if it works. (Spoiler: at time of writing, that is a 404 reached through their version selection UI.  5.16.1 works, though.)

Out of Scope: CPAN

CPAN was once considered to be the real secret to Perl’s success: a central module distribution platform that meant anybody could write modules that did what you needed.

In the years since CPAN appeared, pretty much every important language has gained a similar system for bundles and installers: PyPI (née the Cheese Shop, of course) and pip; RubyGems; LuaRocks; Packagist and Composer.

This has come full circle into Perl with cpanminus and Carton…
