Monday, February 28, 2011

Variable scope, require, and use

I ran into some interesting problems in Perl, which prompted more learning about the require/use mechanisms and how constants are interpreted.  In this post, I'll lay out some general terms about variable scoping, such as lexical scope, dynamic scope, the differences between them, and how they all interact in Perl.  And then I'll cover require and use with that foundation in place.

If you've been wondering about lexicals or closures, this is your post.  I've tried to lay things out more or less from basic principles, despite the verbosity of the approach, because this has taken me years to understand.  I started programming with Perl in 2000 and still learned a bit more about it today.  Yes, it's 2011 now.  Hopefully, you can read this and get it in less time.


Lexical Scope

With lexical scope, the variables visible to a certain block of code depend on the physical layout of the source code.  A lexical variable can be used by code in the same block scope; typically in Perl, this is a file, a subroutine, an if/else structure, a loop, or a bare block.  A package statement does not start a new lexical scope by itself, but since "a package" is often synonymous with an entire file, the two are easy to conflate.

An example is helpful here.  Note that lexical variables (variables to which lexical scoping rules are applied) are defined in Perl by the my keyword.  The standard use of lexicals in Perl is for variables local to a subroutine:
$x = 10;
sub foo { my $x = 13; print "In foo, it's $x\n"; }
print "At first, x is $x.\n";
foo();
print "Back outside, it's $x.\n";
This will inform you that x is 10, 13, and 10.  foo's definition of my $x restricts the use of that variable to the scope of foo's declaration.  Instead, suppose the file declares $x with my, and the sub does not use my:
my $x = 10;
sub foo { $x = 13; print "Of course, foo sees $x.\n"; }
print "At first, x is still $x.\n";
foo();
print "Afterward, it is $x.\n";
In this case, you'll see that foo was able to change the variable declared outside of it.  How?  The sub foo is defined within the scope where $x was declared lexical, in this case the file.  If you move the sub foo into a separate file called foo.pl:
sub foo { $x = 13; print "Of course, foo sees $x.\n"; }
1;
And then change the original to require foo.pl:
my $x = 10;
require 'foo.pl'; # THIS IS BAD CODE
foo();
print "With require, x is $x after calling foo.\n";
Now you'll see the final value as 10 again.  The required file is compiled in its own lexical scope, so foo can no longer see the my $x declaration in the main script.  Note the bad code comment; this is not a good way to use require, and I'll explain why later in this post.  I'm using it here simply to demonstrate the point.

Since foo.pl isn't modifying the lexical, what variable gets changed?  It's actually the global $x, since variables in Perl are global by default.  If you were to print the value of $main::x before the require line and again after the call to foo(), the values would be undef and 13, respectively.  (The require only defines foo; the assignment doesn't happen until foo is called.)

Where does 'closure' come into lexical scope, you ask?  This is your program:
sub series {
  my $x = 0;
  return sub { return ++$x; };
}
my $seq = series();
for my $i (20..24) { print $seq->(), "\n"; }
Well, this uses a lot more of Perl than the previous examples, but the output is pretty simple: 1 through 5.  What's going on?  The call to series() returns an anonymous sub, which we store in $seq as a code ref; then, we can call it with $seq->() where the parentheses are a standard argument list, which is empty in this case.  The print is just printing the value of that call, which is the next value of $x.  The loop variable runs from 20 to 24 just so you can see that the loop variable is not involved in the printing at all.

The anonymous sub is allowed to see the my $x that series defined, and it's always allowed to see it—even when the sub is passed out to code that cannot see it.  This is what lexical scope is all about: the variables visible to you, based on your location in the source, remain visible to you, and nobody from outside the scope can affect them.  (Unless your variable holds a reference to some data that they can modify.)

To differentiate variables like the last example's $x which are used in, but not defined within, a specific scope, the inner scope (the anonymous sub stored in $seq) is said to have "closed over $x", and $seq itself "is a closure" because it has closed over something.  The notion of closure separates purely local variables which are re-initialized on each call (as in the first example) from variables that are defined outside of a function, which can be consistent from invocation to invocation.  The variable doesn't live within the function's definition, so it is neither destroyed nor reset when the function returns or is called again.
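To make the "closed over" part concrete, here is a small sketch that calls series twice.  The names $first and $second are mine, purely for illustration.  Each call to series creates a brand-new $x, so the two closures count independently:

```perl
use strict;
use warnings;

# Same series() as above: every call creates a new $x and returns
# an anonymous sub that has closed over that particular $x.
sub series {
    my $x = 0;
    return sub { return ++$x; };
}

my $first  = series();
my $second = series();

my @seen;
push @seen, $first->() for 1 .. 3;   # advances only $first's $x
push @seen, $second->();             # $second's $x is still untouched

print "@seen\n";                     # prints "1 2 3 1"
```

Neither counter can be reached from outside except by calling the corresponding code ref, which is exactly the privacy that lexical scope promises.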

There are two last Perl-specific things I want to point out before continuing.  For one, even when a local variable is defined, the global is still accessible under the fully-qualified name:
$x = 10;
sub foo { my $x = 13; print "my x is $x but yours is $main::x\n"; }
foo();
This will show the values 13 and 10.

Also, Perl offers the our keyword to "undo" the effects of a my:
$x = 10;
{
  my $x = 13;
  print "my x is $x\n";
  {
    our $x;
    print "our x is $x\n";
    $x = 42;
  }
  print "x is $x, I swear\n";
}
print "x is really $x\n";
This example will show values of 13, 10, 13, and 42.  I've used unnamed blocks here because they're more convenient for this example than trying to nest subs for no reason.  our establishes that, for its lexical container, the variable should reference the global variable and value.

Perl 5.10 added a couple of things: one is a way to opt in to new language features (the feature pragma), and another is the 'state' feature itself.  (See perldoc feature for more in-depth information on features and their usage.)  state variables are similar to my variables that were declared in a separate scope just outside the function.  Only the current function can access them, and their values don't get reset when the function is called again.  Our example from above with series and seq could be written as:
use 5.010; # activate all 5.10 features
sub serial {
  state $x = 0;
  return ++$x;
}
for my $i (20..24) { say serial(); }
There are important differences.  In this example, there can only be one counter: you wouldn't be able to establish multiple independent sequences with multiple calls into the sub series that the other example featured.  Another important difference is initialization: a state variable is initialized only once, the first time execution reaches its declaration, whereas a my variable is initialized every time its declaration is executed.

Finally, use 5.010; is not a typo.  It's just the old version number format with 3 digits for the minor version (what perl calls "version" when reporting revision/version/subversion.)

Dynamic scope

In contrast to lexical scope, dynamic scope cares nothing for source code layout.  Instead, it's affected by the run time.  Perl's "global" scope is technically dynamic, and actual dynamic scoping is created by the local keyword.  How does it work?  Let's look at an example:
$x = 10;
sub foo { local $x = 42; bar(); }
sub bar { print "x seems to be $x\n"; ++$x; }
bar();
bar();
foo();
bar();
This shows the values 10, 11, 42, and... 12!  Each time it's called, bar increments the "global" variable.  However, foo establishes a new value for the variable, for itself and any function it calls, and this value disappears again when foo returns (technically, when the block containing the local exits).  So bar updated it to 43, but then it and foo returned, and the original variable (whose value was then 12) came back into effect.

foo would not have been able to affect any variables declared inside bar as lexicals with my, but by using local to change the dynamic scope, foo can affect bar's view of the global variables.  This would happen regardless of whether bar is in a different file.  It's also possible to affect another package, if bar explicitly references the global in foo's package, or if foo calls local on a punctuation variable like $/.  Those are always forced into package main, whatever the current package is.

Dynamic scope is "dynamic" because it can change from call to call, depending on whether any callers in the current call chain have used local or not.  There's nothing stopping a variable from being localized several times in the chain, either.  Under dynamic scope, the visible variables can be redefined by action outside your location in the source.  When bar() updates $x, it is not guaranteed to be updating the $x defined at the top of the file.
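Here is a minimal sketch of a variable being localized more than once in the same call chain.  The names $depth, show, inner, and outer are made up for this illustration.  What show sees depends entirely on who called it:

```perl
use strict;
use warnings;

our $depth = 0;    # a package global, so local can work on it

# show() reads whatever value is current for this call chain.
sub show  { return $depth; }

# Each of these installs a temporary value that lasts until it returns.
sub inner { local $depth = $depth + 1; return show(); }
sub outer { local $depth = $depth + 1; return inner(); }

my @results = (outer(), inner(), show());
print "@results\n";    # prints "2 1 0"
```

The source of show never changes, but its answer does: 2 when reached through outer and inner, 1 through inner alone, and 0 when called directly.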

Perl's man pages note (or used to note) that if you're unsure, use my to define a local variable for your subroutine, not local.  Read helpfully, that is an instruction to use lexical scope.  Dynamic scope turns out to be bad for larger systems, as your functions not only communicate through globals, but the "real" variable can be hidden and updated with a "fake" value.  Alternatively, forgetting to set (or not knowing that you need to set) a particular "fake" value before calling some subroutine can lead to surprisingly different results from a function call that appears to be the same: the argument lists may be the same, but all of the state the function relies on is not.

Thus, you most often see local used inside Perl to redefine one of Perl's special global variables, like $/, for the duration of a block in order to set it to a known value without disturbing the view of that value from the rest of the program.  Such code rarely calls down into other functions, since it needs to have the variable set for its own work.  Also, Perl only allows for lexicals that are alphanumeric, so local is the only way to apply a temporary value to one of the special variables.
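The classic case is "slurping" a whole file by temporarily undefining $/.  This is only a sketch: the slurp helper and the in-memory filehandle are my own choices for keeping the example self-contained, not anything special.

```perl
use strict;
use warnings;

# Read everything remaining on a filehandle in one go.
sub slurp {
    my ($fh) = @_;
    local $/;                 # undef until this block exits: no record separator
    my $content = <$fh>;      # so a single read returns the whole stream
    return $content;
}

my $text = "line one\nline two\nline three\n";
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
my $all = slurp($fh);
close $fh;

# Outside slurp, $/ is back to its default newline.
print "read everything: ", ($all eq $text ? "yes" : "no"), "\n";
print "separator restored: ", ($/ eq "\n" ? "yes" : "no"), "\n";
```

Because the local is confined to slurp, any other code reading line by line is unaffected.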

The last thing to note about globals is that they are actually package global variables.  Take the following code:
$x = 12;
package Foo;
$x = 16;
sub foo { print "Foo's global x is $x\n"; }
package main;
sub print_x { print "main's x is $x\n"; }
Foo::foo();
print_x();
Run as shown, this prints "Foo's global x is 16" and then "main's x is 12": each package has its own $x.  Note that for the sake of illustration, I've crammed everything into one file.  You wouldn't do this in real code.  An interesting side effect I discovered while testing this example is that the effect of our lasts to the end of the enclosing block or file (or the next declaration of the same variable, as discussed below), because our is lexically scoped and a package statement does not start a new scope.  So if I define our $x = 16; inside package Foo, then the print_x subroutine uses it instead of $main::x!
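A short sketch of that our side effect, under strict this time.  The helper names bare_x and main_x are made up for the example:

```perl
use strict;
use warnings;

$main::x = 12;     # main's package global, fully qualified

package Foo;
our $x = 16;       # from here on, a bare $x means $Foo::x

package main;      # changes the current package, not the lexical scope

# The our declaration above is still in effect here.
sub bare_x { return $x; }
sub main_x { return $main::x; }

print "bare \$x is ", bare_x(), ", \$main::x is ", main_x(), "\n";
# prints "bare $x is 16, $main::x is 12"
```

Wrapping each package in its own block, or keeping one package per file, confines the our and avoids the surprise.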

Perl scope resolution

All of this has been implied already, but I want to bring it together here.  When an unqualified $x is used, how does Perl decide which variable that actually means?  It simply travels up the lexical scope chain, looking for a my or our declaration that applies for that variable.  If it is a lexical (declared with my or state), then it's done—Perl just uses that variable.

If there is another lexical of the same name earlier in the current scope, or at a more outer scope, then that earlier/outer variable becomes inaccessible from the later declaration forward.  This is known as the later variable shadowing the former.

If the variable is found and was declared as global with our, or not found in any lexical scope, then it is considered a global, which is taken from the dynamic scope.

Using local does not affect the resolution of a variable.  It must be global where the local is invoked, or else you'll see an error at compile time, "Can't localize lexical variable $x".

In other words, my and our lexically control whether a variable is lexical or global, respectively; local provides a mechanism to shadow global variables.

Why (and how) "use strict" complains so much

One goal of use strict (and the only goal of use strict 'vars') is to prevent you from unintentionally using a global variable when you meant to access a lexical one.  When strict is in effect, traversing the entire lexical scope chain without finding a declaration of the variable with my or our triggers a compile-time error (Global symbol "$x" requires explicit package name) instead of falling back to global scope.  Fully-qualified variables aren't affected, because the full qualification implicitly means they are globals.  Lexical variables do not have a fully-qualified name.
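To see the difference without killing the script, the snippets can be compiled at run time with a string eval, which leaves any compile error in $@.  This is just a sketch; $undeclared and the two result variables are names I made up.

```perl
use strict;
use warnings;

# Without strict 'vars', an undeclared variable quietly becomes a global.
my $ok = eval q{ no strict 'vars'; $undeclared = 1; 1 };
my $without_strict = $ok ? "compiled" : "failed: $@";

# With strict 'vars', the same line refuses to compile.
$ok = eval q{ use strict 'vars'; $undeclared = 1; 1 };
my $with_strict = $ok ? "compiled" : "failed: $@";

print "without strict: $without_strict\n";
print "with strict:    $with_strict\n";
```

The second message should begin with Global symbol "$undeclared" requires explicit package name; the exact wording after that varies a little between Perl versions.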

Doing it wrong: "require $target" and "require 'file.pl'"

In the days before the package and module systems were invented for Perl 5, the way to load a library was through giving the path to the filename as a string to require.  This still works today, but is widely considered wrong, because of several subtle problems that it causes.  These problems are probably what led to its replacement with the modern module system.

The first problem is one of scope.  Using require this way will effectively import everything in the required file into your own scope, whether you wanted it or not—but only for global variables!  Lexicals remain local to their respective files.  It's easy to miss the distinction, or worse, have multiple files including the code, some of which have a global variable defined as lexical.

The second problem appears to be one of scope as well, but isn't.  Constants defined inside a file that has been loaded with require may not be visible to the file that performed the require.  Perl's constants are really constant, because they're determined at compile time.  If a constant wasn't defined at compile time, then its name is parsed as a bareword, which (without strict 'subs') is treated as a string containing the "constant" name itself.  And to make this distinction important, require is not normally executed until run time.  By then, the file doing the require has already been compiled without the constants set, and does not get recompiled after the require completes.  One quick patch is to use BEGIN { require "foo.pl"; } which will force the require to occur at compile time, which will define the constants in time for them to actually be useful.
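The timing issue can be reproduced in a single file.  In this sketch a string eval stands in for the run-time require, and LIMIT is a hypothetical constant; the point is only that code compiled before the definition keeps treating the name as a string.

```perl
use strict;
use warnings;
no strict 'subs';    # allow barewords, as old pre-strict code would

# Compiled before LIMIT exists, so this is just the string 'LIMIT'.
my $early = LIMIT;

# Stand-in for a run-time require of a file that defines the constant.
eval 'sub LIMIT () { 42 } 1' or die $@;

# Compiled after the definition, so LIMIT is now the constant.
my $late = eval 'LIMIT';

print "early: $early, late: $late\n";    # prints "early: LIMIT, late: 42"
```

With strict 'subs' left on, the early use would be a compile-time error instead of a silent string, which is at least easier to notice.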

A third problem is one of paths.  To follow the real-world example, if a Web server is set up to serve *.pl as CGI scripts, and this Web server changes directory to the running script, then the scripts loaded with code like require '../lib/site.pl'; cannot readily find the path to require more code from the lib directory themselves.

A fourth problem is one of Perl's magic: require records an entry in %INC after loading the file, which means that the same file required twice will not be loaded the second time.  If you really did mean to run the file a second time under the same path name, silently skipping it is almost certainly not what you want.  Likewise, if you happen to load the same file twice by using different path names, when you expected it to be loaded only once, this can also result in undesirable effects, depending on how the loaded file is written.
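Both halves of that can be sketched with a throwaway file.  The file name counter.pl and the $loads counter are invented for the example, and the second spelling of the path assumes a filesystem where dir/./file names the same file as dir/file.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

our $loads = 0;    # incremented every time the file body runs

# Write a tiny library file into a temporary directory.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/counter.pl";
open my $out, '>', $path or die "cannot write $path: $!";
print {$out} "\$main::loads++;\n1;\n";
close $out or die "cannot close $path: $!";

require $path;    # runs the file and records $path in %INC
require $path;    # same string: found in %INC, so skipped
my $after_same_path = $loads;

require "$dir/./counter.pl";    # same file, different string: runs again
my $after_other_path = $loads;

print "loads after requiring the same path twice: $after_same_path\n";
print "loads after requiring a second spelling:   $after_other_path\n";
```

On my understanding of %INC, which is keyed on the string passed to require, this should report 1 and then 2.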

There are probably more problems, but these are the ones I know of.

Doing it properly: "require Module" and "use Module"

The use SomeModule form imposes some restrictions on your code: the name you give it must be a bareword (an unquoted identifier, more or less); it must be available in the include path, @INC, after converting double-colons to path separators; it must be named with a .pm extension; and it should define the package SomeModule.  If that package has an import method (usually inherited or imported from Exporter), use calls it to export symbols into the caller; if there is none, the call is silently skipped.  require never calls import, so symbols are only exported when the module is loaded with use.

All these restrictions combine to make the code-inclusion mechanism more robust.  Modules have a canonical name, so they can't be loaded twice under different paths. This name resolves to the same path in the filesystem, independently of where the inclusion is initiated, which allows included files to include more files without worrying about the current directory or the path to the files being included.  The definition of a package name also creates a unique entry in the global namespace for the code, so that it doesn't need to be loaded more than once.

The import subroutine allows for a controlled inclusion of symbols into the caller's namespace, instead of dumping everything non-lexical (including all subroutines) into it.  use has the added benefit of implicitly wrapping BEGIN around the inclusion, so that constants defined in the included module are available for use in the caller without them needing to remember to write a BEGIN of their own.

Incidentally, you now know why constants are defined with use constant: the constant pragma wouldn't be able to do its job if it wasn't running at compile time.

Managing the Conversion from "require 'file.pl'" to "use SomeModule"

However, extolling the virtues of this system doesn't help you much if you have a codebase that relies on require 'foo.pl'; and its implicit export.  But, it's possible to write a file which can be easily converted from wrong to right.  Start by organizing the code into a package (copying the file to a '.pm' if necessary), then add some bridge code at the beginning to 'export' subs that simply call into the package:
sub frobnicate { &SomeModule::frobnicate; }
package SomeModule;
sub frobnicate {
  die "zomg, you didn't code this function";
}
Now, loading this code via require "../path/to/SomeModule.pm"; will create a frobnicate function that simply dies.  The &Function; syntax with the ampersand and without the parentheses (both details are important!) will call the function and share the current @_ with it instead of building a new argument list; this is usually not recommended, but we're doing dastardly things here, and this is just a shim that's not running any other code.  If SomeModule::frobnicate messes up @_, it doesn't affect the unqualified frobnicate, because it never uses the value itself.  It just returns the result back out, using Perl's implicit return feature.
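A quick sketch of why both details matter; count_args, with_amp, and with_parens are names invented for the illustration:

```perl
use strict;
use warnings;

sub count_args  { return scalar @_; }

sub with_amp    { &count_args; }     # ampersand, no parens: count_args sees our @_
sub with_parens { count_args(); }    # explicit empty list: count_args gets an empty @_

print with_amp(1, 2, 3), " ", with_parens(1, 2, 3), "\n";    # prints "3 0"
```

The shim relies on the first form, so whatever arguments the caller passed to the unqualified frobnicate reach SomeModule::frobnicate unchanged.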

When it comes time to make SomeModule available for use, the unqualified frobnicate gets deleted, and appropriate Exporter code added into the package.  Doing it in two phases like this lets you test that the simple reorganization into a package didn't break any of the callers, and that you have accounted for all the symbols that will need to be exported.
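For reference, the second phase might look roughly like this.  It's a sketch, not the one true layout: the frobnicate body is a placeholder, and the BEGIN block with the %INC assignment is only a trick to keep the example in one runnable file.  In real code the package would live in SomeModule.pm somewhere on @INC.

```perl
use strict;
use warnings;

BEGIN {
    package SomeModule;
    use Exporter 'import';             # borrow Exporter's import method
    our @EXPORT_OK = qw(frobnicate);   # exported only when the caller asks

    sub frobnicate {
        my @args = @_;
        return "frobnicated @args";    # placeholder body
    }

    # Pretend SomeModule.pm is already loaded so that use skips the file search.
    $INC{'SomeModule.pm'} = __FILE__;
}

use SomeModule qw(frobnicate);         # calls SomeModule->import('frobnicate')

print frobnicate('widget'), "\n";      # prints "frobnicated widget"
```

Using @EXPORT_OK rather than @EXPORT means callers have to name what they want, which makes it easier to confirm that every symbol the old require-based callers used has been accounted for.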

Then again, all this section may just be pointless, since you could go straight to using use and Exporter.  It's not like they're all that different from this hack.

This must be why Steve Yegge hates everything

Perl has a rich history starting with being a throwaway language with convenient features for the problems Larry Wall was working on at the time.  Since the Perl team has put a lot of work into backwards compatibility and convenience, it has created a lot of subtlety and minor traps.  You have to learn a lot about Perl before everything makes sense, and you're not just sprinkling my in front of every variable and mumbling, "WTF, why don't I just turn off use strict?  This doesn't seem helpful."

Likewise, the other languages he's shared his hate for have evolved quite a bit over time into their current chimerical forms: JavaScript and Python both started with global variables and hacked lexical scope in later.  PHP probably gets the same complaint, but it added lexical scoping so recently that he hasn't had either time or inclination to blog about it yet, and we know the story anyway.  Lastly, C++ was an experiment in grafting a specific view of OOP into a language that was never meant to have it.  (For other views, consider Rees Re: OO.)

Lexical scope in other languages

Variables in Python began as members of one of three scopes: local to the current function, global to the current module, or built-in to Python.  The module global scope could be accessed using the global keyword, which is similar in spirit to Perl's our keyword.  Python added something like lexical variables as the nested_scopes feature, which allowed for a function nested within some other function to see variables of the outer function that weren't in global scope.  However, all variable writing was assumed to be local, so those outer variables were effectively read-only.  Some hacks arose around this, like pointing a variable in the outer function to a writable object, allowing for communication via object updates, but in Python 3, the nonlocal keyword has been introduced to allow for declaring that you want to use a name from the outer (but still not global) scope instead of local scope.

PHP began with two scope levels.  Variables were either global or local to a function, with globals accessible from a function using the global keyword or, once it was introduced, the $GLOBALS superglobal.  (The main difference between superglobals and globals is that superglobals are pre-defined to be always global, so a function can use them without having to declare them with global first.)  This all changed with PHP 5.3.0, which introduced anonymous functions and the ability to use (close over) variables outside the function.

Unlike many other languages, however, PHP went its own way on two points: only anonymous functions may close over variables, and the variables to be closed over must also be included in the function declaration.  For example, $onUpdate = function ($x) use (&$y) { ... }; to allow the function to read and write the value of $y from the outer scope.  (The changes are probably justified: regular functions which need state should be objects, since PHP offers class-based OOP, and explicit naming of variables fits the philosophy of "it shouldn't be possible to have bugs because something was in an unexpected scope" that drove PHP's original scope design.)

In Conclusion

Lexical variables are visible based on their position in the source code; in Perl, this is from their declaration forward to the end of their block or file.  Lexical variables can be skipped over in favor of a global (package) variable, using our, which has the same area of effect as defining a lexical with my.

When a subroutine references a lexical variable in an outer scope, the subroutine is said to be "a closure", and it "closes over the variable".  This bears no relation to closing doors, emotional closure, or set theory (where, for instance, integers are closed under addition: add any two integers, and the result is still an integer.  This is not true of division.)

As of Perl 5.10, state variables are also possible, and are also private and lexically scoped; however, their value is saved between calls into the scope.  my variables would be re-initialized when their declaration occurs.

The values of dynamically scoped variables are controlled by the call stack; in Perl, this is accomplished by temporarily reassigning a global variable with local.  If you're not using local on Perl's special interpreter-control variables, then you are (or the code you're trying to affect is) most likely doing it wrong.

use strict prevents the fallback from lexicals to globals.  If the desired scope isn't selected lexically with my, our, or state, nor fully qualified by a package name (which implies a global variable), then the error is generated.

require $target or require "file.pl" are legacy features that have some unexpected interactions with modern Perl, and some limitations in creating truly reusable modules.  Instead, it is much preferred to create a package that can be imported with use; besides giving the caller control over the imported symbols, this also makes for a more robust module system that doesn't depend on the current working directory to find the modules.  (Unless you have use lib '.', but that seems inadvisable at best.)

Python and PHP both have global keywords that are similar in spirit to our, and as of Python 3, the nonlocal keyword allows for a lexical variable to be written to (instead of creating a local variable of the same name).  PHP gained closures in version 5.3, but only for anonymous functions, through the optional use (...) clause on the function declaration.
