Sunday, September 4, 2011

sapphirepaw's Introduction to Pointers, version 2

I programmed in assembly for some time, using pointers without understanding what they were, or that they were called pointers.  When I finally got to learning C, the pointer syntax was downright inscrutable, but when I got it, suddenly all of C and all of assembler lay clear before me, all at once.  It was a beautiful thing.

I was reminded about this while reading this post from HN.  It inspired me to try explaining pointers from the opposite direction.  Instead of trying to teach pointers via C syntax, let me try to start with pointers outside of programming, then discuss them in relation to C and PHP.


Pointers IRL: use and mention

A distinction can be made between using a word and mentioning it.  Consider the sentence, "The password is incorrect."  If it is saying that the wrong password has been given, then 'incorrect' is being used; if it is saying that the word 'incorrect' is the password, then 'incorrect' is being mentioned.  When mentioned, the word is not used for its meaning, but for the word itself.

In effect, a word is a pointer to its definition.  Normally, a word is used, so there is no written convention to signify use.  We just write the words.  When mentioned, the word is typically quoted or italicized, to indicate the mention and separate it from a mundane use.  The example sentence would then become, "The password is 'incorrect'."  In speech we don't have quotation marks, except for the much-maligned air-quotes, so inflection and timing are used instead: "The password is... incorrect."

Steve Yegge has a talk somewhere on branding, and how brands are also pointers.  They're like any other word, except that they have to be distinctive names, and the company associates the name with its specific product.  Thus, when anyone says 'Dr. Pepper,' there is a single product that comes to mind.  Yet nothing about branding makes a brand into a pointer that doesn't also apply to other words.  The only difference is that a brand's word, its pointer, is invented so that the owner can have exclusive control over the definition, instead of the multiple definitions that ordinary words commonly have.

Variables in Memory

Our next stop on this trip is to examine the structure of the average C program.  Everything the program is working on has to be held in memory.  This includes the actual program instructions, an area called the heap for large or long-lived data that the program needs to handle, and a dual-purpose area known as the stack.  The stack's primary purpose is to hold return addresses: when a function is called, the place to return to is stored on the stack, and it is read back when the function returns so that execution can continue with the next instruction in the caller.  The nature of the stack also makes it possible for individual functions to store small, temporary data there.

In order to catch runaway programs, the stack often has a limited size.  It may only be possible to store 8 MB there, while the heap may hold 1,000 MB or more.

So that the value stored in memory can be found, every variable has an address.  At some level, every variable is a pointer, pointing to its associated storage.  Normal use of a variable, as in "c = a+b", reads or modifies the contents of memory at the variable's address, just as normal uses of words rely on their definitions.

Function Calls

What happens when a variable is passed to a function?  It's used in the call, so its contents are retrieved, then given to the function, which reserves its own storage for its parameters, then places the value it received into its respective new address.  Any use of the variable within the function uses this new address, and the caller's copy is unaffected because it's stored at a different address.

If the stack is a small pile of index cards, which can be taken, written on in pencil, erased, and then returned in order, then calling a function takes an index card for each parameter, copies the value from the given parameter onto the card, and gives the card to the function.

This works fairly well for small data, like numbers, but what if we loaded a copy of War and Peace into the heap, and wanted to find the length of it?  It's long enough to take a lot of time to copy onto a series of index cards, and we may not even have enough cards available to do so.  After all, the stack is limited in size.  It would be ideal if we could simply tell the function where to find the text, instead of trying to send it a copy of the text itself.

If we could attach a cord to the text, and the other end to an index card, then the function could follow that cord to get to the text, without us having to copy it anywhere.  And, it would only use one of our cards, instead of consuming a huge amount of stack space.

This is essentially what a pointer is.  If we pass the address of the start of the text, then the called function (the callee) can go directly to that address and start reading the text to find its length.  In this case, we are now mentioning where the text is located, instead of using the whole text itself.

char *: Your First Pointer

This is exactly how it works in C.  In fact, C is famous for not having a specific "string" type, because strings are treated as arrays of characters, with an array being a contiguous series of addresses all holding the same type.  A string begins at some starting address, then continues until some address whose content is 0, also known as the '\0' character, which is simply pre-defined to represent the end of a string.

C's string is represented to the programmer as an array of characters, but pointers and arrays are closely related in C: in most expressions, and whenever it is passed to a function, an array is converted to a pointer to its first element.  Thus, as function parameters, char[] and char * represent the same type: a variable which holds the address of a character.  When code operates on a string, it assumes that more characters follow, according to the convention above.

Therefore, when we call strlen(war_and_peace), we give it the address of war_and_peace rather than the text itself (because as a string, war_and_peace already is a pointer to the text, the address of the start of the text); and for its part, strlen expects this, and dereferences the pointer it receives to work with the data.  It actively converts the mention into a use.

Yet, there's no reason that pointers must point to primitive values, like "char" in "char *".  A pointer's contents can be another pointer.  If you have a pointer-to-char, you have an array of char, which is a string; it follows that pointer-to-pointer-to-char is pointer-to-string, which is an array of string.  Notably, this is how command-line arguments appear to a program.  (Although it's still easier to reason about as char*[] than char** for me, even though the effect is equivalent.)

Use and Mention in C

There are two pointer-related functions of the * operator in C.  In a declaration, it tells how many levels of pointers are needed to reach the final variable contents.  Outside of a declaration, which I'll refer to as "as an instruction", it accesses the content of a pointer, yielding the pointed-to type.  Both of these may be stacked, as we saw with char** for declarations.  For instructions, a function with a signature like size_t strlen(const char *s) would access the character at s with *s; the first letter of the first string in char **argv could be accessed with **argv.  Likewise, *argv would yield the entire first argument as a string (since it would move from char** to char* as a type), for instance if you wanted to find strlen(*argv).

There is one more pointer-related operator, &.  Given a variable, instead of using its value, we can mention it (retrieving its address) with the ampersand.  Consider a function defined as int get_config_int(char* name, int* out) which returns a success or error code as its return value, and on success writes the value associated with name into *out.  If the caller has some variable defined as int mem_limit which it would like to use, it can use the & operator to get the address of mem_limit to pass to get_config_int, as in: ok = get_config_int("max_mem", &mem_limit);

Unlike *, & can't be repeated.  A normal variable has an address; but the address does not have its own place to be stored.  It can be stored into some other variable of the appropriate type, and that variable can have its address taken, but this is simply returning the address of the second variable, not the address of an address.

I've been using the terms already, but to keep things as clear as possible, in a declaration, char **argv reads literally as "char-pointer-pointer", though I usually swap the order to get "pointer-to-pointer-to-char".  In either case, this can be translated to "string array" as well, but I think of that as a translation rather than how it appears in the code.  In an instruction, *out might be read as "value-at-out", though I'm used to C enough to just read it as "star-out" and know what it means.  Finally, &mem_limit is conveniently said as "address-of-mem_limit".

Thinking in Pictures

It helps, especially when building larger data structures, to draw out the relationship between variables and pointers in a diagram.  The usual way is a box-and-arrow diagram, in which the boxes represent storage space.  Individual boxes may be named, if they are a variable.  The boxes contain either a primitive value or a pointer, the latter of which is represented by an arrow leading to the box it points to.

Here's an example with a hypothetical char **argv, for a program named "hello" run with one argument of "there".  I've additionally chosen to list types beneath the boxes to help demonstrate how a star is consumed when a pointer is followed.


Note that the two actual strings, and the intermediate array, don't actually have names: they're accessed using the argv variable and some combination of dereferencing (another technical name for following the pointer, converting its mention to a use).  Simply using the * operator to follow the pointer doesn't let us reach "sideways" into the arrays, though—for that, the natural way is to use e.g. argv[1] to access the string "there".

Not Explaining C

There are a number of details I've forced myself to leave out, since I'm trying to make this more about pointers and less about C.  This includes pointer arithmetic, a long discussion of equivalence between pointer arithmetic and array syntax, when and why you use pointers and pointers-to-pointers, and so forth.  These things are simply beyond the scope of this article.

Pointers Meet PHP

In PHP, pointers don't exist as a first-class concept.  However, a few things are built on the same ideas, which tends to make the manual complicated when it doesn't want to use the word "pointer" for any of them.

Let us first consider the case of a humble variable.  A PHP variable name is a string pointing to a value.  In contrast to C, where a variable has a type, it is the value which carries the type in PHP.  (This is what lets PHP be dynamically typed: a name may point to different values over time, and each value may be of any type.)  Another difference from C is that PHP always uses values through a name, so it doesn't provide a way to access the address of a value.  If you want to use it, you need one of its names.

A reference in PHP simply points a second name at the same value.  When you write $y =& $x, you are doing a pointer copy, similar to y = x; in C when both are of type int*.

Variable-variables, which let you do $x=42; $y="x"; echo $$y; to print 42, are just a chain of ordinary variable lookups.  $y is not directly pointing to the value of $x, but only to a string value containing the name "x", which is then immediately used to look for the value that $x is pointing to.

PHP's manual also has a long explanation regarding unset(), which becomes fairly simple to describe with pointers: unset() removes the name and its link to its value, leaving the value unaffected (unless this was the last name for the value, at which point the value is garbage collected).  $y =& $x; unset($y); does not result in observable change to $x.  The equivalent C is y = x; y = NULL; and this also does not affect x.

The other major caution regarding unset() is that the global keyword creates a reference, so that global $x; unset($x); has no effect on the global variable, as with any other reference: you delete the current $x, which points to the same value as the global $x, without affecting the pointed-to value.

PHP variables also act like they're copied by value when used, except for object instances as of PHP 5.0.  This works by secretly including another pointer: an object-type value holds not the actual object, but an identifier for the instance, also known as an "object handle".  When you use such a variable, PHP (under the hood) notices that it's an object, and fetches the real object based on the identifier in the value.  This way, the value can be copied like any other, yet still references "the same object" which is nearly always what you want.

You can generate a new instance by using clone on the value, which makes PHP actually create a new instance, with a new identifier, and return a new value which contains that new identifier.  I think I used this precisely once in my career, back when I was less skilled as a system designer.

Actually, this value-is-a-pointer pattern is not new with OO: resources work the same way.  When you get a file pointer with fopen(), assigning it does not open the file again.

But what about assembler?

It's pretty much irrelevant by now.  But just for fun, here's how things looked in AssemPro on our Amiga, running a 68000:

move.l #42, d0  ; load a constant, aka immediate value, into register d0
lea intbase, a0  ; load &intbase into a0 (PC-relative)
move.l #intbase, a6  ; load &intbase as a constant into a6 (not PC-relative)
move.l intbase, a6  ; load *(&intbase) into a6
move.l 0(a6), d0  ; load *(a6) into d0: a6 must hold some pointer aka address

The third example loads the value of &intbase at the time of assembly.  This may not be the actual address of the intbase label once the code is loaded to run, which is why relocation was invented.  (Though I am entirely clueless about how AmigaOS actually handled it.)

I can't remember now if the fourth example is PC-relative or not.  When I was doing this stuff in the 1990s, I didn't yet understand why you would write PC-relative code, and it was "more restrictive", so I usually didn't bother.  PC-relative code uses only offsets relative to the current instruction as addresses, so it can be loaded at any position in memory without having to be relocated or to use an offset table.

Feedback Still Encouraged

This post has been largely rewritten for quality, but suggestions and questions are still welcome in the comments.  In spite of my efforts to pare down the excess, there remain many digressions and hints of deeper layers throughout.  I simply have too much knowledge to judge what's useful and what's extraneous for the topic of pointers specifically.
