Friday, September 28, 2012

An Exercise in Optimizing PHP

Last winter, I was optimizing a PHP reporting script for no real reason besides practicing optimization.

The overall structure looked something like:
# Ignore lines logged outside of business hours: off-peak periods
$filter = new TimeFilter(TIME_OPEN, TIME_CLOSE, WEEKEND_REGEX);
while (($line = fgets(STDIN)) !== false) {
  if (++$lctr % PROGRESS_STEP === 0) update progress indicator;
  if (! $filter->accept($line)) continue;
  parse line;
  update histograms;
}

display histograms;

class TimeFilter {
  protected $min, $max, $skip_re;
  function accept ($timecode) {
    if (preg_match($this->skip_re, $timecode)) return false;
    $ts = $this->to_ts($timecode);
    if ($ts === null) return false;
    return ($this->min <= $ts && $this->max >= $ts);
  }
  protected function to_ts ($timecode, $default=null) {
    if (preg_match('/(\\d{1,2}:\\d{2}:\\d{2})\\s*(am|pm)?/i', $timecode, $m)) {
      list($h,$m,$s) = explode(':', $m[1]);
      if ($m[2] && strtolower($m[2][0]) === 'p') $h += 12;
      return sprintf("%02d%02d%02d", $h, $m, $s);
    }
    return $default;
  }
}
There's a nice bug in there—$m[2][0] will never be "p" because the match array got overwritten by the explode call—but that turns out to be irrelevant because all the timestamps are 24-hour time in the first place.  I just wrote it to check am/pm because I was too lazy to read the actual log first.

In the beginning, on 512,280 lines of log file, 145,163 events were accepted, and the entire program took an average of 25.176 seconds to run.  (Run-time is measured as the user CPU time; for the initial benchmarking, there were 15 trials each time.)

As noted above, the AM/PM handling was unnecessary, so to_ts was immediately reduced to:
if (preg_match('/(\\d{2}:\\d{2}:\\d{2})/', $timecode, $m))
  return str_replace(':', '', $m[1]);
return $default;
This cut the average time down by 2.813 seconds, to 22.363 seconds.  Next, I saw that the comparisons to min and max would be the same if I left the ':' inside the numbers, which removes the call to str_replace; the captured expression can be returned as-is.

That cut the time down by 1.329 more seconds, to 21.035.  It was a nice improvement; it shaved around 16% of the time off for a tiny bit of work.  But could I do more?  There wasn't much left to do that didn't harm the flexibility of the program.  However, I didn't need any flexibility, since I will only ever parse the log format I actually have, so I completely removed the general "skip regex", and hard-coded the fastest possible check for a weekend day out in the main loop:
if ($line[0] === 'S') continue; # Sat/Sun
if (! $filter->accept($line)) continue;
This condition was met on 34.5% of the loops, meaning that 1/3 of the time, I saved a call into the filter. This netted the second-largest improvement of the day, a 4.937-second gain, cutting execution time to 16.097 seconds on average.

Next, since my program now knew for certain the format of the log (i.e. that it began with a fixed-format time, actually the result of localtime(time) in Perl), I figured I could save some effort by passing only the time-of-day to accept(), rather than the entire line.
if (! $filter->accept(substr($line, 11, 8))) continue;
This squeaked out another 1.618 seconds of improvement, reducing the average time to 14.479 seconds. The program now ran in 57.5% of its original time, and I began to get excited about the possibility of breaking the 50% mark. It was obvious that if I knew exactly where the time-of-day was, I could compare it directly to my min/max constants.
$ts = substr($line, 11, 8);
if ($ts < TIME_OPEN || $ts > TIME_CLOSE) continue;
Of course, since I had just eliminated $filter->accept() and its preg_match() call completely, this earned the best speedup all day, of 5.157 seconds; average execution time was down to 9.322 seconds, 37% of the original.  Cumulatively, I had trimmed off 15.854 seconds by this point.

With a little bit more work in my parsing code (replacing a preg_split('/\\s+/',...) with an explode(' ',...), worth 1.999 seconds) and pre-computing one of the constants used in bucketing values for the histograms (0.186 seconds), I brought the final time down to 7.137 seconds, or 28.3% of the original run time.

Epilogue: I remember cutting another significant amount off of the time by using PHP 5.3 compiled from the tarball instead of the particular system's PHP 5.2, and considered all together, this meant I ended up being able to parse a much larger log in under 20 seconds.  Unfortunately, I sat on this post too long (i.e.: I decided to wait until 5.2 was banished from all systems so I wasn't inviting you to h4xx0r us, then forgot entirely) and I don't have either the final line-count nor the exact 5.2-vs-5.3 timing anymore.

I'm also certain that that modulus with PROGRESS_STEP got changed to a bitmask version—instead of updating on human-readable numbers, it would have been rewritten as:
if ((++$lctr & 0x7fff) === 0) update progress indicator;
That just seems like my style.

No comments: