Tech Blog :: php


Jun 20 '12 2:06pm

Understanding MapReduce in MongoDB, with Node.js, PHP (and Drupal)

MongoDB's query language is good at extracting whole documents or whole elements of a document, but on its own it can't pull specific items from deeply embedded arrays, or calculate relationships between data points, or calculate aggregates. To do that, MongoDB uses an implementation of the MapReduce methodology to iterate over the dataset and extract the desired data points. Unlike SQL joins in relational databases, which essentially create a massive combined dataset and then extract pieces of it, MapReduce iterates over each document in the set, "reducing" the data piecemeal to the desired results. The name was popularized by Google, which needed to scale beyond SQL to index the web. Imagine trying to build the data structure for Facebook, with near-instantaneous calculation of the significance of every friend's friend's friend's posts, with SQL, and you see why MapReduce makes sense.

I've been using MongoDB for two years, but only in the last few months started using MapReduce heavily. MongoDB is also introducing a new Aggregation framework in 2.1 that is supposed to simplify many operations that previously needed MapReduce. However, the latest stable release as of this writing is still 2.0.6, so Aggregation isn't officially ready for prime time (and I haven't used it yet).

This post is not meant to substitute for the copious documentation and examples you can find across the web. After reading those, it still took me some time to wrap my head around the concepts, so I want to try to explain them as I came to understand them.

The Steps

A MapReduce operation consists of a map, a reduce, and optionally a finalize function. Key to understanding MapReduce is understanding what each of these functions iterates over.

Map

First, map runs for every document retrieved in the initial query passed to the operation. If you have 1000 documents and pass an empty query object, it will run 1000 times.

Inside your map function, you emit a key-value pair, where the key is whatever you want to group by (_id, author, category, etc), and the value contains whatever pieces of the document you want to pass along. The function doesn't return anything, because you can emit multiple key-values per map, but a function can only return 1 result.

The purpose of map is to extract small pieces of data from each document. For example, if you're counting articles per author, you could emit the author as the key and the number 1 as the value, to be summed in the next step.
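As a rough sketch of that counting example, here is a map function run locally in Node.js. The sample documents, their `author` field, and the local `emit` stub are all made up for illustration; inside MongoDB, `emit` is provided by the server.

```javascript
// Local stand-in for MongoDB's emit(), so the map function can run in Node.js.
const emitted = [];
global.emit = function (key, value) {
  emitted.push({ key: key, value: value });
};

// The map function: called once per document, with the document as `this`.
function map() {
  emit(this.author, 1);
}

// Hypothetical sample documents.
const docs = [
  { author: 'alice', title: 'First post' },
  { author: 'bob', title: 'Hello' },
  { author: 'alice', title: 'Second post' }
];

docs.forEach(function (doc) {
  map.call(doc);
});

console.log(emitted);
```

Each document produces one key-value pair here, so 'alice' shows up twice in the emitted list; collapsing those duplicates is the job of reduce.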

Reduce

The reduce function then receives each of these key-value(s) pairs, for each key emitted from map, with the values in an array. Its purpose is to reduce multiple values-per-key to a single value-per-key. At the end of each iteration of your reduce function, you return (not emit this time) a single variable.

The number of times reduce runs for a given operation isn't easy to predict. (I asked about it on Stack Overflow and the consensus so far is, there's no simple formula.) Essentially reduce runs as many times as it needs to, until each key appears only once. If you emit each key only once, reduce never runs. If you emit most keys once but one special key twice, reduce will run once, getting (special key, [ value, value ]).
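To make that concrete, here is a toy in-memory imitation of a map-reduce pass. It is not how MongoDB's engine actually works (the real one may call reduce several times per key, in batches); the `runMapReduce` helper and the sample `tag` field are hypothetical. It only illustrates that keys emitted once skip reduce entirely.

```javascript
// A toy, in-memory imitation of a map-reduce pass (not MongoDB's actual engine).
function runMapReduce(docs, map, reduce) {
  const groups = new Map();

  // Collect every emitted value under its key.
  global.emit = function (key, value) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs.forEach(function (doc) { map.call(doc); });

  let reduceCalls = 0;
  const results = {};
  groups.forEach(function (values, key) {
    if (values.length === 1) {
      results[key] = values[0];   // key emitted once: reduce is skipped
    } else {
      reduceCalls++;
      results[key] = reduce(key, values);
    }
  });
  return { results: results, reduceCalls: reduceCalls };
}

const docs = [{ tag: 'a' }, { tag: 'b' }, { tag: 'special' }, { tag: 'special' }];
const out = runMapReduce(
  docs,
  function () { emit(this.tag, 1); },
  function (key, values) {
    return values.reduce(function (sum, v) { return sum + v; }, 0);
  }
);

console.log(out.results, out.reduceCalls);
```

With this input, only the 'special' key reaches reduce, which matches the "special key" case described above.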

A rule of thumb with reduce is that the returned value's structure has to be the same as the structure emitted from map. If you emit an object as the value from map, every key in that object has to be present in the object returned from reduce, and vice-versa. If you emit an integer from map, return an integer from reduce, and so on. The basic reason is that (as noted above), reduce shouldn't be necessary if a key only appears once. The results of an entire map-reduce operation, run back through the same operation, should return the same results (that way huge operations can be sharded and map/reduced many times). And the output of any given reduce function, plugged back into reduce (as a single-item array), needs to return the same value as went in. (In CS lingo, reduce has to be idempotent. The documentation explains this in more technical detail.)

Here's a simple JS test, using Node.js' assertion API, to verify this. To use it, have your mapReduce operation export its functions for a separate test script to import and test:

var assert = require('assert');

// this should export the map, reduce, [finalize] functions passed to MongoDB.
var mr = require('./mapreduce-query');
 
// override emit() to capture locally
var emitted = [];
 
// (in global scope so map can access it)
global.emit = function(key, val) {
  emitted.push({key:key, value:val});
};
 
// reduce input should be same as output for a single object
// dummyItems can be fake or loaded from DB
mr.map.call(dummyItems[0]);
 
var reduceRes = mr.reduce(emitted[0].key, [ emitted[0].value ]);
assert.deepEqual(reduceRes, emitted[0].value, 'reduce is idempotent');

A simple MapReduce example is to count the number of posts per author. So in map you could emit('author name', 1) for each document, then in reduce loop over each value and add it to a total. Make sure reduce is adding the actual number in the value, not just 1, because that won't be idempotent. Similarly, you can't just return values.length and assume each value represents 1 document.
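A minimal sketch of why this matters, run locally in Node.js. The function names and the two-batch split are invented for the illustration; the point is only that reduce may receive its own earlier output as input.

```javascript
// Correct: add up the actual values, which may already be partial totals.
function reduceSum(key, values) {
  let total = 0;
  values.forEach(function (v) { total += v; });
  return total;
}

// Broken: assumes every value stands for exactly one document.
function reduceLength(key, values) {
  return values.length;
}

// Five posts by one author, reduced in two batches and then re-reduced.
const batch1 = [1, 1, 1];
const batch2 = [1, 1];

const goodTotal = reduceSum('alice', [
  reduceSum('alice', batch1),
  reduceSum('alice', batch2)
]);
const badTotal = reduceLength('alice', [
  reduceLength('alice', batch1),
  reduceLength('alice', batch2)
]);

console.log(goodTotal); // 5
console.log(badTotal);  // 2
```

The summing version gives the same answer no matter how the values are batched, while the length-based version only counts how many partial results it was handed.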

Finalize

Now you have a single reduced value per key, which gets run through the finalize function once per key.

To understand finalize, consider that this is essentially the same as not having a finalize function at all:

var finalize = function(key, value) {
  return value;
}

finalize is not necessary in every MapReduce operation, but it's very useful, for example, for calculating averages. You can't calculate the average in reduce because it can run multiple times per key, so each iteration doesn't have enough data to calculate with.

The final results returned from the operation will have one value per key, as returned from finalize if it exists, or from reduce if finalize doesn't exist.
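One way to handle the averaging case, sketched locally in Node.js: have reduce carry a running sum and count, and divide only in finalize. This assumes map emits values shaped like `{ sum: <number>, count: 1 }`; the 'news' key and the numbers are made up.

```javascript
// reduce keeps the same shape that map emits: { sum, count }.
function reduce(key, values) {
  const out = { sum: 0, count: 0 };
  values.forEach(function (v) {
    out.sum += v.sum;
    out.count += v.count;
  });
  return out;
}

// finalize runs once per key, after all reducing is done,
// so it is the safe place to divide.
function finalize(key, value) {
  value.avg = value.sum / value.count;
  return value;
}

// Simulate reduce running twice for the same key.
const partialA = reduce('news', [{ sum: 4, count: 1 }, { sum: 2, count: 1 }]);
const partialB = reduce('news', [partialA, { sum: 3, count: 1 }]);
const result = finalize('news', partialB);

console.log(result); // { sum: 9, count: 3, avg: 3 }
```

Because reduce only ever adds sums and counts, feeding a partial result back into it is harmless, and the average is computed from the complete totals.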

MapReduce in PHP and Drupal

The MongoDB library for PHP does not include any special functions for MapReduce. They can be run simply as a generic command, but that takes a lot of code. I found a MongoDB-MapReduce-PHP library on GitHub which makes it easier. It works, but hasn't been updated in two years, so I forked the library and created my own version with what I think are some improvements.

The original library by infynyxx created an abstract class XMongoCollection that was meant to be sub-classed for every collection. I found it more useful to make XMongoCollection directly instantiable, as an extended replacement for the basic MongoCollection class. I added a mapReduceData method which returns the data from the MapReduce operation. For my Drupal application, I added a mapReduceDrupal method which wraps the results and error handling in Drupal API functions.

I could then load every collection with XMongoCollection and run mapReduce operations on it directly, like any other query. Note that the actual functions passed to MongoDB are still written in JavaScript. For example:

// (this should be statically cached in a separate function)
$mongo = new Mongo($server_name);      // connection
$mongodb = $mongo->selectDB($db_name); // MongoDB instance
 
// use the new XMongoCollection class. make it available with an autoloader.
$collection = new XMongoCollection($mongodb, $collection_name);
 
$map = <<<MAP
  function() {
    // doc is 'this'
    emit(this.category, 1);
  }
MAP;
 
$reduce = <<<REDUCE
  function(key, vals) {
    // `variable` (passed in via `setScope`) is available here
    var total = 0;
    for (var i = 0; i < vals.length; i++) {
      total += vals[i];
    }
    return total;
  }
REDUCE;
 
$mr = new MongoMapReduce($map, $reduce, array( /* limit initial document set with a query here */ ));
 
// optionally pass variables to the functions. (e.g. to apply user-specified filters)
$mr->setScope(array('variable' => $variable));
 
// 2nd param becomes the temporary collection name, so tmp_mapreduce_example. 
// (This is a little messy and could be improved. Stated limitation of v1.8+ not supporting "inline" results is not entirely clear.)
// 3rd param is $collapse_value, see code
$result = $collection->mapReduceData($mr, 'example', FALSE);

MapReduce in Node.js

The MongoDB-Native driver for Node.js, now an official 10gen-sponsored project, includes a collection.mapReduce() method. The syntax is like this:

 
var db = new mongodb.Db(dbName, new mongodb.Server(mongoHost, mongoPort, {}));
db.open(function(error, dbClient) {
  if (error) throw error;  
  dbClient.collection(collectionName, function(err, collection) {
    collection.mapReduce(map, reduce, { 
        out : { inline : 1 },
        query: { ... },     // limit the initial set (optional)
        finalize: finalize,  // function (optional)
        verbose: true        // include stats
      },
      function(error, results, stats) {   // stats provided by verbose
        // ...
      }
    );
  });
});

It's mostly similar to the command-line syntax, except in the CLI, the results are returned from the mapReduce function, while in Node.js they are passed (asynchronously) to the callback.

MapReduce in Mongoose

Mongoose is a modeling layer on top of the MongoDB-native Node.js driver, and in the latest 2.x release does not have its own support for MapReduce. (It's supposed to be coming in 3.x.) But the underlying collection is still available:

var db = mongoose.connect('mongodb://dbHost/dbName');
// (db.connection.db is the native MongoDB driver)
 
// build a model (`Book` is a schema object)
// model is called 'Book' but collection is 'books'
mongoose.model('Book', Book, 'books');
 
...
 
var Book = db.model('Book');
Book.collection.mapReduce(...);

(I actually think this is a case of Mongoose being better without its own abstraction on top of the existing driver, so I hope the new release doesn't make it more complex.)

In sum

I initially found MapReduce very confusing, so hopefully this helps clarify rather than increase the confusion. Please write in the comments below if I've misstated or mixed up anything above.

Mar 24 '11 10:34am

Track Freshbooks Expenses in Google Docs with PHP and XML

I've been trying to automate as much of my financial forecasting as possible, with coding up front that will last a while. My primary tools are Freshbooks (for expense and invoice tracking) and Google Docs for spreadsheets. I wrote yesterday about pulling data from one spreadsheet into another using importRange. Last night I took it several steps further, pulling expenses from the Freshbooks API into XML, then XML to GDocs, and automating tax calculations based on expense category.

1. Freshbooks Expenses to XML

Building on an existing freshbooks-php library, I wrote a PHP script called freshbooks_expenses_xml. (Link goes to GitHub.)

To get it set up, create a keys.php file, and put the whole package on your server somewhere. Play with the parameters described in the readme to get different XML output.

2. XML to Google Docs

In cell A1 of a clean spreadsheet, enter this function:
=importXML("http://your-site.com/freshbooks-expenses/expenses.php?date_from=2011-01-01&date_to=2011-12-31&headers=1&", "//expenses/*")
GDocs will fetch the data and populate the spreadsheet. (Note: I had some trouble making the headers consistent with the columns, and worked around it; you might want to do the same by omitting headers=1 in the function and putting in your own.)

3. Making useful tax calculations with the data

For estimated quarterly taxes (as an LLC), I need to know my revenue (calculated in another spreadsheet, not yet but possibly soon also pulled automatically from Freshbooks) minus my business expenses. As I learned doing taxes for 2010, not all expense categories are equal: Meals & Entertainment, for example, is generally deducted at 50%, while others are 100%. This is easy to do with custom GDocs functions. Next to my expenses (pulled automatically), I have a column for Month, a column for Quarter (using a custom function), and a column for Deduction, using the amount and the category. (To write a custom function, go to Tools > Scripts > Script Editor.)
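The custom functions aren't shown in this post, so here is a hedged sketch of what they might look like. Google Docs scripts are written in JavaScript; the function names `QUARTER` and `DEDUCTION` and the rate table are my own placeholders, and only the 50% rate for Meals & Entertainment comes from the text above.

```javascript
// Quarter (1-4) for a month number (1-12).
function QUARTER(month) {
  return Math.ceil(month / 3);
}

// Deductible share per expense category.
// Anything not listed here is treated as 100% deductible.
var DEDUCTION_RATES = {
  'Meals & Entertainment': 0.5
};

// Deductible amount for one expense row.
function DEDUCTION(amount, category) {
  var rate = DEDUCTION_RATES.hasOwnProperty(category)
    ? DEDUCTION_RATES[category]
    : 1;
  return amount * rate;
}
```

In a sheet, these would be used like any built-in function, e.g. =QUARTER(C2) or =DEDUCTION(D2, E2), with the cell references depending on your own column layout.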

Finally, in my income sheet, I use sumif() on the range in the other [expenses] sheet with the calculated deductions for that quarter, times my expected tax rate, and I know how much quarterly taxes to pay!

(Update: A revised version of this post now appears on the FreshBooks Developer Blog.)

Mar 13 '11 7:56pm

Switching PHP memcache.so extension to memcached.so

I was having some caching issues earlier that I concluded were memcache-related. The memcache terminology is confusing: 'memcache' is the colloquial name, 'memcached' is the daemon, and PHP has both memcache and memcached extensions. The memcache module for Drupal supports both, but recommends the memcached version. I was running the other one, so I decided to switch to see if that would fix my problems.

The swap was harder than I expected, so here's how I did it, in case anyone else wants to do the same. This assumes you already have the daemon and old memcache library working correctly.

First try the simple method. This didn't work for me because I didn't have libmemcached installed. If it works for you, you're lucky:

sudo pecl install memcached
(I specified the version, memcached-1.0.2, to make sure I got the latest stable release, but that number might change by the time you read this.)

Anyway, that didn't work for me - I got an error, "Can't find libmemcached headers". The documentation specifies a --with-libmemcached-dir parameter to handle this. But I didn't have the library installed anywhere, so I had to install it. (Fully install it, not just download it.)

I used /opt to hold the files, with the latest version of libmemcached, and ran everything as root (otherwise add sudo to each line, or at least to the make install step):

cd /opt
wget http://launchpad.net/libmemcached/1.0/0.40a/+download/libmemcached-0.40.tar.gz
tar -xzvf libmemcached-0.40.tar.gz
cd libmemcached-0.40
./configure
make
make install

Now try the simple method again: sudo pecl install memcached. If that still doesn't work, specify the directory manually:

cd /opt
pecl download memcached-1.0.2
tar zxvf memcached-1.0.2.tgz
cd memcached-1.0.2
phpize
./configure --with-libmemcached-dir=/opt/libmemcached-0.40/libmemcached
make
make install

(Play around with the configure line there if it still fails. I tried 100 variations until I got it working - I think with pecl install after the full make install on libmemcached - but your results may vary.)

If this worked, there should now be a memcached.so file in your PHP extensions directory.

Now for the PHP config: the documentation on memcached's runtime configuration is sparse. The Drupal module recommends setting memcache.hash_strategy="consistent", but I'm not sure if this has any effect on memcached.so. In my setup there was a conf.d/memcache.ini file, symlinked to cli/conf.d (for command line config) and apache2/conf.d. I changed the extension call to the new file, removed extraneous configs that didn't seem to be documented anywhere, and set the hash_strategy for good measure. Then I checked the config with apache2ctl configtest (will differ by distro). That checked out, so I restarted Apache. phpinfo() showed the new extension, my caching problem went away, and all seems well so far.

Feb 23 '11 11:20am

Drupal as an Application Framework: Unofficially competing in the BostonPHP Framework Bakeoff

BostonPHP hosted a PHP Framework Bake-Off last night, a competition among four application frameworks: CakePHP, Symfony, Zend, and CodeIgniter. A developer coding in each framework was given 30 minutes to build a simple job-posting app (wireframes publicized the day before) in front of a live audience.

I asked the organizer if I could enter the competition representing Drupal. He replied that Drupal was a Content Management System, not a framework, so it should compete against WordPress and Joomla, not the above four. My opinion on the matter was and remains as follows:

  1. The differences between frameworks and robust CMSs are not well defined, and Drupal straddles the line between them.
  2. The test of whether a toolkit is a framework is whether the following question yields an affirmative answer: “Can I use this toolkit to build a given application?” Here Drupal clearly does, and for apps far more advanced than this one.
  3. The exclusion reflects a kind of coder-purist snobbery ("it's not a framework if you build any of it in a UI") and lack of knowledge about Drupal's underlying code framework.
  4. In a fair fight, Drupal would either beat WordPress hands-down building a complex app (because its APIs are far more robust) or fail to show its true colors with a simple blog-style site that better suits WP.

Needless to say, I wasn't organizing the event, so Drupal was not included.

So I entered Drupal into the competition anyway. While the first developer (using CakePHP) coded for 30 minutes on the big screen, I built the app in my chair from the back of the auditorium, starting with a clean Drupal 6 installation, recording my screen. Below is that recording, with narration added afterwards. (Glance at the app wireframes first to understand the task.)

Worth noting:

  • I used Drupal 6 because I know it best; if this were a production app, I would be using the newly released Drupal 7.
  • I start, as you can see, with an empty directory on a Linux server and an Apache virtualhost already defined.
  • I build a small custom module at the end just to show that code is obviously involved at anything beyond the basic level, but most of the setup is done in the UI.


One irony of the framework-vs-CMS argument is that what makes these frameworks appealing is precisely the automated helpers - be it scaffolding in Symfony, baking in CakePHP, raking in Rails, etc - that all reduce the need for wheel-reinventing manual coding. After the tools do their thing, the frameworks require code, and Drupal requires (at the basic level) visual component building (followed, of course, by code as the app gets more custom/complex). Why is one approach more "framework"-y or app-y than the other? If I build a complex app in Drupal, and my time spent writing custom code outweighs the UI work (as it usually does), does that change the nature of the framework?

Where the CMS nature of Drupal hits a wall in my view is in building apps that aren't compatible with Drupal's basic assumptions. It assumes the basic unit - a piece of "content" called a "node" - should have a title, body, author, and date, for example. If that most basic schema doesn't fit what you're trying to build, then you probably don't want to use Drupal. But for many apps, it fits well enough, so Drupal deserves a spot on the list of application frameworks, to be weighed for its pros and cons on each project just like the rest.

Oct 27 '10 3:43pm

PHP foreach/reference oddities

I encountered this weirdness today with some basic PHP:

Early in the script I had: (note the reference &$nid)

// validate
foreach($nids as $key => &$nid) {
  if (empty($nid)) unset($nids[$key]);
  if (! is_numeric($nid)) unset($nids[$key]);
  $nid = (int) $nid;
}
 
print_r($nids);

Outputs:

Array
(
[0] => 81
[1] => 1199
)

Then later on...

  foreach($nids as $nid) {
    echo $nid;
  }

Outputs:
81
81

If I change the 2nd loop to:

  foreach($nids as &$nid) {
    echo $nid;
  }

then it outputs 81 and 1199 like it should.

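Shouldn't as $nid reset the variable so it's no longer a reference? (It doesn't: after the first loop, $nid is still a reference to the last element of $nids, so each iteration of the second loop writes the current value into that last element. Calling unset($nid) right after the first loop breaks the reference and fixes the output.)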

Jun 27 '10 11:05pm

Debugging PHP with XAMPP, MacGDBp, and TextMate

Technosophos has a great tutorial on setting up a PHP debugging environment with XDebug and MAMP. I'm using XAMPP and it works the same way; just change the path where xdebug.so goes.

However, the TextMate part - using the xdebug.file_link_format parameter - doesn't seem to be working. Apparently others are having the same problem, possibly Snow Leopard-related; I'm not sure if there's a solution. It's not necessary for the debugger to work, however, just a convenient way to view the error-causing code.

Feb 3 '10 10:38am

HipHop PHP Engine by Facebook

Facebook formally presented its HipHop project last night. (Video below.) PHP is written in C but interpreted at runtime, and trades performance for code simplicity. So HipHop aims to convert PHP into optimized C++ "just in time" (which I think is the same as runtime), then compile that C++ and run it much faster than PHP would otherwise run. They've been running it live for six months and claim it uses 50% less CPU than the standard engine with equal traffic, and 30% less CPU with twice the traffic (compared to the Zend engine with APC opcode cache).

Most of the "magical" features supported in PHP (but not in C++) were preserved, but eval(), which allows arbitrary code to be run in the script, was removed. This means Drupal can't use HipHop, for one thing.

The optimization potential depends on "how much of your code looks like C++?" Flexible variable types, for instance, run slower than type-cast variables, so HipHop has an "inference engine" to convert to C++ variable types, gaining performance for clear types but not so much when using "variant" types.

HipHop also uses its own HTTP server, so no Apache support (yet). Tabini notes, "Of course, this doesn’t preclude you from running one or more HipHop projects against separate ports on the same machine and then use Apache (or Squid, or any other server) to reverse proxy to them."

It'll all be open source, of course: the project home is here, and code will be on GitHub "soon."

Update: Four Kitchens ponders ways Drupal could be modified to support HipHop. (The changes suggested there should probably be done regardless.) I look forward to seeing that in their Pressflow distribution.

Dec 23 '09 2:53pm

Path Traversal is frighteningly simple

This StackOverflow question about path traversal prompted me to see how easy it is.

All it takes is a PHP file like this on your server:

<?php
// explore path traversal vulnerabilities
ini_set('display_errors', 'on');
ini_set('error_reporting', E_ALL);
 
$path = isset($_GET['path']) ? $_GET['path'] : '';
 
if (empty($path)) {
  echo "No path.";
  die;
}
 
echo $path . '<br/>' . realpath($path) . '<hr/>';
 
if (is_dir($path)) {
  echo '<pre>' . print_r(scandir($path),true) . '</pre>';
}
else {
  $file = file_get_contents($path);
  echo htmlspecialchars($file);  
}

... and someone can gain total read access to your file system. Run that script with ?path=../../etc/passwd, for example, and the system's user list is printed straight to the screen. (Because most Unix systems set --4 [all-read] permissions by default on system files.) (So DO NOT put that code on your server!)

Of course, that exact code would never be used, but there are all kinds of other scenarios where user-submitted parameters or cookies are passed through to the file system. That's one of the advantages of working in a framework (vs coding an app from scratch) - all these considerations have (presumably) been taken into account, and the API (if used correctly) should handle it. But it just reminds me how critical it is to escape all characters, never pass through form values directly, never load files based on unfiltered user input, etc etc... Apache's access directives are useless once the script is running server-side.
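As an illustration of the "never load files based on unfiltered user input" rule, here is a minimal sketch in Node.js (not PHP, and not taken from any framework): resolve the requested path against a fixed base directory and refuse anything that lands outside it. The `resolveInside` helper and the `/var/www/files` directory are hypothetical, and this does not account for symlinks.

```javascript
const path = require('path');

// Resolve a user-supplied path against a base directory
// and refuse anything that ends up outside that directory.
function resolveInside(baseDir, userPath) {
  const base = path.resolve(baseDir);
  const target = path.resolve(base, userPath);
  const rel = path.relative(base, target);

  // A relative path that climbs upward (or is absolute, e.g. another drive)
  // means the target is outside the base directory.
  if (rel === '..' || rel.startsWith('..' + path.sep) || path.isAbsolute(rel)) {
    throw new Error('Path escapes base directory: ' + userPath);
  }
  return target;
}

// A normal request stays inside the base directory.
console.log(resolveInside('/var/www/files', 'reports/2009.txt'));

// A traversal attempt is rejected.
try {
  resolveInside('/var/www/files', '../../etc/passwd');
} catch (e) {
  console.log(e.message);
}
```

The check runs on the fully resolved path rather than on the raw string, so tricks like a/../../b are caught as well as a plain leading "../".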

Nov 13 '09 3:09pm

Old-school SSI on PHP5/Apache2.2

I had to set up a local copy of an old site running (on the server) PHP 4 and hundreds of old-school SSIs (server-side includes), of the <!--#include variety. It took a bunch of time to get it right, but in the end it's pretty simple:

  1. Make sure mod_include is enabled in your httpd.conf.
  2. Have this directive in your httpd.conf (it should be possible in the virtualhost or .htaccess too, but it doesn't work there):
       <Directory />
       Options Includes
       </Directory>
     (It's possible the Directory can be other than root (/) -- I think the key is just to put it in httpd.conf.)
  3. Also in httpd.conf, put the directive SetOutputFilter INCLUDES.

Prior to PHP 4.6 I think, there was a configuration directive --enable-track-vars which allowed SSI variables to work in PHP, or something like that, but it's deprecated and built in now, so PHP can output SSI directives.

Make sure to restart Apache after adding this, of course.

If I run into other issues with this, I'll update this post; in the meantime, it seems to be working.