Tech Blog :: mongodb


Jun 20 '12 2:06pm

Understanding MapReduce in MongoDB, with Node.js, PHP (and Drupal)

MongoDB's query language is good at extracting whole documents or whole elements of a document, but on its own it can't pull specific items from deeply embedded arrays, calculate relationships between data points, or compute aggregates. To do that, MongoDB uses an implementation of the MapReduce methodology: it iterates over the dataset and extracts the desired data points. Unlike SQL joins in relational databases, which essentially create a massive combined dataset and then extract pieces of it, MapReduce iterates over each document in the set, "reducing" the data piecemeal to the desired results. The name was popularized by Google, which needed to scale beyond SQL to index the web. Imagine trying to build the data structure for Facebook, with near-instantaneous calculation of the significance of every friend's friend's friend's posts, with SQL, and you can see why MapReduce makes sense.

I've been using MongoDB for two years, but only started using MapReduce heavily in the last few months. MongoDB is also introducing a new Aggregation framework in 2.1 that is supposed to simplify many operations that previously required MapReduce. However, the latest stable release as of this writing is still 2.0.6, so Aggregation isn't officially ready for prime time (and I haven't used it yet).

This post is not meant to substitute for the copious documentation and examples you can find across the web. Even after reading those, it took me some time to wrap my head around the concepts, so I want to try to explain them as I came to understand them.

The Steps

A MapReduce operation consists of a map, a reduce, and optionally a finalize function. Key to understanding MapReduce is understanding what each of these functions iterates over.

Map

First, map runs for every document retrieved in the initial query passed to the operation. If you have 1000 documents and pass an empty query object, it will run 1000 times.

Inside your map function, you emit a key-value pair, where the key is whatever you want to group by (_id, author, category, etc), and the value contains whatever pieces of the document you want to pass along. The map function doesn't return anything, because you can emit multiple key-value pairs per document, whereas a function can only return one result.

The purpose of map is to extract small pieces of data from each document. For example, if you're counting articles per author, you could emit the author as the key and the number 1 as the value, to be summed in the next step.
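
A minimal sketch of such a map function (the author field name is illustrative):

var map = function() {
  // `this` is the document being mapped
  emit(this.author, 1);
};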

Reduce

The reduce function then receives, for each key emitted from map, that key along with an array of all the values emitted for it. Its purpose is to reduce multiple values per key to a single value per key. At the end of each iteration of your reduce function, you return (not emit this time) a single value.

The number of times reduce runs for a given operation isn't easy to predict. (I asked about it on Stack Overflow and the consensus so far is, there's no simple formula.) Essentially reduce runs as many times as it needs to, until each key appears only once. If you emit each key only once, reduce never runs. If you emit most keys once but one special key twice, reduce will run once, getting (special key, [ value, value ]).

A rule of thumb with reduce is that the returned value's structure has to be the same as the structure emitted from map. If you emit an object as the value from map, every key in that object has to be present in the object returned from reduce, and vice-versa. If you emit an integer from map, return an integer from reduce, and so on. The basic reason is that, as noted above, reduce shouldn't be necessary if a key only appears once. The results of an entire map-reduce operation, run back through the same operation, should return the same results (that way huge operations can be sharded and map/reduced many times). And the output of any given reduce function, plugged back into reduce (as a single-item array), needs to return the same value that went in. (In CS lingo, reduce has to be idempotent. The documentation explains this in more technical detail.)

Here's a simple JS test, using Node.js' assertion API, to verify this. To use it, have your mapReduce module export its map, reduce, and (optional) finalize functions, so a separate test script can import and test them:

// this module should export the map, reduce, [finalize] functions passed to MongoDB.
var mr = require('./mapreduce-query');

var assert = require('assert');

// dummy documents to map - can be faked here or loaded from the DB
var dummyItems = [
  { author: 'Jane', category: 'antiques' }
  // ...
];

// override emit() to capture emitted key-value pairs locally
// (in global scope so map can access it)
var emitted = [];
global.emit = function(key, val) {
  emitted.push({ key: key, value: val });
};

// reduce input should be the same as its output for a single-value array
mr.map.call(dummyItems[0]);

var reduceRes = mr.reduce(emitted[0].key, [ emitted[0].value ]);
assert.deepEqual(reduceRes, emitted[0].value, 'reduce is idempotent');

A simple MapReduce example is counting the number of posts per author. So in map you could emit('author name', 1) for each document, then in reduce loop over each value and add it to a running total. Make sure reduce adds the actual number in each value, not just 1 per value, because that wouldn't be idempotent. Similarly, you can't just return values.length and assume each value represents 1 document.
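
A sketch of a reduce for that example, summing the actual values so that feeding its own output back in returns the same total:

var reduce = function(key, values) {
  var total = 0;
  values.forEach(function(value) {
    total += value;   // value may already be a partial sum from an earlier reduce pass
  });
  return total;
};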

Finalize

Now you have a single reduced value per key, and each one is run through the finalize function (once per key).

To understand finalize, consider that this is essentially the same as not having a finalize function at all:

var finalize = function(key, value) {
  return value;
};

finalize is not necessary in every MapReduce operation, but it's very useful, for example, for calculating averages. You can't calculate an average in reduce because reduce can run multiple times per key, so no single iteration is guaranteed to see all the data.
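
As a hedged sketch of that pattern: map emits a { sum, count } object, reduce keeps accumulating both fields (preserving the structure), and finalize does the division once per key. The score field is illustrative:

var map = function() {
  emit(this.author, { sum: this.score, count: 1 });
};

var reduce = function(key, values) {
  var result = { sum: 0, count: 0 };
  values.forEach(function(value) {
    result.sum += value.sum;
    result.count += value.count;
  });
  return result;   // same structure as emitted from map
};

var finalize = function(key, value) {
  value.average = value.count ? value.sum / value.count : 0;
  return value;
};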

The final results returned from the operation will have one value per key, as returned from finalize if it exists, or from reduce if finalize doesn't exist.

MapReduce in PHP and Drupal

The MongoDB library for PHP does not include any special functions for MapReduce. The operations can be run as a generic database command, but that takes a fair amount of boilerplate. I found a MongoDB-MapReduce-PHP library on Github which makes it easier. It works, but hasn't been updated in two years, so I forked the library and created my own version with what I think are some improvements.

The original library by infynyxx created an abstract class XMongoCollection that was meant to be sub-classed for every collection. I found it more useful to make XMongoCollection directly instantiable, as an extended replacement for the basic MongoCollection class. I added a mapReduceData method which returns the data from the MapReduce operation. For my Drupal application, I added a mapReduceDrupal method which wraps the results and error handling in Drupal API functions.

I could then load every collection with XMongoCollection and run mapReduce operations on it directly, like any other query. Note that the actual functions passed to MongoDB are still written in Javascript. For example:

// (this should be statically cached in a separate function)
$mongo = new Mongo($server_name);      // connection
$mongodb = $mongo->selectDB($db_name); // MongoDB instance
 
// use the new XMongoCollection class. make it available with an __autoloader.
$collection = new XMongoCollection($mongodb, $collection_name);
 
$map = <<<MAP
  function() {
    // doc is 'this'
    emit(this.category, 1);
  }
MAP;
 
$reduce = <<<REDUCE
  function(key, vals) {
    // `variable` is available here, passed in via setScope()
    return something; // reduce vals to a single value, same structure as emitted from map
  }
REDUCE;
 
$mr = new MongoMapReduce($map, $reduce, array( /* limit initial document set with a query here */ ));
 
// optionally pass variables to the functions. (e.g. to apply user-specified filters)
$mr->setScope(array('variable' => $variable));
 
// 2nd param becomes the temporary collection name, e.g. tmp_mapreduce_example.
// (This is a little messy and could be improved; the library's stated limitation that
//  v1.8+ doesn't support "inline" results isn't entirely clear.)
// 3rd param is $collapse_value, see code
$result = $collection->mapReduceData($mr, 'example', FALSE);

MapReduce in Node.js

The MongoDB-Native driver for Node.js, now an official 10Gen-sponsored project, includes a collection.mapReduce() method. The syntax is like this:

 
var db = new mongodb.Db(dbName, new mongodb.Server(mongoHost, mongoPort, {}));
db.open(function(error, dbClient) {
  if (error) throw error;  
  dbClient.collection(collectionName, function(err, collection) {
    collection.mapReduce(map, reduce, { 
        out : { inline : 1 },
        query: { ... },     // limit the initial set (optional)
        finalize: finalize,  // function (optional)
        verbose: true        // include stats
      },
      function(error, results, stats) {   // stats provided by verbose
        // ...
      }
    );
  });
});

It's mostly similar to the command-line syntax, except in the CLI, the results are returned from the mapReduce function, while in Node.js they are passed (asynchronously) to the callback.

MapReduce in Mongoose

Mongoose is a modeling layer on top of the MongoDB-native Node.js driver, and as of the latest 2.x release it does not have its own support for MapReduce. (It's supposed to be coming in 3.x.) But the underlying collection is still available:

var db = mongoose.connect('mongodb://dbHost/dbName');
// (db.connection.db is the native MongoDB driver)
 
// build a model (`Book` is a schema object)
// model is called 'Book' but collection is 'books'
mongoose.model('Book', Book, 'books');
 
...
 
var Book = db.model('Book');
Book.collection.mapReduce(...);

(I actually think this is a case of Mongoose being better without its own abstraction on top of the existing driver, so I hope the new release doesn't make it more complex.)

In sum

I initially found MapReduce very confusing, so hopefully this helps clarify rather than increase the confusion. Please write in the comments below if I've misstated or mixed up anything above.

May 24 '12 1:10pm

Quick tip: Share a large MongoDB query object between the CLI and Node.js

I was writing a very long MongoDB query in JSON that needed to work both in a Mongo CLI script and in a Node.js app. Duplicating the query across both raised the risk of one changing without the other. So I dug around for a way to share it and came up with this:

Create a query.js file, like so (the query is just an example, substitute your own):

// dual-purpose, include me in mongo cli or node.js
var module = module || {};    // filler for mongo CLI
 
// query here is for mongo cli. module.exports is for node.js
var query = module.exports = {
  '$and' : [
    { '$or' : [ 
      { someField: { '$exists': false } }, 
      { someOtherField: 0 } 
    ] },
 
    { '$or' : [ 
      { 'endDate' : { '$lt' : new Date() } },
      { 'endDate' : { '$exists' : false } }
    ] }
 
    // ...  
  ]
};

Then in your mongo CLI script, use

load('./query.js');   // sets query var
 
db.items.find(
    query,
 
    {
      // fields to select ...
    }
  )
  .sort({
    // etc...
  });

In your Node.js app, use var query = require('./query.js'); and plug the same query into your MongoDB driver or Mongoose find function. Duplication avoided!
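
For completeness, a minimal sketch of the Node.js side using the native driver (host, port, database, and collection names are illustrative):

var mongodb = require('mongodb');
var query = require('./query.js');

var db = new mongodb.Db('dbName', new mongodb.Server('localhost', 27017, {}));
db.open(function(err, db) {
  if (err) throw err;
  db.collection('items', function(err, items) {
    if (err) throw err;
    items.find(query).toArray(function(err, results) {
      if (err) throw err;
      // the same documents the CLI script matches
      console.log(results.length + ' items matched');
    });
  });
});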

May 22 '12 9:58pm

Using Node.js to connect with the eBay APIs on AntiquesNearMe.com

We recently rolled out a new space on Antiques Near Me called ANM Picks, which uses eBay's APIs to programmatically find high-quality antique auctions using the same metrics that Sean (my antique-dealer business partner) uses in his own business. We'll cover the product promotion elsewhere, but I want to write here about how it works under the hood.

The first iteration a few weeks ago simply involved an RSS feed being piped into the sidebar of the site. The site is primarily a Drupal 6 application, and Drupal has tools for handling feeds, but they're very heavy: They make everything a "node" (Drupal content item), and all external requests have to be run in series using PHP cron scripts on top of a memory-intensive Drupal process - i.e. they're good if you want to pull external content permanently into your CMS, but aren't suited for the kind of ephemeral, 3rd-party data that eBay serves. So I built a Node.js app that loaded the feed periodically, reformatted the items, and served them via Ajax onto the site.

The RSS feed gave us very limited control over the filters, however. To really curate the data intelligently, I needed to use eBay's APIs. They have dozens of APIs of varying ages and protocols. Fortunately for our purposes, all the services we needed spoke JSON. No single API gave us all the dimensions we needed (that would be too easy!), so the app evolved to make many requests to several different APIs and then cross-reference the results.

Node.js is great for working with 3rd-party web services because its IO is all asynchronous. Using async, my favorite Node.js flow-control module, the app can run dozens of requests in parallel or in a more finely-tuned queue. (Node.js obviously didn't invent asynchronous flow control - Java and other languages can spawn off threads to do this - but I doubt they do it as easily from a code perspective.)
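
To illustrate the pattern (not the actual app code), here's a sketch using async.parallel, with hypothetical wrappers around two of the eBay services:

var async = require('async');

// hypothetical wrappers around individual eBay API requests
function fetchFinding(callback) {
  // ... call the Finding API, then:
  callback(null, { items: [] });
}
function fetchShopping(callback) {
  // ... call the Shopping API, then:
  callback(null, { details: [] });
}

// run both requests in parallel, cross-reference once both have returned
async.parallel({ finding: fetchFinding, shopping: fetchShopping }, function(err, results) {
  if (err) return console.error(err);
  // merge results.finding and results.shopping here
});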

To work with the eBay APIs, I wrote a client library which I published to npm (npm install ebay-api). Code is on Github. (Ironically, someone else independently published another eBay module as I was working on mine; they're both incomplete in different ways, so maybe they'll converge eventually. I like that in the Node.js ecosystem, unlike in Drupal's, two nutcrackers are considered better than one.) The module includes an ebayApiGetRequest method for a single request, paginateGetRequest for a multi-page request, a parser for two of the services, a flatten method that simplifies the data structure (to more easily query the results with MongoDB), and some other helper functions. There are examples in the code as well.

Back to my app: Once the data is mashed up and filtered, it's saved into MongoDB (using Mongoose for basic schema validation, but otherwise as free-form JSON, which Mongo is perfect for). A subset is then loaded into memory for the sidebar Favorites and ANM Picks. (The database is only accessed for an individual item if it's no longer a favorite.) All this is frequently reloaded to fetch new items and flush out old ones (under the eBay TOS, we're not allowed to show anything that has passed).

The app runs on a different port from the website, so to pipe it through cleanly, I'm using Varnish as a proxy. Varnish is already running in front of the Drupal site, so I added another backend in the VCL and activated it based on a subdomain. Oddly, trying to toggle by URL (via req.url) didn't work - it would intermittently load the wrong backend without a clear pattern - so the subdomain (via req.http.host) was second best. It was still necessary to deal with Allow-Origin issues, but at least the URL scheme looked better, and the cache splits the load.

Having the data in MongoDB means we can multiply the visibility (and affiliate-marketing revenue) through social networks. Each item now has a Facebook Like widget which points back to a unique URL on our site (proxied through Drupal, with details visible until it has passed). The client-side JS subscribes to the widgets' events and pings our app, so we can track which items and categories are the most popular, and (by also tracking clicks) make sure eBay is being honest. We're tuning the algorithm to show only high-quality auctions, so the better it does, the more (we hope) they'll be shared organically.

Comments? Questions? Interested in using the eBay APIs with Node.js? Feel free to email me or comment below.

Apr 29 '12 10:06am

Liberate your Drupal data for a service-oriented architecture (using Redis, Node.js, and MongoDB)

Drupal's basic content unit is a "node," and to build a single node (or to perform any other Drupal activity), the codebase has to be bootstrapped, and everything needed to respond to the request (configuration, database and cache connections, etc) has to be initialized and loaded into memory from scratch. Then node_load runs through the NodeAPI hooks, multiple database queries are run, and the node is built into a single PHP object.

This is fine if your web application runs entirely through Drupal, and always will, but what if you want to move toward a more flexible Service-oriented architecture (SOA), and share your content (and users) with other applications? For example, build a mobile app with a Node.js backend like LinkedIn did; or calculate analytics for business intelligence; or have customer service reps talk to your customers in real-time; or integrate with a ticketing system; or do anything else that doesn't play to Drupal's content-publishing strengths. Maybe you just want to make your data (which is the core of your business, not the server stack) technology-agnostic. Maybe you want to migrate a legacy Drupal application to a different system, but the cost of refactoring all the business logic is prohibitive; with an SOA you could change the calculation and get the best of both worlds.

The traditional way of doing this was setting up a web service in Drupal using something like the Services module. External applications could request data over HTTP, and Drupal would respond in JSON. But each request has to wait for Drupal to bootstrap, which uses a lot of memory (every enterprise Drupal site I've ever seen has been bogged down by legacy code that runs on every request), so it's slow and doesn't scale well. Rather than relieving some load from Drupal's LAMP stack by building a separate application, you're just adding more load to both apps. To spread the load, you have to keep adding PHP/Apache/MySQL instances horizontally. Every module added to Drupal compounds the latency of Drupal's hook architecture (running thousands of function_exists calls, for example), so the stakeholders involved in changing the Drupal app have to include the users of every secondary application requesting the data. With a Drupal-Services approach, other apps will always be second-class citizens, dependent on the legacy system - hardly the "loose coupling" principle of SOA.

I've been shifting my own work from Drupal to Node.js over the last year, but I still have large Drupal applications (such as Antiques Near Me) which can't be easily moved away, and frankly don't need to be for most use cases. Overall, I tend to think of Drupal as a legacy system, burdened by too much cruft and inconsistent architecture, and no longer the best platform for most applications. I've been giving a lot of thought to ways to keep these apps future-proof without rebuilding all the parts that work well as-is.

That led me to build what I've called the "Drupal Liberator". It consists of a Drupal module and a Node.js app, and uses Redis (a very fast key-value store) for a middleman queue and MongoDB for the final storage. Here's how it works:

  • When a node (or user, or other entity type) is saved in Drupal, the module encodes it to JSON (a cross-platform format that's also native to Node.js and MongoDB), and puts it, along with metadata (an md5 checksum of the JSON, timestamp, etc), into a Redis hash (a simple key-value object, containing the metadata and the object as a JSON string). It also notifies a Redis pub/sub channel of the new hash key. (This uses 13KB of additional memory and 2ms of time for Drupal on the first node, and 1KB/1ms for subsequent node saves on the same request. If Redis is down, Drupal goes on as usual.)

  • The Node.js app, running completely independently of Drupal, is listening to the pub/sub channel. When it's pinged with a hash key, it retrieves the hash, JSON.parse's the string into a native object, possibly alters it a little (e.g., adding the checksum and timestamp into the object), and saves it into MongoDB (which also speaks JSON natively). The data type (node, user, etc) and other information in the metadata directs where it's saved. Under normal conditions, this whole process from node_save to MongoDB takes less than a second. And if it were to bottleneck at some point in the flow, it wouldn't matter much: the Node.js app runs asynchronously, not blocking or straining Drupal in any way. (A minimal sketch of this listener follows the list.)

  • For redundancy, the Node.js app also polls the hash namespace every few minutes. If any part of the mechanism breaks at any time, or to catch up when first installing it, the timestamp and checksum stored in each saved object allow the two systems to easily find the last synchronized item and continue synchronizing from there.
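
Here's a minimal sketch of that listener; the channel, hash fields, and collection routing are hypothetical stand-ins, not the actual Liberator code:

var redis = require('redis');
var mongodb = require('mongodb');

var sub = redis.createClient();     // subscriber connection
var store = redis.createClient();   // separate connection for reading hashes

var db = new mongodb.Db('liberator', new mongodb.Server('localhost', 27017, {}));
db.open(function(err, db) {
  if (err) throw err;

  sub.subscribe('drupal:liberated');   // hypothetical channel Drupal publishes hash keys to
  sub.on('message', function(channel, hashKey) {
    store.hgetall(hashKey, function(err, hash) {
      if (err || !hash) return;
      var doc = JSON.parse(hash.json);                            // the entity Drupal encoded
      doc._checksum = hash.md5;                                   // carry the metadata along
      doc._synced = new Date(parseInt(hash.timestamp, 10) * 1000);
      // route by entity type from the metadata (node, user, etc)
      db.collection(hash.type || 'node', function(err, collection) {
        if (err) return;
        collection.update({ nid: doc.nid }, doc, { upsert: true, safe: true }, function() {});
      });
    });
  });
});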

The result is a read-only clone of the data, synchronized almost instantaneously with MongoDB. Individual nodes can be loaded without bootstrapping Drupal (or touching Apache-MySql-PHP at all), as fully-built objects. New apps utilizing the data can be built in any framework or language. The whole Drupal site could go down and the data needed for the other applications would still be usable. Complex queries (for node retrieval or aggregate statistics) that would otherwise require enormous SQL joins can be built using MapReduce and run without affecting the Drupal database.

One example of a simple use case this enables: Utilize the CMS backend to edit your content, but publish it using a thin MongoDB layer and client-side templates. (And outsource comments and other user-write interactions to a service like Disqus.) Suddenly your content displays much faster and under higher traffic with less server capacity, and you don't have to worry about Varnish or your Drupal site being "Slashdotted".

A few caveats worth mentioning: First, it's read-only. If a separate app wants to modify the data in any way (and maintain data integrity across systems), it has to communicate with Drupal, or a synchronization bridge has to be built in the other direction. (This could be the logical next step in developing this approach, and truly make Drupal a co-equal player in an SOA.)

Second, you could have Drupal write to MongoDB directly and cut out the middlemen. (And indeed that might make more sense in a lot of cases.) But I built this with the premise of an already strained Drupal site, where adding another database connection would slow it down even further. This aims to put as little additional load on Drupal as possible, with the "Liberator" acting itself as an independent app.

Third, if all you needed was instant node retrieval - for example, if your app could query MySql for node ID's, but didn't want to bootstrap Drupal to build the node objects - you could leave them in Redis and take Node.js and MongoDB out of the picture.

I've just started exploring the potential of where this can go, so I've run this mostly as a proof-of-concept so far (successfully). I'm also not releasing the code at this stage: If you want to adopt this approach to evolve your Drupal system to a service-oriented architecture, I am available as a consultant to help you do so. I've started building separate apps in Node.js that tie into Drupal sites with Ajax and found the speed and flexibility very liberating. There's also a world of non-Drupal developers who can help you leverage your data, if it could be easily liberated. I see this as opening a whole new set of doors for where legacy Drupal sites can go.

Oct 16 '11 9:24pm

Exploring the node.js frontier

I have spent much of the last few weeks learning and coding in Node.js, and I'd like to share some of my impressions and lessons-learned for others starting out. If you're not familiar yet, Node.js is a framework for building server-side applications with asynchronous javascript. It's only two years old, but already has a vast ecosystem of plug-in "modules" and higher-level frameworks built on top of it.

My first application is a simple web app for learning Spanish using flashcards. The code is open on Github. The app utilizes basic CRUD (Create-Retrieve-Update-Delete) functionality (of "Words" in this case), form handling, authentication, input validation, and an end-user interface - i.e. the basic components of a web app. I'm using MongoDB for the database and Express.js (which sits on top of Connect, on top of Node) as the web framework. For templating I learned Jade, and for easier CSS I'm using LessCSS.

In the process of building it, I encountered numerous challenges and questions, some solved and many still open; found some great resources; and started to train my brain to think of server-side code asynchronously.

Node is a blank slate

Node "out of the box" isn't a web server like Apache; it's more of a language, like Ruby. You start with a blank slate, on top of which you can code a daemon, an IRC server, a process manager, or a blog - there's no automatic handling of virtualhosts, requests, responses, webroots, or any of the components that a LAMP stack (for example) assumes you want. The node community is building infrastructural components that can be dropped in, and I expect that the more I delve into the ecosystem, the more familiar I'll become with those components. At its core, however, Node is simply an API for asynchronous I/O methods.

No more linear flow

I'm used to coding in PHP, which involves linear instructions, each of them "blocking." Take this linear pseudocode snippet for CRUD operations on a "word" object, for example:

if (new word) {
  render an empty form
}
else if (editing existing word) {
  load the word
  populate the form
  render the form
}
else if (deleting existing word) {
  delete the word
  redirect back to list
}

This is easy to do with "blocking" code. Functions return values, discrete input-output functions can be reused in multiple situations, the returned values can be evaluated, each step follows from the previous one. This is convenient but limits performance: in a high-traffic PHP-MySql application, this flow takes up a server process, and if the database is responding slowly under the load, the whole process waits; concurrent processes quickly hog all the server's memory, and a bottleneck in one part of the stack stalls the whole application. In node, the rest of the operations in the "event loop" continue to run, waiting patiently for the database (or any other I/O) callback to respond.

Coding that way is not so easy, however. If you try to load the word, for instance, you run the query with an asynchronous callback. There is no return statement on the query function. The rest of the code has to be nested inside that callback, or else the code will keep running and will never get the response. So that bit would look more like this:

load the word ( function(word) {
  populate the form
  render the form
});

But deeply nested code isn't as intuitive as linear code, and it can make function portability very difficult. Suppose you have to run 10 database queries to populate the form - nesting them all inside each other gets very messy, and what if the logic needs to be more conditional, requiring a different nesting order in different cases?

There are ways of handling these problems, of course, but I'm just starting to learn them. In the case of the simple "load the word" scenario, Express offers the app.param construct, which parses parameters in the URL before executing the route callback. So the :word token tells the app to load a word with a given ID into the request object, then it renders the form.
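
A sketch of that app.param approach (Express 2.x-era API; Word is a Mongoose model as in the app, though the route and lookup here are illustrative):

// load the word once, for any route containing :word
app.param('word', function(req, res, next, id) {
  Word.findById(id, function(err, word) {
    if (err) return next(err);
    if (!word) return next(new Error('Word not found'));
    req.word = word;
    next();
  });
});

// the route callback can assume the word is already loaded
app.get('/words/:word/edit', function(req, res) {
  res.render('word_form', { word: req.word });
});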

No more ignoring POST and GET

In PHP, if there's a form on a page, the same piece of code processes the page whether its HTTP method is POST or GET. The $_REQUEST array even combines their parameters. Express doesn't like that, however - there is an app.all() construct that ignores the method, but the framework seems to prefer separate app.get() and app.post() routing. (There's apparently some controversy/confusion over the additional method PUT, but I steered clear of that for now.)

Back to the "word form" scenario: I load the form with GET, but submit the form with POST. That's two routes with essentially duplicate code. I could simply save an entry on POST, or render the form with GET - but what if I want to validate the form, then it needs to render the form when a POST fails validation - so it quickly becomes a mess. Express offers some solutions for this - res.redirect('back') goes back to the previous URL - but that seems like a hack that doesn't suit every situation. You can see how I handled this here, but I haven't yet figured out the best general approach to this problem.

New code needs a restart

In a PHP application, you can edit or deploy the code directly to the webroot, and as soon as it's saved, the next request uses it. With node, however, the javascript is loaded into memory when the app is run using the node command, and it runs the same code until the application is restarted. In its simplest use, this involves a Ctrl+C to stop and node app.js to restart. There are several pitfalls here:

  • Sessions (and any other in-app memory items) are lost every time you restart. So anyone using your app is suddenly logged out. For sessions, this is resolved with a database or other external session store; I can imagine other scenarios where this would be more challenging.
  • An uncaught runtime bug can crash the app, and if it's running autonomously on a server, there's nothing built-in to keep it running. One approach to this is a process manager; I'm using forever, which was built especially for node, to keep processes running and restart them easily when I deploy new code. Others have built tools within Node that abstract an individual app's process through a separate process-managing app.

When should the database connect?

Node's architectural philosophy suggests that nothing should be loaded until it's needed. A database connection might not be needed on an empty form, for instance - so it makes sense to open a database connection per request, and only when needed. I tried this approach first, using a "route middleware" function to connect on certain requests, and separated the database handling into its own module. That failed when I wanted to keep track of session IDs with MongoDB (using connect-mongo) - because a database connection is then needed on every request, and the examples all opened a connection at the top of the app, in the global scope. I switched to the latter approach, but I'm not sure which way is better.
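
For reference, a sketch of the connect-at-the-top approach with MongoDB-backed sessions, using the Express 2.x / connect-mongo APIs of that era (database name and secret are illustrative):

var express = require('express');
var mongoose = require('mongoose');
var MongoStore = require('connect-mongo');

// one connection, opened in the global scope when the app starts
var db = mongoose.connect('mongodb://localhost/flashcards');

var app = express.createServer();
app.use(express.cookieParser());
app.use(express.session({
  secret: 'change-me',
  store: new MongoStore({ db: 'flashcards' })   // sessions survive app restarts
}));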

Javascript can get very complicated

  • As logic flows through nested callbacks, variable scope is constantly changing. var and this have to be watched very carefully.
  • Writing functions that work portably across use cases without simple return statements is tricky. (One nice Node convention that covers many of these scenarios is the callback(error, result) pattern, which lets calling functions know in a standard way whether the result came back successfully; see the sketch after this list.)
  • Passing logic flow across node's "modules" is also tricky. Closures are helpful here, passing the app object to route modules, for instance. But in many cases, it wasn't clear how to divide the code in a way that was simultaneously logical, preserved variable scope, and worked portably with callbacks.
  • Everything - functions, arrays, classes - is an object. Class inheritance is done by instantiating another class/object and then modifying the new object's prototype. The same object can have the equivalent of "static" functions (by assigning them directly to the object) or instantiated methods (by assigning them to prototype). It's easy to get confused.
  • Javascript is a little clunky with handling empty values. The standard approach still seems to be if (typeof x == "undefined") which, at the very least, is a lot of code to express if (x). I used Underscore.js to help with this and other basic object manipulation shortcuts.
  • Because Express processes the request until there's a clear end to the response, and because everything is asynchronous, it's easy to miss a scenario in the flow where something unpredictable happens, no response is sent, and the client/user's browser simply hangs waiting for a response. I don't know if this is bad on the node side - the hanging request probably uses very little resources, since it's not actively doing anything - but it means the code has to handle a lot of possible error scenarios. Unlike in blocking code, you can't just put a catch-all else at the end of the flow to handle the unknown.
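
To make the callback(error, result) convention concrete, a small sketch (the collection and lookup are illustrative):

function loadWord(collection, id, callback) {
  collection.findOne({ _id: id }, function(err, word) {
    if (err) return callback(err);                      // first argument: the error (or null)
    if (!word) return callback(new Error('not found'));
    callback(null, word);                               // second argument: the result
  });
}

// any caller knows, in a standard way, whether the result came back successfully
loadWord(words, someId, function(err, word) {
  if (err) return console.error(err);
  // use `word` here
});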

What my Flashcards app does now

The Spanish Flashcards app currently allows words (with English, Spanish, and part of speech) to be entered, shown in a list, put into groups, and randomly cycled with only one side shown, as a flashcard quiz.
The app also integrates with the WordReference API to look up a new word and enter it - however, as of now, there's a bug in the English-Spanish API that prevents definitions from being returned. So I tested it using the English-French dictionary, and hope they'll fix the Spanish one soon.
It's built now to require login, with a single password set in a plain-text config.js file.

Next Steps for the app

I'd like to build out the flashcard game piece, so it remembers what words have been played, lets the player indicate if he got the answer right or wrong, and prioritizes previously-wrong or unseen words over ones that the player already knows.

Where I want to go with Node.js

I've been working primarily with Drupal for several years, and I want to diversify for a number of reasons: I've become very frustrated with the direction of Drupal core development, and don't want all my eggs in that basket. Web applications are increasingly requiring real-time, high-concurrency, noSQL infrastructure, which Node is well-suited for and my LAMP/Drupal skillset is not. And maybe most importantly, I find the whole architecture of Node to be fascinating and exciting, and the open-source ecosystem around it is growing organically, extremely fast, and seemingly without top-down direction.

Some resources that helped me

(All of these and many more are in my node.js tag on Delicious.)

  • Nodejitsu docs - tutorials on conventions and some best practices.
  • Victor Kane's node intro and Lit app, which taught me a lot about CRUD best practices.
  • The API documentation for node.js, connect, and express.js.
  • HowToNode - seems like a generally good node resource/blog.
  • NPM, the Node Package Manager, is critical for sharing portable components, and serves as a central directory of Node modules.
  • 2009 talk by Ryan Dahl, the creator of Node.js, introducing the framework.
  • Forms and express-form, two libraries for handling form rendering and/or validation. (I tried the former and decided not to use it, but they try to simplify a very basic problem.)

Check out the code for my Spanish Flashcards app, and if you're into Node yourself and want to learn more of it together, drop me a line!

Nov 14 '10 4:02am

Using MongoDB for Watchdog (logging) in Drupal

Much has been made of the MongoDB support in Drupal, mostly with Fields in Drupal 7, but with some limited backporting to D6 as well. MongoDB is a member of the new "NoSQL" family of databases, potentially offering better performance and scalability than relational databases like MySQL.

I've wanted to start using MongoDB for a while, so Watchdog (Drupal's error logging API) - being a particularly MySql-intensive process - seems like a good start. (One solution is to use syslog instead of MySql for watchdog logging, but then there's no backend UI to view the logs.)

The caveat is that the MongoDB module for Drupal 6 is very limited and officially no longer supported. The watchdog module in the latest release (not updated since July) seems to log events correctly, but has no page to view individual event details (so only truncated message summaries are available).

I contributed event pages as a patch here. I also submitted another patch to make the variable names in the module more flexible and consistent.

We'll see how this works. My MongoDB experience is very limited, so if this becomes a high-maintenance part of the site in the short term, I'll have to turn it off. I'm hoping it just works for now, and now that the MongoDB database is running, I'll start using it for mini-apps within the site that don't need MySQL or could benefit from Mongo's performance.

Oct 25 '10 8:49pm

DevSeed's Data Browser built on Node.js and Mongo

It's a beautiful thing to see Development Seed, one of the most development-intensive Drupal shops, branching into Node.js and MongoDB, entirely away from Drupal.

In fact, I'm going to set up a Node.js server right now.