Liberate your Drupal data for a service-oriented architecture (using Redis, Node.js, and MongoDB)

April 29, 2012

Drupal’s basic content unit is a “node,” and to build a single node (or to perform any other Drupal activity), the codebase has to be bootstrapped, and everything needed to respond to the request (configuration, database and cache connections, etc.) has to be initialized and loaded into memory from scratch. Then node_load runs through the NodeAPI hooks, runs multiple database queries, and builds the node into a single PHP object.

This is fine if your web application runs entirely through Drupal, and always will, but what if you want to move toward a more flexible service-oriented architecture (SOA), and share your content (and users) with other applications? For example, build a mobile app with a Node.js backend like LinkedIn did; or calculate analytics for business intelligence; or have customer service reps talk to your customers in real time; or integrate with a ticketing system; or do anything else that doesn’t play to Drupal’s content-publishing strengths. Maybe you just want to make your data (which is the core of your business, not the server stack) technology-agnostic. Maybe you want to migrate a legacy Drupal application to a different system, but the cost of refactoring all the business logic is prohibitive; with an SOA you could change the calculation and get the best of both worlds.

The traditional way of doing this was setting up a web service in Drupal using something like the Services module. External applications could request data over HTTP, and Drupal would respond in JSON. Each request has to wait for Drupal to bootstrap, which uses a lot of memory (every enterprise Drupal site I’ve ever seen has been bogged down by legacy code that runs on every request), so it’s slow and doesn’t scale well. Rather than relieving some load from Drupal’s LAMP stack by building a separate application, you’re just adding more load to both apps. To spread the load, you have to keep adding PHP/Apache/MySQL instances horizontally. Every module added to Drupal compounds the latency of Drupal’s hook architecture (running thousands of function_exists calls, for example), so the stakeholders involved in changing the Drupal app have to include the users of every secondary application requesting the data. With a Drupal-Services approach, other apps will always be second-class citizens, dependent on the legacy system, which defeats the “loose coupling” principle of SOA.

I’ve been shifting my own work from Drupal to Node.js over the last year, but I still have large Drupal applications (such as Antiques Near Me) which can’t be easily moved away, and frankly don’t need to be for most use cases. Overall, I tend to think of Drupal as a legacy system, burdened by too much cruft and inconsistent architecture, and no longer the best platform for most applications. I’ve been giving a lot of thought to ways to keep these apps future-proof without rebuilding all the parts that work well as-is.

That led me to build what I’ve called the “Drupal Liberator”. It consists of a Drupal module and a Node.js app, and uses Redis (a very fast key-value store) for a middleman queue and MongoDB for the final storage. Here’s how it works:

When a node (or user, or other entity type) is saved in Drupal, the module encodes it to JSON (a cross-platform format that’s also native to Node.js and MongoDB), and puts it, along with metadata (an md5 checksum of the JSON, timestamp, etc.), into a Redis hash (a simple key-value object, containing the metadata and the object as a JSON string). It also notifies a Redis pub/sub channel of the new hash key. (This uses 13KB of additional memory and 2ms of time for Drupal on the first node, and 1KB/1ms for subsequent node saves on the same request. If Redis is down, Drupal goes on as usual.)
The Node.js app, running completely independently of Drupal, listens to the pub/sub channel. When it’s pinged with a hash key, it retrieves the hash, parses the string into a native object with JSON.parse, possibly alters it a little (e.g., adding the checksum and timestamp into the object), and saves it into MongoDB (which also speaks JSON natively). The data type (node, user, etc.) and other information in the metadata determine where it’s saved. Under normal conditions, this whole process from node_save to MongoDB takes less than a second. If a bottleneck were to develop at some point in the flow, the Node.js app runs asynchronously, so it would not block or strain Drupal in any way.
For redundancy, the Node.js app also polls the hash namespace every few minutes. If any part of the mechanism breaks at any time, or to catch up when first installing it, the timestamp and checksum stored in each saved object allow the two systems to easily find the last synchronized item and continue synchronizing from there.
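
The three steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the Liberator’s actual code (which is a Drupal module plus a Node.js app): plain dictionaries and a list stand in for Redis hashes, the pub/sub channel, and MongoDB collections, and the key scheme `liberator:<type>:<id>`, the field names, and all function names are assumptions made up for the example.

```python
import hashlib
import json
import time

# In-memory stand-ins (assumptions for illustration, not real clients).
redis_hashes = {}  # hash key -> {"type", "checksum", "timestamp", "json"}
channel = []       # stands in for the pub/sub channel: a queue of hash keys
mongo = {}         # collection name -> {document id -> document}


def publish_entity(entity_type, entity, now=None):
    """Drupal side: serialize the entity, store it with metadata, announce the key."""
    payload = json.dumps(entity, sort_keys=True)
    key = f"liberator:{entity_type}:{entity['id']}"  # hypothetical key scheme
    redis_hashes[key] = {
        "type": entity_type,
        "checksum": hashlib.md5(payload.encode("utf-8")).hexdigest(),
        "timestamp": time.time() if now is None else now,
        "json": payload,
    }
    channel.append(key)
    return key


def sync_key(key):
    """Consumer side: read the hash, parse the JSON, save it by entity type.

    Returns True if a document was written, False if it was already current.
    """
    record = redis_hashes[key]
    doc = json.loads(record["json"])
    doc["_checksum"] = record["checksum"]
    doc["_timestamp"] = record["timestamp"]
    collection = mongo.setdefault(record["type"], {})
    existing = collection.get(doc["id"])
    if existing is not None and existing["_checksum"] == record["checksum"]:
        return False  # same content already stored; nothing to do
    collection[doc["id"]] = doc
    return True


def drain_channel():
    """Process every pending notification; return how many documents were written."""
    written = 0
    while channel:
        if sync_key(channel.pop(0)):
            written += 1
    return written


def poll_all():
    """Fallback sweep: re-check every stored hash in case a notification was missed."""
    return sum(1 for key in list(redis_hashes) if sync_key(key))
```

In this sketch, calling `publish_entity("node", {"id": 1, "title": "Hello"})` followed by `drain_channel()` copies the node into the `node` collection, and a later `poll_all()` writes nothing unless a checksum has changed. Comparing checksums is what makes the periodic sweep cheap and safe to repeat.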

The result is a read-only clone of the data in MongoDB, synchronized almost instantaneously with Drupal. Individual nodes can be loaded as fully-built objects without bootstrapping Drupal (or touching Apache-MySQL-PHP at all). New apps utilizing the data can be built in any framework or language. The whole Drupal site could go down and the data needed for the other applications would still be usable. Complex queries (for node retrieval or aggregate statistics) that would otherwise require enormous SQL joins can be built using MapReduce and run without affecting the Drupal database.
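
To give a feel for the MapReduce style of aggregate query, here is a small sketch in plain Python over a list of already-synchronized documents. It only illustrates the shape of the computation (a map step that emits key/value pairs and a reduce step that combines the values per key); it does not use MongoDB’s API, and the `type` field and all function names are assumptions for the example.

```python
from collections import defaultdict


def map_reduce(documents, map_fn, reduce_fn):
    """Group the pairs emitted by map_fn by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def emit_type_count(doc):
    """Map step: emit one (type, 1) pair per document."""
    yield doc["type"], 1


def sum_values(key, values):
    """Reduce step: add up the counts emitted for one key."""
    return sum(values)
```

For example, `map_reduce(docs, emit_type_count, sum_values)` returns a count of documents per content type, computed entirely from the cloned data without a query against the Drupal database.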

One example of a simple use case this enables: Utilize the CMS backend to edit your content, but publish it using a thin MongoDB layer and client-side templates. (And outsource comments and other user-write interactions to a service like Disqus.) Suddenly your content displays much faster and under higher traffic with less server capacity, and you don’t have to worry about Varnish or your Drupal site being “Slashdotted”.

A few caveats worth mentioning: First, it’s read-only. If a separate app wants to modify the data in any way (and maintain data integrity across systems), it has to communicate with Drupal, or a synchronization bridge has to be built in the other direction. (This could be the logical next step in developing this approach, and truly make Drupal a co-equal player in an SOA.)

Second, you could have Drupal write to MongoDB directly and cut out the middlemen. (And indeed that might make more sense in a lot of cases.) But I built this with the premise of an already strained Drupal site, where adding another database connection would slow it down even further. This aims to put as little additional load on Drupal as possible, with the “Liberator” acting itself as an independent app.

Third, if all you needed was instant node retrieval (for example, if your app could query MySQL for node IDs, but didn’t want to bootstrap Drupal to build the node objects), you could leave the objects in Redis and take Node.js and MongoDB out of the picture.
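
A rough sketch of that Redis-only variant, in Python with a dictionary standing in for the Redis hashes: the app already has a list of node IDs (say, from its own MySQL query) and reads the pre-built JSON for each one. The `liberator:node:` key prefix, the `json` field name, and the function name are assumptions for illustration.

```python
import json


def load_nodes(node_ids, hash_store, key_prefix="liberator:node:"):
    """Return the pre-built node objects for the given IDs, skipping missing ones."""
    nodes = []
    for nid in node_ids:
        record = hash_store.get(f"{key_prefix}{nid}")
        if record is None:
            continue  # not exported yet; a caller could fall back to Drupal here
        nodes.append(json.loads(record["json"]))
    return nodes
```

IDs with no stored hash are simply skipped in this sketch; a real application would need to decide whether to fall back to Drupal or report the gap.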

I’ve just started exploring the potential of where this can go, so I’ve run this mostly as a proof-of-concept so far (successfully). I’m also not releasing the code at this stage: if you want to adopt this approach to evolve your Drupal system to a service-oriented architecture, I am available as a consultant to help you do so. I’ve started building separate apps in Node.js that tie into Drupal sites with Ajax, and found the speed and flexibility very liberating. There’s also a world of non-Drupal developers who can help you leverage your data, if it could be easily liberated. I see this as opening a whole new set of doors for where legacy Drupal sites can go.

BenBuckman

Software architect
