Tech Blog :: antiquesnearme


May 22 '12 9:58pm

Using Node.js to connect with the eBay APIs on AntiquesNearMe.com

We recently rolled out a new space on Antiques Near Me called ANM Picks, which uses eBay's APIs to programmatically find high-quality antique auctions using the same metrics that Sean (my antique-dealer business partner) uses in his own business. We'll cover the product promotion elsewhere, but I want to write here about how it works under the hood.

The first iteration a few weeks ago simply involved an RSS feed being piped into the sidebar of the site. The site is primarily a Drupal 6 application, and Drupal has tools for handling feeds, but they're very heavy: They make everything a "node" (Drupal content item), and all external requests have to be run in series using PHP cron scripts on top of a memory-intensive Drupal process - i.e. they're good if you want to pull external content permanently into your CMS, but aren't suited for the kind of ephemeral, 3rd-party data that eBay serves. So I built a Node.js app that loaded the feed periodically, reformatted the items, and served them via Ajax onto the site.
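Roughly, and just to illustrate the shape of that first version (a sketch using only Node core modules - the feed URL, port, and reformatting step are placeholders, not the real code):

var http = require('http');

var FEED_URL = 'http://example.com/ebay-feed.rss';  // placeholder
var items = [];  // reformatted feed items, held in memory

function loadFeed() {
  http.get(FEED_URL, function (res) {
    var xml = '';
    res.on('data', function (chunk) { xml += chunk; });
    res.on('end', function () {
      // parse the RSS and reformat the items here, with your XML parser of choice
      // items = reformat(xml);
    });
  }).on('error', console.error);
}

loadFeed();
setInterval(loadFeed, 15 * 60 * 1000);  // reload periodically

// Serve the current items as JSON for the site's Ajax call.
http.createServer(function (req, res) {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify(items));
}).listen(8001);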

The RSS feed gave us very limited control over the filters, however. To really curate the data intelligently, I needed to use eBay's APIs. They have dozens of APIs of varying ages and protocols. Fortunately for our purposes, all the services we needed spoke JSON. No single API gave us all the dimensions we needed (that would be too easy!), so the app evolved to make many requests to several different APIs and then cross-reference the results.

Node.js is great for working with 3rd-party web services because its IO is all asynchronous. Using async, my favorite Node.js flow-control module, the app can run dozens of requests in parallel or in a more finely-tuned queue. (Node.js obviously didn't invent asynchronous flow control - Java and other languages can spawn off threads to do this - but I doubt they do it as easily from a code perspective.)
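For example (a schematic sketch, not the app's actual code - the three request functions are hypothetical stand-ins for calls to different eBay services):

var async = require('async');

// Hypothetical wrappers, each making one eBay API request and
// calling back with (error, results):
function findingRequest(callback) { /* ... */ callback(null, []); }
function shoppingRequest(callback) { /* ... */ callback(null, []); }
function merchandisingRequest(callback) { /* ... */ callback(null, []); }

// Run them all in parallel, then cross-reference once everything is back.
async.parallel([findingRequest, shoppingRequest, merchandisingRequest],
  function (err, results) {
    if (err) return console.error(err);
    // results[0], results[1], results[2] line up with the functions above
  });

// Or, for a more finely-tuned queue, cap the concurrency:
var queue = async.queue(function (task, done) {
  // make the request described by `task`, then call done()
  done();
}, 5);  // at most 5 requests in flight at once
queue.push({ service: 'FindingService' });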

To work with the eBay APIs, I wrote a client library which I published to npm (npm install ebay-api). Code is on GitHub. (Ironically, someone else independently published another eBay module while I was working on mine; they're both incomplete in different ways, so maybe they'll converge eventually. I like that in the Node.js ecosystem, unlike in Drupal's, two nutcrackers are considered better than one.) The module includes an ebayApiGetRequest method for a single request, paginateGetRequest for a multi-page request, a parser for two of the services, a flatten method that simplifies the data structure (to more easily query the results with MongoDB), and some other helper functions. There are examples in the code as well.
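In rough outline, a single request looks something like this (the option names here are simplified and illustrative - see the examples in the repo for exact usage):

var ebay = require('ebay-api');

// (Option names below are approximate/illustrative, not a definitive reference.)
ebay.ebayApiGetRequest({
    serviceName: 'FindingService',
    opType: 'findItemsAdvanced',
    appId: 'YOUR-EBAY-APP-ID',        // from your eBay developer account
    params: { keywords: ['antique', 'sideboard'] }
  },
  function (error, items) {
    if (error) throw error;
    console.log(items.length + ' items found');
  });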

Back to my app: Once the data is mashed up and filtered, it's saved into MongoDB (using Mongoose for basic schema validation, but otherwise as free-form JSON, which Mongo is perfect for). A subset is then loaded into memory for the sidebar Favorites and ANM Picks. (The database is only accessed for an individual item if it's no longer a favorite.) All this is frequently reloaded to fetch new items and flush out old ones (under the eBay TOS, we're not allowed to show anything that has passed).
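The storage piece, in sketch form (the field names are hypothetical; the point is the loose schema and the periodically-refreshed in-memory subset):

var mongoose = require('mongoose');
mongoose.connect('mongodb://localhost/anm');

// A loose schema: index the couple of fields we query on,
// let everything else stay free-form JSON.
var ItemSchema = new mongoose.Schema({
  itemId:  { type: String, index: true },
  endTime: Date
}, { strict: false });
var Item = mongoose.model('Item', ItemSchema);

// The sidebar's subset, held in memory and refreshed on a timer.
var picks = [];
function reloadPicks() {
  // only auctions that haven't ended yet (per the eBay TOS)
  Item.find({ endTime: { $gt: new Date() } }, function (err, items) {
    if (!err) picks = items;
  });
}
reloadPicks();
setInterval(reloadPicks, 10 * 60 * 1000);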

The app runs on a different port from the website, so to pipe it through cleanly, I'm using Varnish as a proxy. Varnish is already running in front of the Drupal site, so I added another backend in the VCL and activated it based on a subdomain. Oddly, trying to toggle by URL (via req.url) didn't work - it would intermittently load the wrong backend without a clear pattern - so the subdomain (via req.http.host) was second best. It was still necessary to deal with cross-origin (Access-Control-Allow-Origin) issues, but at least the URL scheme looked better, and the cache splits the load.
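The relevant VCL is short - something like this (the backend address and subdomain pattern are illustrative; syntax is Varnish 3.x):

backend nodeapp {
  .host = "127.0.0.1";
  .port = "8001";
}

sub vcl_recv {
  # route the picks subdomain to the Node.js app
  if (req.http.host ~ "^picks\.") {
    set req.backend = nodeapp;
  }
}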

Having the data in MongoDB means we can multiply the visibility (and affiliate-marketing revenue) through social networks. Each item now has a Facebook Like widget which points back to a unique URL on our site (proxied through Drupal, with details visible until it has passed). The client-side JS subscribes to the widgets' events and pings our app, so we can track which items and categories are the most popular, and (by also tracking clicks) make sure eBay is being honest. We're tuning the algorithm to show only high-quality auctions, so the better it does, the more (we hope) they'll be shared organically.
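The tracking hook is just a few lines of client-side JS. A sketch (the endpoint and its parameters are made up for illustration; the FB.Event.subscribe calls are the Facebook JS SDK's own API):

// After the Facebook JS SDK has loaded:
FB.Event.subscribe('edge.create', function (likedUrl) {
  // someone clicked Like -- ping our Node.js app (endpoint is hypothetical)
  jQuery.post('/api/like', { url: likedUrl });
});
FB.Event.subscribe('edge.remove', function (unlikedUrl) {
  jQuery.post('/api/unlike', { url: unlikedUrl });
});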

Comments? Questions? Interested in using the eBay APIs with Node.js? Feel free to email me or comment below.

Dec 12 '11 3:52pm

Making sense of Varnish caching rules

Varnish is a reverse-proxy cache that allows a site with a heavy backend (such as a Drupal site) and mostly consistent content to handle very high traffic load. The “cache” part refers to Varnish storing the entire output of a page in its memory, and the “reverse proxy” part means it functions as its own server, sitting in front of Apache and passing requests back to Apache only when necessary.

One of the challenges with implementing Varnish, however, is the complex “VCL” protocol it uses to process requests with custom logic. The syntax is unusual, the documentation relies heavily on complex examples, and there don’t seem to be any books or other comprehensive resources on the software. A recent link on the project site to Varnish Training is just a pitch for a paid course. Searching more specifically for Drupal + Varnish will bring up many good results - including Lullabot’s fantastic tutorial from April, and older examples for Mercury - but the latest stable release is now 3.x and many of the examples (written for 2.x) don’t work as written anymore. So it takes a lot of trial and error to get it all working.

I’ve been running Varnish on AntiquesNearMe.com, partly to keep our hosting costs down by getting more power out of less [virtual] hardware. A side benefit is the site’s ability to respond very nicely if the backend Apache server ever goes down. They’re on separate VPSes (connected via internal private networking), and if the Apache server completely explodes from memory overload, or I simply need to upgrade a server-related package, Varnish will display a themed “We’re down for a little while” message.

But it wasn’t until recently that I got Varnish’s primary function, caching, really tuned. I spent several days under the hood recently, and while I don’t want to rehash what’s already been well covered in Lullabot’s tutorial, here are some other things I learned:

Check syntax before restarting

After you update your VCL, you need to restart Varnish - using sudo /etc/init.d/varnish restart for instance - for the changes to take effect. If you have a syntax error, however, this will take down your site. So check the syntax first (change the path to your VCL as needed):
varnishd -C -f /etc/varnish/default.vcl > /dev/null

If there are errors, it will display them; if not, it shows nothing. Use that as a visual check before restarting. (Unfortunately the exit code of that command is always 0, so you can’t do check-then-restart as simply as check-varnish-syntax && /etc/init.d/varnish restart, but you could grep the output for the words “exit 1” to accomplish the same.)
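For example, relying on that “exit 1” text, a rough check-then-restart one-liner (the errors show up on stderr, since stdout is already going to /dev/null):

varnishd -C -f /etc/varnish/default.vcl 2>&1 >/dev/null | grep -q "exit 1" \
  || sudo /etc/init.d/varnish restart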

Logging

The std.log function allows you to generate arbitrary messages about Varnish’s processing. Add import std; at the top of your VCL file, and then std.log("DEV: some useful message") anywhere you want. The “DEV” prefix is an arbitrary way of differentiating your logs from all the others. So you can then run in the shell, varnishlog | grep "DEV" and watch only the information you’ve chosen to see.

How I use this:
- At the top of vcl_recv() I put std.log("DEV: Request to URL: " + req.url);, to put all the other logs in context.
- When I pipe back to apache, I put std.log("DEV: piping " + req.url + " straight back to apache"); before the return (pipe);
- On blocked URLs (cron, install), the same
- On static files (images, JS, CSS), I put std.log("DEV: Always caching " + req.url);
- To understand all the regex madness going on with cookies, I log req.http.Cookie at every step to see what’s changed.
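Put together, the top of vcl_recv looks something like this (a trimmed-down sketch - the URL patterns are examples, not our actual rules):

import std;

sub vcl_recv {
  std.log("DEV: Request to URL: " + req.url);

  # (example pattern - whatever you pipe straight through)
  if (req.url ~ "^/dont-cache-me") {
    std.log("DEV: piping " + req.url + " straight back to apache");
    return (pipe);
  }

  if (req.url ~ "\.(png|gif|jpg|jpeg|css|js)(\?.*)?$") {
    std.log("DEV: Always caching " + req.url);
  }

  std.log("DEV: Cookie is now: " + req.http.Cookie);
}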

Plug some of these in, check the syntax, restart Varnish, run varnishlog|grep PREFIX as above, and watch as you hit a bunch of URLs in your browser. Varnish’s internal logic will quickly start making more sense.

Watch Varnish work with your browser

[Image: Varnish headers in Chrome Inspector]
The Chrome/Safari Inspector and Firebug show the headers for every request made on a page. With Varnish running, look at the Response Headers for one of them: you’ll see “Via: Varnish” if the page was processed through Varnish, or “Server: Apache” if it went through Apache. (Using Chrome, for instance, log in to your Drupal site and the page should load via Apache (assuming you see page elements not available to anonymous users), then open an Incognito window and it should run through Varnish.)

Add hit/miss headers

When a page is supposed to be cached (not piped immediately), Varnish looks it up in its cache, resulting in a hit or a miss. To watch this in your Inspector, use this logic:
sub vcl_deliver {
  std.log("DEV: Hits on " + req.url + ": " + obj.hits);
 
  if (obj.hits > 0) {
    set resp.http.X-Varnish-Cache = "HIT";
  }
  else {
    set resp.http.X-Varnish-Cache = "MISS";
  }
 
  return (deliver);
}

Then you can clear the caches, hit a page (using the browser technique above), see “via Varnish” and a MISS, hit it again, see a HIT (or not), and know if everything is working.

Clear Varnish when aggregated CSS+JS are rebuilt

If you have CSS/JS aggregation enabled (as recommended), your HTML source will reference long hash-string files. Varnish caches that HTML with the hash string. If you clear only those caches (“requisites” via Admin Menu or cc css+js via Drush), Varnish will still have the old references, but the files will have been deleted. Not good. You could simply never use that operation again, but that’s a little silly.

The heavy-handed solution I came up with (I welcome alternatives) is to wipe the Varnish cache when CSS+JS resets. That operation is not hook-able, however, so you have to patch core. In common.inc, in _drupal_flush_css_js(), add:

if (module_exists('varnish') && function_exists('varnish_purge_all_pages')) {
  varnish_purge_all_pages();
}

This still keeps Memcache and other in-Drupal caches intact, avoiding an unnecessary “clear all caches” operation, but makes sure Varnish doesn’t point to dead files. (You could take it a step further and purge only URLs that are Drupal-generated and not static; if you figure out the regex for that, please share.)

Per-page cookie logic

On AntiquesNearMe.com we have a cookie that remembers the last location you searched, which makes for a nicer UX. That cookie gets added to Varnish’s page “hash” and (correctly) bypasses the cache on pages that take that cookie into account. The cookie is not relevant to the rest of the site, however, so it should be ignored in those cases. How to handle this?

There are two ways to handle cookies in Varnish: strip cookies you know you don’t want, as in this old Pressflow example, or leave only the cookies you know you do want, as in Lullabot’s example. Each strategy has its pros and cons and works on its own, but it’s not advisable to combine them. I’m using Lullabot’s technique on this site, so to deal with the location cookie, I use if-else logic: if the cookie is available but not needed (determined by regex like req.url !~ "PATTERN" || ...), then strip it; otherwise keep it. If the cookie logic you need is more varied but still linear, you could create a series of elsif statements to handle all the use cases. (Just make sure to roast a huge pot of coffee first.)
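In sketch form, that strip-if-not-needed branch looks something like this (the cookie name and URL pattern are made up; the real rules live alongside the Lullabot-style whitelist logic):

sub vcl_recv {
  # If the location cookie is present but this page doesn't use it,
  # strip it so the request can still hit the cache.
  if (req.http.Cookie ~ "anm_location=" && req.url !~ "^/(sales|search)") {
    set req.http.Cookie = regsuball(req.http.Cookie, "anm_location=[^;]+(; )?", "");
  }
}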

Useful add-ons to varnish.module

  • Added watchdog('varnish', ...) commands in varnish.module on cache-clearing operations, so I could look at the logs and spot problems.
  • Added a block to varnish.module with a “Purge this page” button for each URL, shown only for admins. (I saw this in an Akamai module and it made a lot of sense to copy. I’d be happy to post a patch if others want this.)
  • The Expire module offers plug-and-play intelligence to selectively clear Varnish URLs only when necessary (clearing a landing page of blog posts only if a blog post is modified, for example). Much better than the default behavior of aggressively clearing “just in case”.

I hope this helps people adopt Varnish. I am also available via my consulting business New Leaf Digital for paid implementation, strategic advice, or support for Varnish-aided sites.

Oct 30 '11 7:21pm

Tilt, 3D DOM Inspector for Firefox

HTML pages appear in the browser in 2 dimensions, but there are actually 2 additional dimensions to the DOM: the hierarchy of elements, and the z-index. A new DOM inspector for Firefox called Tilt displays the DOM in 3 dimensions, showing the hierarchy of elements (not sure about the z-index). This is what the homepage of AntiquesNearMe.com looks like in 3D. Pretty cool.

Aug 21 '11 4:02pm

Brainstorming: Building an advertising system for AntiquesNearMe.com

One of our first revenue-generating features for Antiques Near Me (a startup antique sale-finding portal which I co-founded) is basic sponsorship ads, which we've simply been calling "featured listings." On the Boston portal, for example, the featured listing would be an antiques business in Boston, displayed prominently, clearly a sponsor, but organic (not a spammy banner ad).

I've been brainstorming how to build it, and the options span quite a range. I'll lay out some of my considerations:

  • The primary stack running the site is Drupal 6 in LAMP, cached behind Varnish to prevent server-side bottlenecks. We could build the whole system in Drupal, with Drupal-based ecommerce to sell it, and render the ads as part of the page. But if advertisers want to see stats (e.g. how many impressions/clicks their sponsorship has generated), a server-side approach has no single place to track impressions, since most pages are served straight from the Varnish cache and never hit Drupal.
  • The ad placement logic doesn't have to be fancy - we want the sponsorships to be exclusive for a given time period - so we don't need all the fancy math of DFP or OpenX for figuring out what ad to place where. But the system will eventually need to handle variable pricing, variable time frames, potential "inventory" to check for availability, and other basic needs of an ad system.
  • We're running AdSense ads through Google's free DFP service, so we could set up placements and ad units for each sponsor there. But that's a manual process, and we want the ad "real estate" to scale (eventually for each city and antiques category); so in the long-run it has to be automated. That requires DFP API integration. I've signed up for access to that API, and the PHP library looks robust, but the approval process is opaque, and I'm not sure this is the right approach.
  • A hybrid Drupal-DFP approach, with flexible ad placements in DFP and client-side variables passed in from Drupal to differentiate the stats, sounds nice. But it's not clear if this is feasible; information I've gotten from a big-biz AdOps guy suggests it's not with the free edition.
  • I could build a scalable, in-house, back-end solution using Node.js and MongoDB. In theory this can handle a lot more concurrent traffic (each request being very small and quick) than Drupal/LAMP. Mongo is already in use on the site, and I've wanted to learn Node for a while. But this would require learning Node well enough to deploy it comfortably, plus a custom bridge between Drupal (still handling the UI and transactions) and Node. It could take a while to roll out, and it adds another moving piece to an already complex stack.
  • Maybe there's a 3rd-party, off-the-shelf service to handle this that could be easily bridged with Drupal?

I'm curious how other sites handle similar requirements. Any ideas?