Making a Scraping App

March 25, 2016 | Project

I listen to podcasts when I’m driving to and from work, and one of my favorites (mostly for the people he interviews) is the Tim Ferriss Show. He recently did a fantastic interview with Cal Fussman, a well-seasoned writer for Esquire magazine. When asked what advice, after 35+ years of writing, he’d give someone just starting out, Cal’s answer was simple:

“Just write. Write and keep writing. Don’t say ‘I need school to make me write’. You make you write.”

In the same way, let’s just Node; or as Shoptalk Show likes to say “just. build. websites”.

The Mission

March Madness is hot & heavy right now, and although I’m not a ‘4x4-driving, college-sports-loving, football-tailgating, bud-light-drinking’ type of a guy, a few high school friends and I have a pool going, and we like to smack talk throughout March. So I decided to make a site that redesigned the interface to something clearer, and more March-like. No real point, but some fun relevance. Oh yeah, it’s gotta be free too.

The Plan

Use Node to scrape ESPN’s site every ten minutes and pull out the standings: bracket names, user names, and each bracket’s score as a percentage. Cache the results in a database, then render a site that reformats the data how I want it.

All the Things

Node offers tons of flexibility, which is why it’s so powerful, but getting started with all of these options can be overwhelming. After some trial and error, I found that using the skeleton express example provided by the express application generator is a good place to start.

Anything blocking in Node is a no-no. The Node replacement is a callback: a way of saying, “when this work is done, call this function”. Callbacks can be super confusing, and you can easily get yourself into spaghetti trouble if you chain them. A lot of people use a Node module called q, a promise library, to keep nested callbacks under control. I was able to keep things somewhat shallow using traditional callbacks, but as you’ll see, it’s a bit headache-inducing.
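To make the callback idea concrete, here’s a minimal sketch. The function slowDouble is made up for this example: it stands in for any slow job (a database call, a scrape), fakes the wait with setTimeout, and follows Node’s error-first callback convention.

```javascript
// slowDouble is a hypothetical stand-in for slow work (a DB call, a scrape).
// Rather than blocking, it returns immediately and calls `callback` later.
function slowDouble(n, callback) {
  setTimeout(function () {
    if (typeof n !== 'number') {
      // Node convention: the first callback argument is the error, if any
      return callback(new Error('expected a number'));
    }
    callback(null, n * 2);
  }, 10);
}

console.log('1. asking for the result');
slowDouble(21, function (err, result) {
  if (err) return console.error(err.message);
  console.log('3. got ' + result);
});
console.log('2. this line runs before the result arrives');
```

The numbered lines print in order 1, 2, 3: the code after the call keeps running, and the callback fires once the “work” finishes.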

Okay, let’s do it

Prior to this weekend, my Node experience stopped at what I needed to know to use Grunt. My server knowledge is also fairly basic, so this is by no means an expert example. Get started by pulling down the repo for this Node app here.

The Structure

app.js - this is our main application JavaScript; it starts the party
Procfile - this is what tells Heroku what to run
src/ - our client side source sass & js
views/ - jade files & template that generates our HTML and integrates data
routes/ - since we’re a single page, our main route file, passes data into templates
public/ - all of our generated CSS/JS and static files
modules/ - functions/modules used in the Node app.

Installing a new package

If you haven’t yet, check out one of the many Node hello world tutorials. Those help to understand the very basics of Node.js.

Running the command npm install [package name] --save does the trick. It’ll add a dependency to your package.json, but in order to actually use it (for instance in your app.js), you need to assign it to a variable and require it before using it, like so: var [package name] = require('[package name]');

The Stack

We’ll be using the tools below to accomplish our mission. It’s pretty close to what the cool kids call a MEAN stack (Mongo, Express, Angular, Node), which is at the opposite end of the ring from the (perhaps more familiar) LAMP stack (Linux, Apache, MySQL, and PHP - think WordPress). The specific tools we’re using have some really swanky names, check it out:

Node.js - Replaces Apache, as JavaScript that runs on a server. The key difference between Node and JS written for the browser is that Node’s I/O (files, network, database calls) is asynchronous by default, and obviously there’s no DOM to interact with. That means you need to make sure you’re using callbacks, as covered above.
Heroku - worry free hosting. Seriously, two commands to add Heroku as a remote to your git repository, and you’re good to go. Also, a reasonable free hosting tier (they charge when you scale things).
Cheerio, Express, PhantomJS - All the Node packages we’ll use. Cheerio is jQuery for Node, express is the most popular web framework for Node, and phantom is what we’ll use to scrape our webpage and return data.
MongoDB is a NoSQL database. This basically means we can format our data however we want (in this case, basically JSON), and push it up. I used mLab’s MongoDB add-on because it’s got a free tier and it’s easy to use.
Jade is an HTML templating language that seems to be the most popular with Node folks, mostly because it pairs so nicely. It’s white-space aware = 👎, but has a really shallow learning curve = 👍. I tried using handlebars instead, and it turned out to have less documentation and added another layer of unnecessary complexity for someone just starting out.

*Side note: I originally followed this tutorial, which uses a Node package called request instead of phantom. Turns out to be great if you’re parsing static HTML, but if your page is generated client-side via JS (AJAX or otherwise), you’ll want to make the call using phantom, which will render the page first, then scrape.

What’s happening

The app.js sets things up, initiating routes and environments. It points the main route at routes/index.js, which imports and calls cacheData.js, and uses its returned values to render a Jade view for the page.
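As a rough sketch of that route, the handler might look something like the code below. This is an illustration rather than the code from the repo: the view name index, the brackets local, and cacheData’s callback signature are all assumptions, and cacheData is passed in as an argument only so the example runs without Express or a database.

```javascript
// Sketch of the handler in routes/index.js. `cacheData` is passed in here
// only so the example runs on its own; its callback signature is an assumption.
function makeIndexHandler(cacheData) {
  return function (req, res, next) {
    cacheData(function (err, brackets) {
      if (err) return next(err); // hand errors to Express's error handling
      // res.render fills the named view with these template variables
      res.render('index', { brackets: brackets });
    });
  };
}

// Stand-ins so the sketch can run without Express or a database
var fakeCacheData = function (callback) {
  callback(null, [{ bracketName: 'Example Bracket', percent: 50 }]);
};
var fakeRes = {
  render: function (view, locals) {
    console.log('render', view, 'with', locals.brackets.length, 'bracket(s)');
  }
};

makeIndexHandler(fakeCacheData)({}, fakeRes, function (err) { throw err; });
```

In a real Express app you’d register the handler with something like router.get('/', makeIndexHandler(cacheData)), and Express would supply the real req and res objects.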

The cacheData module (modules/cacheData.js) makes a call to the database, pulling down all of the data and checking whether the timestamp field is more than 10 minutes older than the current time. If it is, it calls getData; if it isn’t, it returns the data from the initial call back to the index route.
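Here’s a minimal sketch of that check-then-refresh logic. It is not the repo’s code: the store object (with load and save) and the injected getData are hypothetical stand-ins for the MongoDB calls and the scraper, and the record shape is an assumption beyond the timestamp field mentioned above.

```javascript
var TEN_MINUTES = 10 * 60 * 1000; // cache lifetime in milliseconds

// True when the cached record is missing or older than maxAgeMs.
function isStale(record, now, maxAgeMs) {
  return !record || now - record.timestamp > maxAgeMs;
}

// `store` and `getData` are injected hypothetical stand-ins for the
// database and the scraper, so the sketch runs without either one.
function cacheData(store, getData, callback) {
  store.load(function (err, record) {
    if (err) return callback(err);
    if (!isStale(record, Date.now(), TEN_MINUTES)) {
      return callback(null, record.data); // fresh enough: serve the cache
    }
    getData(function (err, data) {
      if (err) return callback(err);
      var fresh = { timestamp: Date.now(), data: data };
      store.save(fresh, function (err) {
        if (err) return callback(err);
        callback(null, data);
      });
    });
  });
}

// In-memory store, for demonstration only
function makeMemoryStore(initial) {
  var record = initial || null;
  return {
    load: function (cb) { cb(null, record); },
    save: function (next, cb) { record = next; cb(null); }
  };
}

var store = makeMemoryStore();
var scrapeCount = 0;
var fakeGetData = function (cb) { scrapeCount += 1; cb(null, ['standings']); };

cacheData(store, fakeGetData, function () {
  cacheData(store, fakeGetData, function (err, data) {
    console.log(data, 'scrapes so far:', scrapeCount);
  });
});
```

The first call finds an empty cache and scrapes; the second call lands inside the ten-minute window, so it’s served from the cache and the fake scraper only runs once.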

The getData module uses phantom to hit, render and scrape the URL provided. Upon receipt, it’ll convert the data to an array of objects and return that to cacheData, which in turn returns the data to index.js.
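The “convert to an array of objects” step could be sketched like this. The row layout and the field names (bracketName, userName, percent) are assumptions for illustration, and the sample rows are made up, not real ESPN data; the scraping itself is left out.

```javascript
// Field names and row layout are assumptions for illustration; the real
// scrape would yield whatever the page's markup happens to contain.
function rowsToBrackets(rows) {
  return rows.map(function (row) {
    return {
      bracketName: row[0].trim(),
      userName: row[1].trim(),
      percent: parseFloat(row[2]) // "87.5%" -> 87.5
    };
  });
}

// Made-up sample rows, not real ESPN data
var sampleRows = [
  [' Bracket A ', 'user_one', '87.5%'],
  ['Bracket B', 'user_two', '42%']
];

console.log(rowsToBrackets(sampleRows));
```

Keeping this step as a plain function means the template only ever sees tidy objects, no matter how messy the scraped text was.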

Get Codin’

Since we’re essentially running a single page app, most of the action will need to be loaded into routes/index.js. To break things out into modules, you can essentially plop your function into a separate file (what I did with cacheData.js and getData.js), then require the module like this:

var getData = require('../modules/getData.js'). This allows you to then run the function as if it were written in the same file. Be sure to export the function at the end of the file (for an example, check out getData.js).

Scrape responsibly

This example has a minimal footprint, hitting the ESPN servers at most once every 10 minutes and caching the results. Scraping isn’t illegal (although it can be against the terms and conditions of a site), but it can be frowned upon, especially if it’s done carelessly, so be sure you’re cachin’ what you’re scrapin’.

Check out the live version up on Heroku.