How to Scrape a Website with Cheerio



I recently launched a daft wee side-project to show people all the Number 1 songs on their birthday. To collect the data I had to scrape the Official Charts website as there was no official API available.

This article is going to explain how we can use Cheerio to scrape the site and gather all the information we need.

The full code for the site is available on GitHub.

Requirements

We're trying to grab the following 4 pieces of info (highlighted in the image below):

  • the date of the chart
  • the number 1 song name
  • the number 1 artist
  • the URL to the cover image

From the page which looks like this:

A screenshot of https://www.officialcharts.com/charts/singles-chart/20200101 with the date, number 1 song and artist, and cover image highlighted

What is Cheerio?

Cheerio is a DOM manipulation library, essentially taking the jQuery interface and allowing it to be used on the server. This means that we can load in an HTML page and use jQuery-like syntax to extract the bits of it we want.

What is Cheerio not?

Cheerio isn't a web browser and so doesn't interpret the page that is loaded in. It won't execute JavaScript or load external resources. It simply takes the HTML markup you give it, parses it and then lets you fiddle with it. For more complex web scraping projects, ones that need automated interactions or need the page to be loaded and rendered as it would be in a regular browser, you would be better served by something like Puppeteer.

Why Cheerio over Puppeteer then?

Since all the info we want is in the raw HTML of the Official Charts page we don't need any interactions with the page. Using Cheerio to just parse this HTML is blazing fast and super simple.

To the Code!

First let's install the two dependencies that we need - axios and cheerio. We use axios to make our HTTP request to the Official Charts site to grab the HTML that we will parse. The cheerio package is then used to parse that HTML and provide us the interface to query it.

npm install axios cheerio --save-dev

Now that they are installed we can require them both into our script.

const axios = require('axios');
const cheerio = require('cheerio');

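Now we can make a request to our website and grab the response. In my case I was hitting a URL like this: https://www.officialcharts.com/charts/singles-chart/20200101 which shows the chart listings for that date. We use axios's get method and await the response. Note that in a script that uses require, await only works inside an async function, so the snippets from here on are assumed to live in one.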

const axios = require('axios');
const cheerio = require('cheerio');

const url = 'https://www.officialcharts.com/charts/singles-chart/20200101';
const response = await axios.get(url);
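
If you need the chart for more than one date, the URL can be built rather than hard-coded. This is only a sketch: it assumes the last path segment is always the date as YYYYMMDD, which is what the example URL suggests, and chartUrl is a name made up for illustration.

```javascript
// Hypothetical helper: builds the chart URL for a given date.
// Assumes the final path segment is the date formatted as YYYYMMDD.
function chartUrl(date) {
  const year = date.getUTCFullYear();
  // getUTCMonth() is zero-based, so add 1 before padding.
  const month = String(date.getUTCMonth() + 1).padStart(2, '0');
  const day = String(date.getUTCDate()).padStart(2, '0');
  return `https://www.officialcharts.com/charts/singles-chart/${year}${month}${day}`;
}

const url = chartUrl(new Date(Date.UTC(2020, 0, 1)));
console.log(url);
// https://www.officialcharts.com/charts/singles-chart/20200101
```

The UTC getters are used so the result doesn't shift by a day depending on the time zone of the machine running the script.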

The response object has a data property which contains the markup that was returned. We use cheerio's load method to load in the HTML and we're ready to start querying it.

const axios = require('axios');
const cheerio = require('cheerio');

const url = 'https://www.officialcharts.com/charts/singles-chart/20200101';
const response = await axios.get(url);

const $ = cheerio.load(response.data);

We give the variable holding the parsed page the name $ so that it is short and jQuery-like.

Extracting The Date

We want to get the start date for this set of songs, which is contained in the following markup:

<p class="article-date">
27 December 2019 - 02 January 2020
</p>

So we select it like we would with regular jQuery, using a CSS selector:

const date = $('.article-date');

We can then extract the text content from it (the text() method), trim the whitespace off (trim()) and split it by the hyphen to get the start date (split('-')).

const date = $('.article-date');

const startEndDate = date.text().trim().split('-');

const startDate = startEndDate[0].trim();
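
The start date is still just a string at this point. If you want something sortable, one option is to turn it into an ISO date. The sketch below assumes the text always follows the "DD Month YYYY - DD Month YYYY" pattern shown above, and toIsoDate is a made-up helper name.

```javascript
// Month names as they appear in the chart date text.
const MONTHS = [
  'January', 'February', 'March', 'April', 'May', 'June',
  'July', 'August', 'September', 'October', 'November', 'December',
];

// Hypothetical helper: converts '27 December 2019' to '2019-12-27'.
// Returns null when the text doesn't match the assumed format.
function toIsoDate(text) {
  const match = /^(\d{1,2}) ([A-Za-z]+) (\d{4})$/.exec(text.trim());
  if (!match) return null;

  const [, day, monthName, year] = match;
  const monthIndex = MONTHS.indexOf(monthName);
  if (monthIndex === -1) return null;

  const month = String(monthIndex + 1).padStart(2, '0');
  return `${year}-${month}-${day.padStart(2, '0')}`;
}

const dateText = '27 December 2019 - 02 January 2020';
const startDate = dateText.trim().split('-')[0].trim();
const isoStartDate = toIsoDate(startDate);
console.log(isoStartDate); // 2019-12-27
```

Returning null for anything unexpected means a change in the page's date format shows up as missing data rather than a wrong date.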

Finding the Song and Artist Name

Due to the structure of the HTML the number 1 song is not as simple to find. We need to use a more complex selector to find the first row in the table.

const track = $('.chart tr.headings + tr .track .title-artist .title').text().trim();

const artist = $('.chart tr.headings + tr .track .title-artist .artist').text().trim();

Again we use the text() and trim() methods to get the text content and then tidy it up.

Getting the Cover Image

So far we've just grabbed the text content inside elements but, just as with jQuery, we can grab attributes from elements too. We want to get the URL to the cover image for the song, so we first select the img element in question.

const cover = $('.cover img');

Then we use the attr method to get the src value. Easy!

const cover = $('.cover img').attr('src');
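
Two things are worth guarding against here: attr returns undefined when the selector matches nothing, and a src value can be a relative path. The sketch below handles both using Node's built-in URL class. absoluteCoverUrl is a hypothetical helper, and the image paths are invented for the example.

```javascript
// Hypothetical helper: turns a possibly relative src into an absolute URL.
// Returns null when there is no src, e.g. the selector matched nothing.
function absoluteCoverUrl(src, pageUrl) {
  if (!src) return null;
  // The URL constructor resolves src against pageUrl; a src that is
  // already absolute is kept as it is.
  return new URL(src, pageUrl).href;
}

const pageUrl = 'https://www.officialcharts.com/charts/singles-chart/20200101';

// Made-up paths, purely to show the three cases.
const fromRelative = absoluteCoverUrl('/images/cover.jpg', pageUrl);
console.log(fromRelative); // https://www.officialcharts.com/images/cover.jpg

const fromAbsolute = absoluteCoverUrl('https://example.com/cover.jpg', pageUrl);
console.log(fromAbsolute); // https://example.com/cover.jpg

const fromMissing = absoluteCoverUrl(undefined, pageUrl);
console.log(fromMissing); // null
```

In the scraper this would be called with the result of $('.cover img').attr('src') and the URL that was requested.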

Taking it Further

That's a very quick intro to working with Cheerio and how I used it on the Birthday Number Ones project. If you're familiar with jQuery then you will not have trouble working with Cheerio. The docs are really comprehensive and have examples of just about every scenario you might need.

Happy scraping!
