The Google Chrome team released a web scraping tool called Puppeteer in the not-so-distant past and I’ve been keen to play with it. I wrote a small script that uses it. If you’d like to follow along with the source code, it is located here.
What exactly is Puppeteer?
Essentially, it’s a Node.js library that gives you an API for controlling a web browser (Chromium). You write some code, and that code drives a browser to do things for you. When starting out I kept hearing it called a headless browser. Headless in this case refers to the fact that it runs without actually opening a visible browser window to look at. I’m glossing over a lot of details, but that’s the basics of it.
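As a tiny illustration (a minimal sketch along the lines of Puppeteer’s quick start example, not part of this project), a few lines of code are enough to open a page and save a screenshot of it:
const puppeteer = require('puppeteer');
(async () => {
  // Launch headless Chromium, load a page, screenshot it, then clean up
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });
  await browser.close();
})();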
Fake album covers
There is a subreddit I like to visit called r/fakealbumcovers, where people post fake album covers based on images or phrases they’ve seen in the wild. (Ever had the thought, “This would make a great band/album name”? Same thing.) Some time ago I had the great idea of downloading my favorite ones, adding them to a collection, and displaying them as a screensaver slideshow. However, I found myself doing a lot of manual work, so I wanted to automate the process! Enter Puppeteer.
Getting started
The quick start documentation for Puppeteer is great. It is as simple as installing it using npm or yarn, then require()-ing it in our project. Now let’s navigate it to r/fakealbumcovers:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.reddit.com/r/fakealbumcovers/');
  // ...
})();
The example uses the new JS async/await syntax for asynchronous code. Briefly, an async function can contain an await expression, which pauses execution of the async function until the awaited promise resolves.
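A minimal sketch of that behavior (not from the original script; wait() is just a hypothetical helper):
// Execution of an async function pauses at each await until the promise settles
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
(async () => {
  console.log('before');
  await wait(1000); // pauses this async function for ~1 second
  console.log('after');
})();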
Finding the image links
After launching Puppeteer and pointing it at r/fakealbumcovers, I needed a way for it to grab the image links; once we have those, we can figure out how to download them all. Initially, I wrote code to click every single link and copy the URL. This… proved to be very error-prone, so I had to find another way.
If we look at the source, it turns out that image links are embedded as data-url in each post (relevant bits only):
<div class="thing" id="thing_t3_7ggt1e" onclick="click_thing(this)"
...
data-url="https://i.imgur.com/WbA82ts.png"
data-permalink="/r/fakealbumcovers/comments/7ggt1e/dear_gog_im_doing/"
data-domain="i.imgur.com"
data-rank="2"
data-comments-count="6"
data-score="38"
...
>...</div>
Reddit is also nice about giving us a lot of information on each post – such as the permalink, its ranking on the subreddit, comment count, score, etc. This is all very useful information we can use to pick out what posts to scrape.
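Since these are plain data-* attributes, they also show up on each element’s dataset property, which makes them easy to poke at in the browser console (a quick example, not from the post):
// In the devtools console: data-comments-count becomes dataset.commentsCount
const post = document.querySelector('#siteTable > .thing');
console.log(post.dataset.url, post.dataset.score, post.dataset.commentsCount);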
Grabbing the links
Now that we’ve found where each image link is located, it’s time to grab all of them. There are a few ways to go about this, but the simplest way I found was to trace the DOM down to each post. It might also be helpful to see what Puppeteer is actually seeing, and this can be done by passing an option to Puppeteer on launch:
puppeteer.launch({ headless: false })
This will launch a full version of Chromium which we can use to view/inspect the rendered page.
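If you want to watch it work at a more human pace, Puppeteer’s launch options also include slowMo, which delays each operation (not something this project needed, just handy for debugging):
const browser = await puppeteer.launch({
  headless: false,
  slowMo: 100 // delay each Puppeteer operation by 100 ms so it is easier to follow
});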
From viewing the HTML, we can see that all posts have the class thing, so let’s grab all of them and then parse out the image links into an array. We can have Puppeteer run a script in the context of the page using page.evaluate():
const results = await page.evaluate(() => {
  const imageLinks = [];
  // Every post is a div with the class "thing" under #siteTable
  const elements = document.querySelectorAll('.content > .spacer > #siteTable > .thing');
  for (const el of elements) {
    // Keep only posts whose data-url points at an image file
    if (el.dataset.url.search(/\.(jpg|png)$/) >= 0) {
      imageLinks.push(el.dataset.url);
    }
  }
  return imageLinks;
});
A few things to note here:
- Using the browser developer tools allows us to see exactly what each node contains by running querySelectorAll() in the console.
- There were stickied posts at the top that did not have any image links, so I had to filter those out.
- It is also possible here to do additional filtering on post rank, score, etc. Perhaps only grabbing images with a score ≥ 500, as in the sketch below.
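A rough sketch of that kind of filtering (my own variation, not part of the original script), reusing the same data attributes:
const popularLinks = await page.evaluate(() => {
  const links = [];
  for (const el of document.querySelectorAll('.content > .spacer > #siteTable > .thing')) {
    const isImage = /\.(jpg|png)$/.test(el.dataset.url);
    const score = parseInt(el.dataset.score, 10);
    // Keep only image posts with a score of at least 500
    if (isImage && score >= 500) {
      links.push(el.dataset.url);
    }
  }
  return links;
});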
Downloading the images
At this point, we have the links to the images and it is just a matter of downloading them. I initially began by writing my own code to download the images, but ultimately found a small npm library that did the same. The library makes downloading a single image simple, but we have more than 20 images to download. Fortunately, the library has support for async/await, and we can use that to our advantage. We write an async function that will await scraping all the image links, then await downloading all the images. This can be done with Promise.all(), which takes an array of promises as its argument. We can create this array of promises by mapping over the image links with a callback that awaits each download; since the callback is async, it effectively returns a promise.
const downloadAll = async () => {
  // Scrape the image URLs from the subreddit first
  const imgs = await scrapeImgUrls();
  // Start every download and wait until all of them have finished
  await Promise.all(imgs.map(async (file) => {
    await downloadImg({
      url: file,
      dest: '../../../../Pictures/covers'
    });
  }));
  console.log("Done.");
};
The second await, used with Promise.all(), is not necessary for the code to work correctly, but it is necessary if I want to print the log statement only after all the images have been downloaded.
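To make that concrete, here is what the eager version would look like (a sketch, not code from the project): without the await, the promise returned by Promise.all() is never waited on, so the log fires while the downloads are still in flight:
const downloadAllEager = async () => {
  const imgs = await scrapeImgUrls();
  // No await here: the downloads are started but not waited on
  Promise.all(imgs.map((file) => downloadImg({ url: file, dest: '../../../../Pictures/covers' })));
  console.log("Done."); // prints immediately, before the downloads finish
};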
The only downside here is that I hardcoded the destination path where the images will be downloaded, and for my use case this was fine. However, if I would ever like to use this for something else, I would need to figure out a more appropriate solution.
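One simple option (just an idea, nothing the project actually does) would be to read the destination from a command-line argument and fall back to the hardcoded path, assuming downloadAll is changed to accept it as a parameter:
// Hypothetical tweak: node scrape.js /path/to/covers
const dest = process.argv[2] || '../../../../Pictures/covers';
downloadAll(dest); // assumes downloadAll now takes the destination as a parameter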
Fin
And that’s it. This was a fun little project for learning how to use Puppeteer. It took a little bit of trial and error to find the appropriate selector for each post, but overall I’m happy with how this turned out. I hope you enjoyed this writeup as much as I enjoyed working on the project. Once again, the full source code is here. Let me know if you have any questions or comments.