The Google Chrome team released a web scraping tool called Puppeteer in the not-so-distant past and I’ve been keen to play with it. I wrote a small script that uses it. If you’d like to follow along with the source code, it is located here.
What exactly is Puppeteer?
Essentially, it’s a Node.js library that gives you an API for controlling a web browser (Chromium). You write some code, and that code drives a browser to do things for you. When starting out I kept hearing it called a headless browser. Headless in this case refers to the fact that it runs without actually opening a visible browser window to look at. I’m glossing over a lot of details, but that’s the basics of it.
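As a tiny illustration (a minimal sketch along the lines of Puppeteer’s quick start example, not part of this project), a few lines of code are enough to open a page and save a screenshot of it:
const puppeteer = require('puppeteer');
(async () => {
  // Launch headless Chromium, load a page, screenshot it, then clean up
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });
  await browser.close();
})();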
Fake album covers
There is a subreddit I like to visit called r/fakealbumcovers, where people post fake album covers based on images or phrases they’ve seen in the wild. (Ever had the thought, “This would make a great band/album name”? Same thing.) Some time ago I had the great idea of downloading my favorite ones, adding them to a collection, and displaying them as a screensaver slideshow. However, I found myself doing a lot of manual work, so I wanted to automate the process! Enter Puppeteer.
Getting started
The quick start documentation for Puppeteer is great. It is as simple as installing it using npm or yarn, then require()-ing it in our project. Now let’s navigate it to r/fakealbumcovers:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.reddit.com/r/fakealbumcovers/');
  // ...
})();
The example uses the new JS async/await syntax for asynchronous code. Briefly, an async function can contain an await expression, which pauses execution of the async function until the awaited promise resolves.
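A minimal sketch of that behavior (not from the original script; wait() is just a hypothetical helper):
// Execution of an async function pauses at each await until the promise settles
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
(async () => {
  console.log('before');
  await wait(1000); // pauses this async function for ~1 second
  console.log('after');
})();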
Finding the image links
After launching Puppeteer and pointing it at r/fakealbumcovers, I needed a way for it to grab the image links; once we have those, we can figure out how to download them all. Initially, I wrote code to click every single link and copy the URL. This… proved to be very error-prone, so I had to find another way.
If we look at the source, it turns out that image links are embedded as data-url in each post (relevant bits only):
<div class="thing" id="thing_t3_7ggt1e" onclick="click_thing(this)"
...
data-url="https://i.imgur.com/WbA82ts.png"
data-permalink="/r/fakealbumcovers/comments/7ggt1e/dear_gog_im_doing/"
data-domain="i.imgur.com"
data-rank="2"
data-comments-count="6"
data-score="38"
...
>...</div>
Reddit is also nice about giving us a lot of information on each post – such as the permalink, its ranking on the subreddit, comment count, score, etc. This is all very useful information we can use to pick out what posts to scrape.
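Since these are plain data-* attributes, they also show up on each element’s dataset property, which makes them easy to poke at in the browser console (a quick example, not from the post):
// In the devtools console: data-comments-count becomes dataset.commentsCount
const post = document.querySelector('#siteTable > .thing');
console.log(post.dataset.url, post.dataset.score, post.dataset.commentsCount);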
Grabbing the links
Now that we’ve found where each image link is located, it’s time to grab all of them. There are a few ways to go about this, but the simplest way I found was to trace the DOM down to each post. It might also be helpful to see what Puppeteer is actually seeing, and this can be done by passing an option to Puppeteer on launch:
puppeteer.launch({ headless: false })
This will launch a full version of Chromium which we can use to view/inspect the rendered page.
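If you want to watch it work at a more human pace, Puppeteer’s launch options also include slowMo, which delays each operation (not something this project needed, just handy for debugging):
const browser = await puppeteer.launch({
  headless: false,
  slowMo: 100 // delay each Puppeteer operation by 100 ms so it is easier to follow
});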
From viewing the HTML, we can see that all posts have the class thing, so let’s grab all of them and then parse out the image links into an array. We can have Puppeteer run a script in the context of the page using page.evaluate():
const results = await page.evaluate(() => {
  const imageLinks = [];
  // Every post is a div with the class "thing" under #siteTable
  const elements = document.querySelectorAll('.content > .spacer > #siteTable > .thing');
  for (const el of elements) {
    // Keep only posts whose data-url points at an image file
    if (el.dataset.url.search(/\.(jpg|png)$/) >= 0) {
      imageLinks.push(el.dataset.url);
    }
  }
  return imageLinks;
});
A few things to note here:
- Using the browser developer tools allows us to see exactly what each node contains by running querySelectorAll() in the console.
- There were stickied posts at the top that did not have any image links, so I had to filter those out.
- It is also possible here to do additional filtering on post rank, score, etc. Perhaps only grabbing images with a score ≥ 500, as in the sketch below.
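A rough sketch of that kind of filtering (my own variation, not part of the original script), reusing the same data attributes:
const popularLinks = await page.evaluate(() => {
  const links = [];
  for (const el of document.querySelectorAll('.content > .spacer > #siteTable > .thing')) {
    const isImage = /\.(jpg|png)$/.test(el.dataset.url);
    const score = parseInt(el.dataset.score, 10);
    // Keep only image posts with a score of at least 500
    if (isImage && score >= 500) {
      links.push(el.dataset.url);
    }
  }
  return links;
});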
Downloading the images
At this point, we have the links to the images and it is just a matter of downloading them. I initially began by writing my own code to download the images, but ultimately found a small npm library that did the same. The library makes downloading a single image simple, but we have more than 20 images to download. Fortunately, the library has support for async/await, and we can use that to our advantage. We write an async function that will await scraping all the image links, then await downloading all the images. This can be done with Promise.all(), which takes an array of promises as its argument. We can create this array of promises by mapping over the image links with a callback that awaits each download; since the callback is async, it effectively returns a promise.
const downloadAll = async () => {
  // Scrape the image URLs from the subreddit first
  const imgs = await scrapeImgUrls();
  // Start every download and wait until all of them have finished
  await Promise.all(imgs.map(async (file) => {
    await downloadImg({
      url: file,
      dest: '../../../../Pictures/covers'
    });
  }));
  console.log("Done.");
};
The second await, used with Promise.all(), is not necessary for the code to work correctly, but it is necessary if I want to print the log statement only after all the images have been downloaded.
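To make that concrete, here is what the eager version would look like (a sketch, not code from the project): without the await, the promise returned by Promise.all() is never waited on, so the log fires while the downloads are still in flight:
const downloadAllEager = async () => {
  const imgs = await scrapeImgUrls();
  // No await here: the downloads are started but not waited on
  Promise.all(imgs.map((file) => downloadImg({ url: file, dest: '../../../../Pictures/covers' })));
  console.log("Done."); // prints immediately, before the downloads finish
};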
The only downside here is that I hardcoded the destination path where the images will be downloaded, and for my use case this was fine. However, if I would ever like to use this for something else, I would need to figure out a more appropriate solution.
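One simple option (just an idea, nothing the project actually does) would be to read the destination from a command-line argument and fall back to the hardcoded path, assuming downloadAll is changed to accept it as a parameter:
// Hypothetical tweak: node scrape.js /path/to/covers
const dest = process.argv[2] || '../../../../Pictures/covers';
downloadAll(dest); // assumes downloadAll now takes the destination as a parameter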
Fin
And that’s it. This was a fun little project for learning how to use Puppeteer. It took a little bit of trial and error to find the appropriate selector for each post, but overall I’m happy with how this turned out. I hope you enjoyed this writeup as much as I enjoyed working on the project. Once again, the full source code is here. Let me know if you have any questions or comments.