Scraping a website and downloading files with Node.js using Axios and Cheerio

Usually, when people mention web scraping, the first thing that comes to mind is Python. However, Node.js also has various libraries that can handle web scraping, from the basic Request module to more complex solutions like Puppeteer and Nightmare. In this post, we will explore an alternative approach using the Axios and Cheerio libraries.

Axios

Promise based HTTP client for the browser and node.js

Cheerio

Fast, flexible & lean implementation of core jQuery designed specifically for the server.

One of the most important things when writing code is reading the documentation of the tools you are going to use. If you have already done that, I assume you understand the purposes of the above-mentioned tools. Basically, Axios lets us make HTTP requests from Node.js, much like the fetch API in the browser, and Cheerio lets us parse and manipulate the response we get from Axios (which contains the markup of a particular webpage).
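
To make that concrete, here is a minimal sketch of the two libraries working together; the URL https://example.com and the title selector are only stand-ins for illustration, not part of our actual scraper:

const axios = require('axios');
const cheerio = require('cheerio');

// Fetch a page and read its <title>; example.com is a placeholder URL.
(async () => {
  const response = await axios.get('https://example.com')
  const $ = cheerio.load(response.data)   // response.data holds the raw HTML
  console.log($('title').text())          // prints the page's title text
})()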

Our Target...

books.goalkicker.com is a website where you can find lots of ebooks containing newbie-friendly (this can be debatable) material for many programming languages. Our aim is to scrape its webpages and download all the PDFs available for download. So create a folder for your project, create an app.js inside it, set up a package.json, and install the two dependencies with npm install axios cheerio.

Okay, let's get into coding...

const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');

We have already talked about the first two libraries above. As for fs, from the Node.js website -

The fs module provides an API for interacting with the file system in a manner closely modeled around standard POSIX functions.

const url = 'https://books.goalkicker.com/'
let linkList = []
let dlinkList = []

Here we assign the site's URL to a constant and also declare two arrays for later use.

(async () => {
  await getWebsiteLinks(url)
  await downloadLinks(linkList)
  await downloadFiles(dlinkList)
})()

Throughout the script, we will be using async/await, as I prefer it over callback hell or bewildering promise chaining. However, the Axios documentation itself uses promise chaining, so you can use that too.

Here we wrote an asynchronous function that gets invoked immediately (an async IIFE). While writing this post I found a nice article where you can learn more about async/await.
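
If you prefer promise chaining, the same orchestration could look roughly like this; the three functions below are the ones we are about to write, and since they are async they all return promises:

getWebsiteLinks(url)
  .then(() => downloadLinks(linkList))
  .then(() => downloadFiles(dlinkList))
  .catch((error) => console.error(error))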

Inside the immediately invoked function, we will be calling three more async functions, each with a different purpose:

  • getWebsiteLinks(url) - This will scrape the links of all the webpages that contain the ebooks and store them in an array called linkList.
  • downloadLinks(linkList) - This will scrape the download link of the file on each page and store them in an array called dlinkList.
  • downloadFiles(dlinkList) - This will download the files using the links stored in dlinkList.

const getWebsiteLinks = async (url) => {
  try {
    const response = await axios.get(url)
    const $ = cheerio.load(response.data)
    $('div.bookContainer').each(function (i, elem) {  
      let link = $(elem).find('a').attr('href')
      linkList.push(url+link)
    });
  } catch (error) {
    console.error(error)
  }
}

Using Axios, we send a GET request to books.goalkicker.com and store the response in a variable called response. If you console.log the response, you will see a lot of information, most of which we do not need. This is basically what our browser receives when we visit a link. We will be focusing on the information stored under response.data. It contains the HTML structure of the page, the same markup we see when we use the Inspect Element tool in a browser. We then load that markup into a Cheerio variable to parse and manipulate it.
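
If you are curious about what else is in there, this quick sketch (written as if inside an async function) logs a few standard fields of an Axios response; status, headers, and data are all part of every Axios response object:

const response = await axios.get(url)
console.log(response.status)                   // e.g. 200 for a successful request
console.log(response.headers['content-type'])  // e.g. 'text/html; charset=utf-8'
console.log(response.data.substring(0, 200))   // first 200 characters of the page's HTML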

While inspecting the page in the browser, we found that the link to each book sits inside a div with the class name "bookContainer". So we use the each method (think of it as jQuery's forEach) on all the divs with that class. Inside each one, we pick the a tag, store the value of its href attribute in a variable called link, and push it into an array called linkList.
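
As a self-contained illustration of that selection logic, here is a tiny sketch run against made-up markup; the HTML string below is only a stand-in for the real bookContainer divs on the site:

const cheerio = require('cheerio')

const html = `
  <div class="bookContainer"><a href="ExampleBook/">Example Book</a></div>
  <div class="bookContainer"><a href="AnotherBook/">Another Book</a></div>
`
const $ = cheerio.load(html)
$('div.bookContainer').each(function (i, elem) {
  // prints "ExampleBook/" and then "AnotherBook/"
  console.log($(elem).find('a').attr('href'))
})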

const downloadLinks = async (linkList) => {
  for (const link of linkList) {
    const response = await axios.get(link)
    const $ = cheerio.load(response.data)
    let name = $('.download').attr("onclick")
    name = name.match(/location\.href\s*=\s*['"]([^'"]*)['"]/)
    let dlink = link + name[1]
    dlinkList.push({
      name: name[1],
      dlink: dlink
    })
  }
}

Now that we have the links to all the book pages, we visit each of them in a for loop, making a GET request with Axios and parsing the response with Cheerio. Again using the Inspect Element tool, we found that the download link is tucked inside the onclick attribute of the download button, mixed in with some other stuff (for example - "location.href='DotNETFrameworkNotesForProfessionals.pdf'"). So we use a regex (regular expression) to pull the file name out of that string. Once we have it, we build the full download link and push both the name and the link into an array called dlinkList.
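
To see the regex step in isolation, here is a small sketch using the example onclick value from above:

const onclick = "location.href='DotNETFrameworkNotesForProfessionals.pdf'"
const match = onclick.match(/location\.href\s*=\s*['"]([^'"]*)['"]/)
console.log(match[1])   // "DotNETFrameworkNotesForProfessionals.pdf"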

const downloadFiles = async (dlinkList) => {
  for (const link of dlinkList) {
    let name = link.name // the file name we captured already ends in .pdf
    let url = link.dlink
    let file = fs.createWriteStream(name)
    const response = await axios({
      url,
      method: 'GET',
      responseType: 'stream'
    })
    response.data.pipe(file)
  }
}

We are almost at the end of our scraping journey. Now we need to download the files using the links we extracted above. This is a bit confusing, so I would suggest you read the fs documentation thoroughly. First, we take the file name (which already carries the .pdf extension) from the object we stored earlier. With the current setup, the files will be downloaded into the folder where app.js is stored. Using fs.createWriteStream(), we create a writable stream into which we can pipe the data of the file we are going to download. Using Axios, we request the file as a stream (responseType: 'stream'), and with response.data.pipe(file) we save it to disk.
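
One thing to note: the loop above starts each download but does not wait for the file to finish writing before moving on. If you want each download to fully complete (and surface write errors) before the next one starts, a possible refinement is to wrap the stream in a promise; downloadFile below is just a hypothetical helper, not part of the original script:

const downloadFile = async (dlink, name) => {
  const response = await axios({
    url: dlink,
    method: 'GET',
    responseType: 'stream'
  })
  // Resolve only once the write stream has flushed the whole file to disk.
  await new Promise((resolve, reject) => {
    const file = fs.createWriteStream(name)
    response.data.pipe(file)
    file.on('finish', resolve)
    file.on('error', reject)
  })
}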

Once we run the whole script, within a few minutes all the books the site has will be downloaded to your machine. You can see the full code in my GitHub repository.

This is the first time I have written a tutorial, so please do let me know if I have made any mistakes and how I can improve. Have a good day ahead.
