In the Node.js world, Puppeteer is the go-to library for web scraping, as it provides an API to control a Chromium browser. But if you want to use Puppeteer inside a Lambda function, it will not work out of the box: bundling your function together with a Chromium binary pushes it well past the 50 MB limit for Lambda deployment packages. So, simply put, we can’t use plain Puppeteer in Lambda functions.

Solution

Enter chrome-aws-lambda. This library brings Puppeteer’s functionality to Lambda by shipping a Chromium build suited to the Lambda environment and using the puppeteer-core library as a dependency. Let’s get started.

Note: This post assumes that you have a working knowledge of Lambda and Node.js.

To get started, install the dependencies:

npm install chrome-aws-lambda --save-prod
npm install puppeteer-core --save-prod

In the index.js file for the Lambda function, paste the following code:

const chromium = require('chrome-aws-lambda');

exports.handler = async (event, context, callback) => {
    var baseUrl = 'https://github.com/enterprise';
    
    let browser = null;
    
    try {
        // Initialize and configure the Chromium browser
        browser = await chromium.puppeteer.launch({
            args: chromium.args,
            defaultViewport: chromium.defaultViewport,
            executablePath: await chromium.executablePath,
            headless: true,
            devtools: false
        });
        
        // Start a new page
        let page = await browser.newPage();
        
        await page.setViewport({ width: 1200, height: 800 });
        // Navigate and wait until the page has fully loaded
        await page.goto(baseUrl, { waitUntil: ['load', 'domcontentloaded', 'networkidle0'] });
        
        // Return the success response
        callback(null, "Success");
    } catch (error) {
        console.log('Error while scraping ' + baseUrl + '. Error = ' + error.toString());
        callback(null, "Failure");
    } finally {
        if (browser !== null) {
          await browser.close();
        }
    }
};

Let’s go into the details. First up, the exports.handler method is async. We need this because most of the Chromium operations are asynchronous, and we have to wait for them to finish using the await keyword.

Next we initialise the browser, fire up a new page and navigate to the GitHub Enterprise page. The line await page.goto(baseUrl, { waitUntil: ['load', 'domcontentloaded', 'networkidle0'] }); makes the browser wait until the load and DOMContentLoaded events have fired and there have been no network connections for at least 500 ms. See the Puppeteer documentation for page.goto for more details on these waitUntil options.
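If waiting for a completely idle network turns out to be too strict (some pages keep long-polling connections open), networkidle2 is a looser alternative that tolerates up to two in-flight connections. A minimal sketch, using the same page and baseUrl as above, with an explicit timeout as a safety net:

        // Wait for the DOM plus a near-idle network; give up after 30 seconds.
        await page.goto(baseUrl, {
            waitUntil: ['domcontentloaded', 'networkidle2'],
            timeout: 30000 // milliseconds (Puppeteer's default navigation timeout is also 30 s)
        });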

Autoscrolling

Next, let’s assume that we need to scroll down to the bottom of the page. Add the following method to the index.js file:

async function autoScroll(page){
    await page.evaluate(async () => {
        await new Promise((resolve, reject) => {
            //debugger;
            var totalHeight = 0;
            var distance = 100;
            var timer = setInterval(() => {
                var scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;
                
                if(totalHeight >= scrollHeight){
                    clearInterval(timer);
                    resolve();
                }
            }, 100);
        });
    });
}

Notice the line of code debugger; (commented out above). This is handy if you set devtools to true in the launch configuration: the debugger keyword automatically triggers a breakpoint at that location while the code is running, but only when DevTools is actually attached, which in turn means the browser cannot be headless.
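Inside Lambda there is no display, so this is really a local-debugging trick. Here is a sketch of a debug-friendly launch configuration (assumption: you are running the handler locally with the full puppeteer package installed as a devDependency, since chrome-aws-lambda’s bundled Chromium binary is only available inside the Lambda environment):

        // Local debugging only: visible browser with DevTools auto-opened, so any
        // `debugger;` statement inside page.evaluate() pauses script execution.
        browser = await chromium.puppeteer.launch({
            args: chromium.args,
            defaultViewport: chromium.defaultViewport,
            executablePath: await chromium.executablePath, // may be null outside Lambda
            headless: false, // devtools: true implies a visible (non-headless) browser anyway
            devtools: true   // auto-open DevTools for every new tab
        });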

Call the autoScroll method right before calling the callback in the exports.handler() method:

        // This function makes Chromium scroll down to the bottom of the page.
        await autoScroll(page);
        
        // Return the success response
        callback(null, "Success");

The function above uses a timer to scroll the page down by 100 pixels every 100 ms until the total scrolled distance reaches the page’s scroll height, i.e. the bottom of the page.

When you scroll down to the bottom of the page, this is what the GitHub page looks like at the time of writing:

We are interested in fetching the text from the list marked by the red rectangle. When we inspect this HTML in DevTools, we can see the elements we need to parse to get the text:

So we will get each of these feature texts.

Parsing HTML

To do that, we need a parsing library. We are going to use the excellent Cheerio library, an implementation of core jQuery designed for the server, to accomplish this.

Install the dependency:

npm install cheerio --save-prod
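As a quick, standalone illustration of the API (a toy example, not part of our Lambda function), Cheerio loads an HTML string and then lets you query it with jQuery-style selectors:

const cheerio = require('cheerio');

// Load a small HTML snippet and print the text of each list item.
const $ = cheerio.load('<ul><li>Alpha</li><li>Beta</li></ul>');
$('li').each(function (i, element) {
    console.log($(element).text()); // "Alpha", then "Beta"
});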

To parse the HTML, we will add the following function to the index.js file:

// Parse HTML according to our needs using the Cheerio library
async function extractHtml(page){
    // Mark the piece of HTML that we want to parse. This should be the parent of the HTML snippet 
    // that we want to process
    let mainHtml = await page.evaluate(el => el.innerHTML, await page.$('.col-lg-3.offset-md-1.offset-lg-2'));
    // Now load the above extracted HTML to Cheerio
    const $ = cheerio.load(mainHtml);
    // Create a new list in which we will save the extracted feature text
    let featureTexts = [];
    // Loop through the list of features at the end of the GitHub Enterprise page and process each 'li' element
    $(".list-style-none.f5.mb-2.text-white > li").each(function(i, element){
        let featureText = $(element).text();
        // Save the text in the list
        featureTexts.push(featureText);
    });
    return featureTexts;
}

Let’s go through the code above:

  • We grab the piece of HTML that we are interested in parsing and save it in the variable mainHtml.
  • We load the extracted HTML into Cheerio. The result is stored in the variable $.
  • We then loop through each li within the ul element that has the classes list-style-none f5 mb-2 text-white. Notice the . between the class names: in a Cheerio (CSS) selector, classes are chained with dots, whereas in the actual HTML class attribute they are separated by spaces (see the small comparison after this list).
  • We save the feature texts in a list and return it.
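To make that selector notation concrete, here is a toy comparison on made-up markup (not taken from the GitHub page):

const cheerio = require('cheerio');

// The class attribute separates class names with spaces...
const $ = cheerio.load('<ul class="list-style-none f5 mb-2 text-white"><li>Feature A</li><li>Feature B</li></ul>');

// ...but the CSS selector chains the same classes with dots and no spaces.
console.log($('.list-style-none.f5.mb-2.text-white > li').length); // 2

// A space between classes means "descendant", which is a different selector and matches nothing here.
console.log($('.list-style-none .f5 > li').length); // 0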

Call this extractHtml function right before calling the callback in the exports.handler() function in the index.js file. Our entire index.js file now looks like this:

const chromium = require('chrome-aws-lambda');
const cheerio = require('cheerio');

exports.handler = async (event, context, callback) => {
    var baseUrl = 'https://github.com/enterprise';
    
    let browser = null;
    
    try {
        // Initialize and configure the Chromium browser
        browser = await chromium.puppeteer.launch({
            args: chromium.args,
            defaultViewport: chromium.defaultViewport,
            executablePath: await chromium.executablePath,
            headless: true,
            devtools: false
        });
        
        // Start a new page
        let page = await browser.newPage();
        
        await page.setViewport({ width: 1200, height: 800 });
        // Navigate and wait until the page has fully loaded
        await page.goto(baseUrl, { waitUntil: ['load', 'domcontentloaded', 'networkidle0'] });
        
        // This function makes Chromium scroll down to the bottom of the page.
        await autoScroll(page);

        // Now we extract and parse the HTML that we require
        let featuresList = await extractHtml(page);
        
        // Return the success response
        callback(null, featuresList);
    } catch (error) {
        console.log('Error while scraping ' + baseUrl + '. Error = ' + error.toString());
        callback(null, "Failure");
    } finally {
        if (browser !== null) {
          await browser.close();
        }
    }
};

async function autoScroll(page){
    await page.evaluate(async () => {
        await new Promise((resolve, reject) => {
            debugger;
            var totalHeight = 0;
            var distance = 100;
            var timer = setInterval(() => {
                var scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;
                
                if(totalHeight >= scrollHeight){
                    clearInterval(timer);
                    resolve();
                }
            }, 100);
        });
    });
}

// Parse HTML according to our needs using the Cheerio library
async function extractHtml(page){
    // Mark the piece of HTML that we want to parse. This should be the parent of the HTML snippet 
    // that we want to process
    let mainHtml = await page.evaluate(el => el.innerHTML, await page.$('.col-lg-3.offset-md-1.offset-lg-2'));
    // Now load the above extracted HTML to Cheerio
    const $ = cheerio.load(mainHtml);
    // Create a new list in which we will save the extracted feature text
    let featureTexts = [];
    // Loop through the list of features at the end of the GitHub Enterprise page and process each 'li' element
    $(".list-style-none.f5.mb-2.text-white > li").each(function(i, element){
        let featureText = $(element).text();
        // Save the text in the list
        featureTexts.push(featureText);
    });
    return featureTexts;
}

When you run the code from AWS Cloud9, this is the output that you get: