In the Node.js world, Puppeteer is the go-to library for web scraping, as it provides an API to control the Chromium browser. But if you are writing a Lambda function that needs Puppeteer, it won’t work out of the box: once bundled with the Chromium binary, your function blows past the 50 MB deployment package limit that Lambda allows. So, simply put, we can’t use plain Puppeteer in Lambda functions.

Solution
Enter chrome-aws-lambda. This library provides the same functionality as Puppeteer in Lambda, using the puppeteer-core library as a dependency. Let’s get started.
Note: This post assumes that you have sufficient knowledge of Lambda and Node.js.
To get started, install the dependencies:
npm install chrome-aws-lambda --save-prod
npm install puppeteer-core --save-prod
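One thing to watch out for (based on the chrome-aws-lambda project’s guidance, so double-check its README for the exact pairing): the major version of puppeteer-core should line up with the version of chrome-aws-lambda, because the bundled Chromium build only works with a matching Puppeteer release. A hypothetical package.json pairing might look like this:

    "dependencies": {
        "chrome-aws-lambda": "~10.1.0",
        "puppeteer-core": "~10.1.0"
    }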
In the index.js file for the Lambda function, paste the following code:
const chromium = require('chrome-aws-lambda');

exports.handler = async (event, context, callback) => {
    var baseUrl = 'https://github.com/enterprise';
    let browser = null;
    try {
        // Initialize and configure the Chromium browser
        browser = await chromium.puppeteer.launch({
            args: chromium.args,
            defaultViewport: chromium.defaultViewport,
            executablePath: await chromium.executablePath,
            headless: true,
            devtools: false
        });
        // Start a new page
        let page = await browser.newPage();
        await page.setViewport({ width: 1200, height: 800 });
        // Navigate and wait until the page has finished loading
        await page.goto(baseUrl, { waitUntil: ['load', 'domcontentloaded', 'networkidle0'] });
        // Return the success response
        callback(null, "Success");
    } catch (error) {
        console.log('Error while scanning ' + baseUrl + '. Error = ' + error.toString());
        callback(null, "Failure");
    } finally {
        if (browser !== null) {
            await browser.close();
        }
    }
};
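Before deploying, you may want to sanity-check the handler outside Lambda. The runner below is a minimal sketch, not part of the original setup: the file name local-test.js is made up, and running locally typically needs a locally available Chrome (for example by also installing the full puppeteer package as a dev dependency), since chrome-aws-lambda ships a Chromium build meant for the Lambda environment.

    // local-test.js -- hypothetical local runner for the handler above
    const { handler } = require('./index');

    // Invoke the handler with an empty event and context, mimicking Lambda's callback style
    handler({}, {}, (err, result) => {
        if (err) {
            console.error('Handler failed:', err);
            process.exit(1);
        }
        console.log('Handler result:', result);
    });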
Let’s go into the details. First up, the exports.handler method is async. We need this because we use several asynchronous Chromium operations and have to wait for them to finish using the await keyword.
Next we initialise the browser, fire up a new page and navigate to the GitHub Enterprise page. The line await page.goto(baseUrl, { waitUntil: ['load', 'domcontentloaded', 'networkidle0'] }); makes the browser wait until the DOM is loaded and there have been no network connections for at least 500 ms (networkidle0). More info is in the Puppeteer documentation.
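If waiting for full network idle feels too slow or too fragile, a lighter alternative (a sketch, not what this post uses) is to wait only for the specific element you are going to scrape:

    // Sketch: wait for the DOM plus the container we scrape later, instead of network idle
    await page.goto(baseUrl, { waitUntil: 'domcontentloaded' });
    await page.waitForSelector('.col-lg-3.offset-md-1.offset-lg-2', { timeout: 30000 });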
Autoscrolling
Next, let’s assume that we need to scroll down to the bottom of the page. Add the following method to the index.js file:
async function autoScroll(page){
    await page.evaluate(async () => {
        await new Promise((resolve, reject) => {
            //debugger;
            var totalHeight = 0;
            var distance = 100;
            var timer = setInterval(() => {
                var scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;
                if (totalHeight >= scrollHeight) {
                    clearInterval(timer);
                    resolve();
                }
            }, 100);
        });
    });
}
Notice the commented-out line debugger;. This is handy if you have set devtools to true in the launch configuration above: the debugger keyword automatically triggers a breakpoint at that location while the code is running.
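For local debugging you would flip those launch options. A minimal sketch, assuming you are running outside Lambda with a local Chrome/Chromium available (the exact executablePath handling depends on your local setup):

    // Sketch: launch options for local debugging (not suitable for the deployed Lambda)
    browser = await chromium.puppeteer.launch({
        args: chromium.args,
        defaultViewport: chromium.defaultViewport,
        executablePath: await chromium.executablePath, // may need to point at a local Chrome when not on Lambda
        headless: false, // show the browser window
        devtools: true   // open DevTools for every tab, so `debugger;` statements pause execution
    });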
Call the autoScroll method right before calling the callback in the exports.handler() method:
// This function makes Chromium scroll down to the bottom of the page.
await autoScroll(page);
// Return the success response
callback(null, "Success");
The function above uses a setInterval timer to scroll down 100 pixels every 100 ms until the total scrolled distance reaches the page’s scroll height.
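One thing to be careful about: on pages that keep growing as you scroll (infinite scroll), this loop could run until the Lambda function times out. Here is a variation with a hypothetical step cap, as a sketch:

    // Sketch: same scrolling logic, but stop after a maximum number of scroll steps
    async function autoScrollWithLimit(page, maxSteps = 200) {
        await page.evaluate(async (maxSteps) => {
            await new Promise((resolve) => {
                let totalHeight = 0;
                let steps = 0;
                const distance = 100;
                const timer = setInterval(() => {
                    const scrollHeight = document.body.scrollHeight;
                    window.scrollBy(0, distance);
                    totalHeight += distance;
                    steps += 1;
                    if (totalHeight >= scrollHeight || steps >= maxSteps) {
                        clearInterval(timer);
                        resolve();
                    }
                }, 100);
            });
        }, maxSteps);
    }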
When you scroll down to the bottom of the page, this GitHub page (at the time of writing) looks like this:

We are interested in fetching the text from the list marked by the red rectangle. When we inspect this HTML in DevTools, we see the elements we need to parse to get the text:

So we will get each of these feature texts.
Parsing HTML
To do that, we need a parsing library. We are going to use an excellent library called Cheerio, an implementation of core jQuery designed for the server.
Install the dependency:
npm install cheerio --save-prod
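If you have not used Cheerio before, here is a tiny standalone sketch (the HTML string is made up) showing how loading and jQuery-style selection work:

    // Sketch: basic Cheerio usage with a made-up HTML snippet
    const cheerio = require('cheerio');

    const $ = cheerio.load('<ul class="features"><li>SAML single sign-on</li><li>Audit log</li></ul>');

    // Class selectors are joined with dots, just like in CSS
    $('.features > li').each((i, el) => {
        console.log($(el).text()); // prints each feature text
    });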
To parse the HTML, we will add the following function to the index.js file:
// Parse the HTML according to our needs using the Cheerio library
async function extractHtml(page){
    // Grab the piece of HTML that we want to parse. This should be the parent of the HTML snippet
    // that we want to process
    let mainHtml = await page.evaluate(el => el.innerHTML, await page.$('.col-lg-3.offset-md-1.offset-lg-2'));
    // Now load the extracted HTML into Cheerio
    const $ = cheerio.load(mainHtml);
    // Create a new list in which we will save the extracted feature text
    let featureTexts = [];
    // Loop through the list of features at the end of the GitHub Enterprise page and process each 'li' element
    $(".list-style-none.f5.mb-2.text-white > li").each(function(i, element){
        let featureText = $(element).text();
        // Save the text in the list
        featureTexts.push(featureText);
    });
    return featureTexts;
}
Let’s go through the code above:
- We grab the piece of HTML that we are interested in parsing and save it in the variable mainHtml.
- We load that HTML into Cheerio. The result is stored in the variable $.
- We then loop through each li within the ul element that has the classes "list-style-none f5 mb-2 text-white". This is how you do it in Cheerio. (Notice the dots in the selector: Cheerio, like CSS, joins class names with a . instead of the spaces used in the HTML class attribute.)
- We save the feature texts in a list and return it. (A Cheerio-free alternative is sketched right after this list.)
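For completeness, here is the Cheerio-free alternative mentioned above: a sketch that extracts the same texts directly in the browser context with Puppeteer’s page.$$eval, so no HTML string has to leave the page.

    // Sketch: extract the feature texts without Cheerio, using page.$$eval
    async function extractFeatureTexts(page) {
        return page.$$eval('.list-style-none.f5.mb-2.text-white > li', (items) =>
            items.map((li) => li.textContent.trim())
        );
    }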
Call this extractHtml function right before calling the callback in the exports.handler() function in the index.js file. Our entire index.js file now looks like this:
const chromium = require('chrome-aws-lambda');
const cheerio = require('cheerio');

exports.handler = async (event, context, callback) => {
    var baseUrl = 'https://github.com/enterprise';
    let browser = null;
    try {
        // Initialize and configure the Chromium browser
        browser = await chromium.puppeteer.launch({
            args: chromium.args,
            defaultViewport: chromium.defaultViewport,
            executablePath: await chromium.executablePath,
            headless: true,
            devtools: false
        });
        // Start a new page
        let page = await browser.newPage();
        await page.setViewport({ width: 1200, height: 800 });
        // Navigate and wait until the page has finished loading
        await page.goto(baseUrl, { waitUntil: ['load', 'domcontentloaded', 'networkidle0'] });
        // This function makes Chromium scroll down to the bottom of the page.
        await autoScroll(page);
        // Now we extract and parse the HTML that we require
        let featuresList = await extractHtml(page);
        // Return the success response
        callback(null, featuresList);
    } catch (error) {
        console.log('Error while scanning ' + baseUrl + '. Error = ' + error.toString());
        callback(null, "Failure");
    } finally {
        if (browser !== null) {
            await browser.close();
        }
    }
};

async function autoScroll(page){
    await page.evaluate(async () => {
        await new Promise((resolve, reject) => {
            debugger;
            var totalHeight = 0;
            var distance = 100;
            var timer = setInterval(() => {
                var scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;
                if (totalHeight >= scrollHeight) {
                    clearInterval(timer);
                    resolve();
                }
            }, 100);
        });
    });
}

// Parse the HTML according to our needs using the Cheerio library
async function extractHtml(page){
    // Grab the piece of HTML that we want to parse. This should be the parent of the HTML snippet
    // that we want to process
    let mainHtml = await page.evaluate(el => el.innerHTML, await page.$('.col-lg-3.offset-md-1.offset-lg-2'));
    // Now load the extracted HTML into Cheerio
    const $ = cheerio.load(mainHtml);
    // Create a new list in which we will save the extracted feature text
    let featureTexts = [];
    // Loop through the list of features at the end of the GitHub Enterprise page and process each 'li' element
    $(".list-style-none.f5.mb-2.text-white > li").each(function(i, element){
        let featureText = $(element).text();
        // Save the text in the list
        featureTexts.push(featureText);
    });
    return featureTexts;
}
When you run the code from AWS Cloud9, this is the output that you get:
