For my Puppeteer setup, I follow this basic structure:
```javascript
const puppeteer = require("puppeteer");

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  /* utilize the page */
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
```
The `finally` block ensures that the browser closes properly even if an error occurs, and errors are logged as needed. Chaining `.catch` and `.finally` calls keeps the mainline Puppeteer code neat and achieves the same outcome as below:
```javascript
const puppeteer = require("puppeteer");

(async () => {
  let browser;
  try {
    browser = await puppeteer.launch();
    const [page] = await browser.pages();
    /* utilize the page */
  }
  catch (err) {
    console.error(err);
  }
  finally {
    await browser?.close();
  }
})();
```
There's no need to call `newPage`, since Puppeteer opens with a page already.
Regarding Express, simply drop the entire code snippet above into your route, including `let browser;` and excluding `require("puppeteer")`. You may want to consider using an async middleware error handler.
You might wonder: is there a more efficient method than Puppeteer and headless Chrome for achieving similar results? That depends on your specific requirements and your definition of "better." If you only need to extract `document.body.innerHTML` from static HTML, ditching Puppeteer in favor of making a plain HTTP request and parsing the response with Cheerio could be an alternative.
Additionally, you can optimize resource usage by avoiding opening and closing a new browser per request. Consider following this approach:
```javascript
const express = require("express");
const puppeteer = require("puppeteer");

const asyncHandler = fn => (req, res, next) =>
  Promise.resolve(fn(req, res, next)).catch(next);

const browserReady = puppeteer.launch({
  args: ["--no-sandbox", "--disable-setuid-sandbox"]
});

const app = express();

app
  .set("port", process.env.PORT || 5000)
  .get("/", asyncHandler(async (req, res) => {
    const browser = await browserReady;
    const page = await browser.newPage();

    try {
      await page.goto(req.query.url || "http://www.example.com");
      return res.send(await page.content());
    }
    catch (err) {
      return res.status(400).send(err.message);
    }
    finally {
      await page.close();
    }
  }))
  .use((err, req, res, next) => res.sendStatus(500))
  .listen(app.get("port"), () =>
    console.log("listening on port", app.get("port"))
  );
```
Lastly, avoid setting timeouts to 0 (e.g., `page.setDefaultNavigationTimeout(0);`), which can leave a script hanging indefinitely. If a longer timeout is necessary, set it to a reasonable duration, a few minutes at most.
Check out these resources too:
- Parallelism of Puppeteer with Express Router Node JS. How to pass page between routes while maintaining concurrency
- Puppeteer unable to run on heroku