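How to Run a Headless Browser With Puppeteer in NodeJS

Using a headless browser in NodeJS lets developers control Chrome with code, which provides extra functionality for interacting with web pages and simulating human behavior.

Today, we'll look at how to use Puppeteer, the most popular headless browser library in the NodeJS ecosystem, for web scraping.

What Is a Headless Browser in NodeJS?

A headless browser in NodeJS is an automated browser that runs without a graphical user interface (GUI), so it consumes fewer resources and runs faster. It lets your JavaScript code render pages and perform actions the way a human would (submitting forms, scrolling, and so on).

How to Run a Headless Browser in NodeJS With Puppeteer

Now that you know what a headless browser is, let's dive into using Puppeteer to run one, interact with elements on a page, and scrape data.

As the target site, we'll use ScrapeMe, a Pokémon store.

Prerequisites

Make sure NodeJS is installed before continuing (npm comes with it).

Create a new directory and initialize a project in it with npm init -y. Then, install Puppeteer with the following command: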

npm i puppeteer

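Note: Puppeteer downloads a recent, compatible version of Chromium when you run the install command. If you'd rather set things up manually, which is useful when you want to connect to a remote browser or manage the browser yourself, use the puppeteer-core package instead. It doesn't download Chromium by default.

Then, create a new file named scraper.js in the headless browser JavaScript project you initialized above.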

touch scraper.js
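If you go the puppeteer-core route mentioned in the note above, you have to tell it which browser to use. Here's a minimal sketch of that idea. The openBrowser helper and its wsEndpoint and executablePath parameters are hypothetical names used for illustration, and the values you pass depend entirely on your own setup.

```javascript
// Sketch: get a browser from puppeteer-core, which ships without a bundled browser.
// `puppeteerCore` is the module returned by require('puppeteer-core').
// `wsEndpoint` and `executablePath` are hypothetical values you supply yourself.
async function openBrowser(puppeteerCore, { wsEndpoint, executablePath } = {}) {
  if (wsEndpoint) {
    // Attach to a browser that is already running somewhere else
    return puppeteerCore.connect({ browserWSEndpoint: wsEndpoint });
  }
  if (executablePath) {
    // Start a locally installed browser that you manage yourself
    return puppeteerCore.launch({ executablePath });
  }
  throw new Error('Provide either wsEndpoint or executablePath');
}
```

You would call it as openBrowser(require('puppeteer-core'), { executablePath: '/path/to/your/chrome' }) and then use the returned browser exactly as in the rest of this tutorial.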

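Step 1: Open the Page

Let's start by opening the website we want to scrape. To do that, launch a browser instance, create a new page, and navigate to our target site.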

const puppeteer = require('puppeteer');

(async () => {
  // Launches a browser instance
  const browser = await puppeteer.launch();
  // Creates a new page in the default browser context
  const page = await browser.newPage();
  // Navigates to the page to be scraped 
  const response = await page.goto('https://scrapeme.live/shop/');

  // logs the status of the request to the page
  console.log('Request status: ', response?.status(), '\n');

  // Closes the browser instance
  await browser.close();
})();

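Note: The close() method is called at the end to shut down Chromium and all of its pages.

Run the code with node scraper in your terminal. It logs the status code of the request to ScrapeMe.

Congratulations! A status of 200 means your request succeeded. Now you're ready to do some scraping.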
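One thing to keep in mind: the script above only reaches browser.close() if every earlier step succeeds. If something throws along the way, Chromium can be left running. A common way around that is a try/finally wrapper. The withBrowser helper below is a hypothetical name, and this is just one possible sketch of the pattern.

```javascript
// Sketch: make sure browser.close() runs even when the scraping code throws.
// `launch` is any function that returns a browser, e.g. () => puppeteer.launch().
// `task` receives the browser and does the actual work.
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    // Runs on success and on failure alike
    await browser.close();
  }
}
```

With Puppeteer installed, usage would look like withBrowser(() => puppeteer.launch(), async (browser) => { const page = await browser.newPage(); /* ... */ }).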

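Step 2: Scrape the Data

Our goal is to scrape all the Pokémon names on the homepage and display them in a list. Here's what you need to do:

Go to ScrapeMe in your regular browser and find any Pokémon card. Then right-click the creature's name and select "Inspect" to open Chrome DevTools. The browser highlights the selected element.

The selected element holding the Pokémon name is an h2 with the class woocommerce-loop-product__title. If you inspect the other names on the page, you'll see they all share that class. We can use it to target all the name elements and then scrape them.

The Puppeteer Page API provides several methods for selecting elements on a page. One example is Page.$$eval(selector, pageFunction, args), where $$eval() runs document.querySelectorAll against its first argument, the selector. It then passes the result to its second argument, the callback pageFunction, for further processing.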
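To make that description concrete, here's a tiny stand-in that mimics what $$eval() does conceptually. It is not Puppeteer's implementation: the real method runs pageFunction inside the browser page and can only send serializable values back to NodeJS. The evalAll function and the fake document are made up for this sketch so that it runs in plain NodeJS.

```javascript
// Conceptual stand-in for Page.$$eval(); NOT Puppeteer's real implementation.
// `doc` plays the role of the page's `document` object.
function evalAll(doc, selector, pageFunction, ...args) {
  const nodes = Array.from(doc.querySelectorAll(selector));
  return pageFunction(nodes, ...args);
}

// Hypothetical fake document so the sketch runs without a browser
const fakeDocument = {
  querySelectorAll: (selector) =>
    selector === '.woocommerce-loop-product__title'
      ? [{ textContent: 'Bulbasaur' }, { textContent: 'Ivysaur' }]
      : [],
};

const names = evalAll(
  fakeDocument,
  '.woocommerce-loop-product__title',
  (nodes) => nodes.map((n) => n.textContent)
);
console.log(names); // [ 'Bulbasaur', 'Ivysaur' ]
```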

Let's put this to use. Update your scraper.js file with the following code:

const puppeteer = require('puppeteer');

(async () => {
  // Launches a browser instance
  const browser = await puppeteer.launch();
  // Creates a new page in the default browser context
  const page = await browser.newPage();

  // remove timeout limit
  page.setDefaultNavigationTimeout(0); 

  // Navigates to the page to be scraped 
  await page.goto('https://scrapeme.live/shop/');

  // gets an array of all Pokemon names
  const names = await page.$$eval('.woocommerce-loop-product__title', (nodes) => nodes.map((n) => n.textContent));
  
  console.log('Number of Pokemon: ', names.length);
  console.log('List of Pokemon: ', names.join(', '), '\n');

  // Closes the browser instance
  await browser.close();
})();

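As in the previous example, we create a browser instance and a page. This time, though, page.setDefaultNavigationTimeout(0); sets the navigation timeout to zero, which disables the timeout (and its errors) instead of using the default of 30,000 milliseconds.

In addition, n.textContent gets the text of each node with the class woocommerce-loop-product__title, and the $$eval() function returns an array of Pokémon names.

Finally, the code logs the number of Pokémon scraped and builds a comma-separated list of names.

Run the script again, and it prints the number of Pokémon followed by the list of names.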
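Disabling the navigation timeout is fine for a demo, but it also means a page that never finishes loading will hang the script forever. An alternative is to keep a finite timeout and handle the error. The sketch below assumes that Puppeteer's timeout errors have the name TimeoutError. The gotoWithTimeout helper and the 60-second value are illustrative choices, not recommendations.

```javascript
// Sketch: navigate with a finite timeout and report a timeout instead of crashing.
// `timeoutMs` is an illustrative value.
async function gotoWithTimeout(page, url, timeoutMs = 60000) {
  try {
    const response = await page.goto(url, { timeout: timeoutMs });
    // goto() may resolve with null, so guard before reading the status
    return { timedOut: false, status: response ? response.status() : null };
  } catch (error) {
    // Assumption: timeout errors are identified by their name
    if (error.name === 'TimeoutError') {
      return { timedOut: true, status: null };
    }
    // Anything else is unexpected, so let it propagate
    throw error;
  }
}
```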

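Next, let's see how to interact with web pages using Puppeteer, an extra capability that headless browsers give us.

Interacting With Elements on the Page

The Page API has several methods for interacting with elements on a page. For example, the Page.type(selector, text) method sends keydown, keypress/input, and keyup events for each character in the text.

Take a look at the search bar at the top right of the ScrapeMe site, which we can use for this. Inspect it in DevTools.

The search field has the ID woocommerce-product-search-field-0. We can use it to select the element and trigger input events on it. To do so, add the following code between the page.goto() and browser.close() calls in your scraper.js file.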

const searchFieldSelector = '#woocommerce-product-search-field-0';

const getSearchFieldValue = async () => await page.$eval(searchFieldSelector, el => el.value);

console.log('Search field value before: ', await getSearchFieldValue());
// type instantly into the search field
await page.type(searchFieldSelector, 'Vulpix');
console.log('Search field value after: ', await getSearchFieldValue());

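We use the page.type() method to type the word "Vulpix" into the field.

Rerun the scraper file. It logs the value of the search field before and after typing.

The search box's value changes, which shows that the input events fired successfully.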
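Typing instantly works, but it doesn't look very human, and it doesn't actually run the search. Here's a sketch that types with a pause between keystrokes and then presses Enter. It assumes that pressing Enter in the search field submits the form and triggers a navigation, which you should verify on the target site. The searchFor helper and the 100 ms delay are illustrative.

```javascript
// Sketch: type like a person, then submit the search with Enter.
// `delayMs` is an illustrative value; pass the selector from the tutorial.
async function searchFor(page, selector, term, delayMs = 100) {
  // `delay` is the wait in milliseconds between key presses
  await page.type(selector, term, { delay: delayMs });
  // Assumption: Enter submits the form and causes a navigation.
  // Start waiting before pressing so the navigation is not missed.
  await Promise.all([
    page.waitForNavigation(),
    page.keyboard.press('Enter'),
  ]);
  // URL of the page we landed on
  return page.url();
}
```

In the tutorial's script, this would be called as await searchFor(page, searchFieldSelector, 'Vulpix').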

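Advanced Headless Browsing With Puppeteer in NodeJS

In this section, you'll learn how to up your Puppeteer headless browser game.

Taking Screenshots

Imagine you want to grab a screenshot, for example to visually check that your scraper is working properly. The good news is that Puppeteer can take one when you call the screenshot() method.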

// Takes a screenshot of the search results
await page.screenshot({ path: 'search-result.png' });
console.log('Screenshot taken');

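Note: The path option specifies where the screenshot is saved and its file name.

Run the scraper file again. When it finishes, an image file named search-result.png is created in the directory you ran the script from (here, the project root).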
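Since a fixed path overwrites the same file on every run, you may want a unique name per screenshot, and you may want the whole page rather than just the visible viewport. Here's one possible sketch using the fullPage option. The screenshotName helper and its naming scheme are made up for illustration.

```javascript
// Build a name such as "search-result-2024-01-31T12-00-00-000Z.png".
// The naming scheme is a hypothetical choice, not a Puppeteer convention.
function screenshotName(label, date = new Date()) {
  // Colons and dots are awkward in file names, so swap them for dashes
  const stamp = date.toISOString().replace(/[:.]/g, '-');
  return `${label}-${stamp}.png`;
}

// Capture the whole scrollable page instead of only the viewport
async function takeFullPageScreenshot(page, label) {
  const file = screenshotName(label);
  await page.screenshot({ path: file, fullPage: true });
  return file;
}
```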

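Waiting for Content to Load

When web scraping, it's best practice to wait for the whole page, or the part of it you need, to load so that all the content is present. Let's look at an example that shows why.

Say you want to get the description of the first Pokémon on ScrapeMe's homepage. To do that, we can simulate a click on its image, which triggers the load of another page that contains its description.

Inspecting the Pokémon's image on the homepage reveals a link with the classes woocommerce-LoopProduct-link and woocommerce-loop-product__link.

And on the page that loads after clicking the Pokémon's image, the description is in a div element with the class woocommerce-product-details__short-description.

We'll use these classes as selectors for the elements. Replace the code between the page.goto() and browser.close() calls with the following: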

// Selectors
const pokemonDetailsSelector = '.woocommerce-product-details__short-description',
  pokemonLinkSelector = '.woocommerce-LoopProduct-link.woocommerce-loop-product__link';
// Clicks on the first Pokemon image link (triggers a new page load)
await page.$$eval(pokemonLinkSelector, (links) => links[0]?.click());
// Gets the content of the description from the element
const description = await page.$eval(pokemonDetailsSelector, (node) => node.textContent);
// Logs the description of the Pokemon
console.log('Description: ', description);

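There, the $$eval() method selects all the available Pokémon links and clicks the first one, and the $eval() method targets the description element and gets its content.

Now it's time to run the scraper. Unfortunately, we get an error instead of the description.

That's because Puppeteer tries to get the description element before it has loaded.

To fix this, add the waitForSelector(selector) method to wait for the description element's selector. This method resolves only once the description is available. We could also wait for the page load with waitForNavigation. Either works, but we recommend waiting for a selector whenever possible.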

// Selectors
const pokemonDetailsSelector = '.woocommerce-product-details__short-description',
  pokemonLinkSelector = '.woocommerce-LoopProduct-link.woocommerce-loop-product__link';
// Clicks on the first Pokemon image link (triggers a new page load)
await page.$$eval(pokemonLinkSelector, (links) => links[0]?.click());
// Waits for the element with the description of the Pokemon
await page.waitForSelector(pokemonDetailsSelector);
// Gets the content of the description from the element
const description = await page.$eval(pokemonDetailsSelector, (node) => node.textContent);
// Logs the description of the Pokemon
console.log('Description: ', description);

Run the scraper again. This time there's no error, and the Pokémon's description is logged.
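For completeness, here's what the waitForNavigation alternative mentioned above could look like. The important detail is to start waiting before the click, otherwise a fast navigation can finish before you begin listening for it. The clickAndWaitForNavigation helper is a hypothetical name, and page.click() is used here in place of the $$eval() click from the example.

```javascript
// Sketch of the waitForNavigation alternative.
// Both promises are created before either is awaited, so the wait
// is already in place when the click triggers the page load.
async function clickAndWaitForNavigation(page, selector) {
  const [response] = await Promise.all([
    page.waitForNavigation(),
    // page.click() clicks the first element matching the selector
    page.click(selector),
  ]);
  return response;
}
```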

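Scraping Multiple Pages

Remember the list of Pokémon we scraped earlier?

We can also scrape each one's description from its own page.

To do that, build an array of Pokémon names and links and loop over it. Update the code between the page.goto() and browser.close() calls: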

// Selectors
const pokemonDetailsSelector = '.woocommerce-product-details__short-description',
  pokemonLinkSelector = '.woocommerce-LoopProduct-link.woocommerce-loop-product__link';
// Get a  list of Pokemon names and links
const list = await page.$$eval(pokemonLinkSelector,
  ((links) => links.map(link => {
      return {
          name: link.querySelector('h2').textContent,
          link: link.href
      };
  }))
);
// Visit each product page in turn and log its description
for (const { name, link } of list) {
  // goto() already waits for the navigation to finish
  await page.goto(link);
  // Then make sure the description element is present
  await page.waitForSelector(pokemonDetailsSelector);
  const description = await page.$eval(pokemonDetailsSelector, (node) => node.textContent);
  console.log(name + ': ' + description);
}

When you run the scraper file, you should start seeing the creatures and their descriptions in your terminal.
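The loop above stops at the first page that fails to load, and it only prints its results. If you'd rather collect the data and keep going when a single page has a problem, something along these lines could work. The scrapeDescriptions helper is a hypothetical name; the { name, link } shape matches the list built in the example.

```javascript
// Sketch: collect descriptions into an array and continue when one page fails.
async function scrapeDescriptions(page, list, selector) {
  const results = [];
  for (const { name, link } of list) {
    try {
      await page.goto(link);
      await page.waitForSelector(selector);
      // trim() drops the whitespace around the description text
      const description = await page.$eval(selector, (node) => node.textContent.trim());
      results.push({ name, description });
    } catch (error) {
      // Record the failure instead of aborting the whole run
      results.push({ name, error: error.message });
    }
  }
  return results;
}
```

In the tutorial's script, you would call it as const results = await scrapeDescriptions(page, list, pokemonDetailsSelector).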

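Optimizing Puppeteer Scripts

Like most tools, Puppeteer can be tuned to improve its overall speed and performance. Here are some ways to do that:

Blocking Unnecessary Requests

Blocking requests you don't need reduces the number of requests the browser makes. In Puppeteer, you can set up an interceptor for unwanted resource types.

Since we've only been working with the HTML document when targeting ScrapeMe, it makes sense to block other resource types, such as images or stylesheets.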

// Allows interception of requests
await page.setRequestInterception(true);
// Listens for requests being triggered
page.on('request', (request) => {
  if (request.resourceType() === 'document') {
    // Allow the request to be made
    request.continue();
  } else {
    // Cancel request
    request.abort();
  }
});
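The interceptor above allows only document requests, which also blocks scripts and XHR calls. That's fine for ScrapeMe, but it would break sites that render their content with JavaScript. A gentler variant is to block only a list of heavy resource types. The list below is an assumption chosen for illustration, so adjust it to what your target site actually needs.

```javascript
// Resource types to skip; this particular list is an illustrative assumption.
const BLOCKED_TYPES = new Set(['image', 'stylesheet', 'font', 'media']);

// Abort blocked resource types and let everything else through
function handleRequest(request) {
  if (BLOCKED_TYPES.has(request.resourceType())) {
    return request.abort();
  }
  return request.continue();
}

// Turn interception on and register the handler for a page
async function blockHeavyResources(page) {
  await page.setRequestInterception(true);
  page.on('request', handleRequest);
}
```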

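Caching Resources

Caching resources saves the Puppeteer headless browser from requesting them again. By default, each new browser instance creates a temporary directory for its user data, which includes the cache.

We can specify a permanent directory shared by all browser instances by passing the userDataDir option to the puppeteer.launch() method.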

// Launches a browser instance
const browser = await puppeteer.launch({
  userDataDir: './user_data',
});

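Setting the Headless Mode

The headless option is true by default. Changing it to false stops Puppeteer from running in headless mode; instead, it runs with a GUI.

Puppeteer lets you set the browser mode through the headless option of the puppeteer.launch() method.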

// Launches a browser instance
const browser = await puppeteer.launch({
  headless: false,
});
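In practice, you'll probably want headless mode for normal runs and a visible browser only while debugging. One way to switch between them without editing the script is an environment variable. In the sketch below, the HEADFUL variable name and the slowMo value of 100 ms are hypothetical choices; slowMo slows down each Puppeteer operation so you can follow along in the window.

```javascript
// Sketch: derive launch options from an environment variable.
// HEADFUL and the slowMo value are hypothetical, illustrative choices.
function buildLaunchOptions(env = process.env) {
  const headful = env.HEADFUL === '1';
  return {
    // Show the browser window only when explicitly requested
    headless: !headful,
    // Slow every operation down a little when watching the browser
    slowMo: headful ? 100 : 0,
    // Reuse the cache directory from the previous section
    userDataDir: './user_data',
  };
}
```

You would then launch with puppeteer.launch(buildLaunchOptions()) and run HEADFUL=1 node scraper when you want to watch the browser work.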

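Avoiding Getting Blocked With Puppeteer

A common problem for web scrapers is getting blocked, since many websites have measures in place to stop visitors that behave like bots. Here are some ways to prevent that:

Use proxies.
Throttle your requests.
Use a valid User-Agent.
Mimic user behavior.
Use Puppeteer's Stealth plugin.
Use a web scraping API like ZenRows.

For more in-depth information, check out our guide on how to avoid detection with Puppeteer.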
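To give a rough idea of what the first three items can look like in code, here's a minimal sketch. The proxy address, the delay range, and the helper names are all hypothetical, and none of this is a guarantee against being blocked.

```javascript
// Launch options that route traffic through a proxy.
// --proxy-server is a Chromium command-line flag; the address is yours to supply.
function proxyLaunchOptions(proxyUrl) {
  return { args: [`--proxy-server=${proxyUrl}`] };
}

// Random pause length between minMs and maxMs (inclusive), for throttling.
// `random` can be swapped out to make the result predictable.
function randomDelayMs(minMs, maxMs, random = Math.random) {
  return minMs + Math.floor(random() * (maxMs - minMs + 1));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Set a User-Agent, pause for a random interval, then navigate
async function politeGoto(page, url, userAgent, minMs = 1000, maxMs = 3000) {
  await page.setUserAgent(userAgent);
  await sleep(randomDelayMs(minMs, maxMs));
  return page.goto(url);
}
```

You would pass proxyLaunchOptions('http://your-proxy:port') to puppeteer.launch() and call politeGoto() in place of page.goto() inside your scraping loop.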

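Conclusion

In this tutorial, we looked at what a headless browser in NodeJS is. More specifically, you now know how to use Puppeteer for headless browser web scraping and can benefit from its advanced features.

However, running Puppeteer at scale and avoiding blocks will prove challenging.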
