Playwright vs. Puppeteer: Which Should You Choose?

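The Playwright vs. Puppeteer debate is a big one because both are excellent Node.js libraries for browser automation. Although they do almost the same thing, Puppeteer and Playwright have some notable differences.

Let's take a quick look at the history:

The Chrome development team created Puppeteer in 2017 to make up for Selenium's unreliability in browser automation.

Microsoft later launched Playwright. Like Puppeteer, it can efficiently run complex automated tests in the browser, but this time the team introduced more tooling for the test environment.

So which one is best?

Let's look at the differences between Puppeteer and Playwright and see what makes each library unique.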

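Puppeteer and Playwright: What Are the Main Differences?

Puppeteer and Playwright are libraries that drive headless browsers, originally designed for end-to-end automated testing of web applications. They are also used for other purposes, such as web scraping.

Although they have similar use cases, some of the main differences between the two automation tools are: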

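Playwright officially supports JavaScript, Python, Java and C#, with a community-maintained port for Go, whereas Puppeteer supports only JavaScript, plus an unofficial Python port.
Playwright supports three browsers: Chromium, Firefox and WebKit, whereas Puppeteer is built around Chromium.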

Playwright

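Playwright is an end-to-end web testing and automation library. Although the framework's main job is testing web applications, you can also use it for web scraping.

What are Playwright's advantages?

Through a single API, the library lets you test with Chromium, Firefox or WebKit. Beyond that, the cross-platform framework runs fast on Windows, Linux and macOS.
Playwright officially supports JavaScript, Python, Java and C#, with a community-maintained port for Go.
Playwright generally runs faster than many testing frameworks, such as Cypress.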

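What are Playwright's disadvantages?

Playwright has no official support for languages such as Ruby and Go; those rely on community-maintained ports.
Playwright emulates mobile devices with desktop browsers rather than running on real devices.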

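Playwright browser options

Browser options and page methods control the test environment.

Headless: This determines whether you see the browser during the test. By default, the value is true, so the browser runs without a visible window. Set it to false to watch the browser during the test.
SlowMo: Slow motion slows down the actions on the page. For example, a value of 500 delays each operation by 500 milliseconds.
DevTools: You can open Chrome DevTools when launching the target page. Note that this option only works with Chromium.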

await playwright.chromium.launch({ devtools: true })
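
The three options above are often toggled together while debugging. As a minimal sketch (the helper name buildLaunchOptions and its debug and slowMoMs parameters are assumptions made for illustration, not part of Playwright), you could derive them from a single switch:

```javascript
// Hypothetical helper: derives the launch options from one "debug" switch.
const buildLaunchOptions = ({ debug = false, slowMoMs = 500 } = {}) => ({
    // headless: true hides the browser window, false shows it
    headless: !debug,
    // slowMo delays each operation by the given number of milliseconds
    slowMo: debug ? slowMoMs : 0,
    // devtools opens Chrome DevTools with the page (Chromium only)
    devtools: debug,
})

console.log(buildLaunchOptions({ debug: true }))
```

You would then pass the result to the launcher, for example playwright.chromium.launch(buildLaunchOptions({ debug: true })), assuming playwright has been imported as shown in the scraping tutorial.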

Playwright page object methods

Below are some of the methods that control the launched page.

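Method	Description
goto()	Navigates the page to a URL.
reload()	Reloads the page.
evaluate()	Runs a JavaScript function inside the page and returns its result to Node.js, which lets you select elements and manipulate the DOM. Alternatively, you can use `$eval()`, `$$eval()`, `$()` and `$$()`.
screenshot()	Takes a screenshot of the page.
setDefaultTimeout()	Sets how long the browser waits for an operation before throwing an error.
keyboard.press()	Presses the key you specify.
waitForSelector()	Pauses execution until an element matching the selector appears.
locator()	Creates a locator that finds elements using a selector or a combination of selectors.
click()	Clicks the element that matches the given selector.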
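
To show how several of these methods fit together, here is a small sketch. The function name snapshotHeading, the h1 selector and the page.png file name are placeholders chosen for this example; page is assumed to be a Playwright page object created with browser.newPage().

```javascript
// Sketch: visit a page, wait for its heading, read the text and take a screenshot.
const snapshotHeading = async (page, url) => {
    // throw an error if any later operation takes longer than 10 seconds
    page.setDefaultTimeout(10000)
    // navigate to the target URL
    await page.goto(url)
    // wait until the heading exists in the DOM
    await page.waitForSelector('h1')
    // evaluate() runs the callback inside the browser page, not in Node.js
    const heading = await page.evaluate(() => document.querySelector('h1').textContent.trim())
    // save a screenshot of the page (placeholder file name)
    await page.screenshot({ path: 'page.png' })
    return heading
}
```

Inside an async function you would call it as const heading = await snapshotHeading(page, 'https://example.com'), where the URL is again only a placeholder.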


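Web scraping with Playwright

As a quick tutorial to support the Playwright vs. Puppeteer debate, let's use Playwright to scrape product titles, prices and image URLs from Vue Storefront and save the results in a CSV file.

Start by importing Playwright and the file system (fs) module, which saves the scraped data to a CSV file.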

import playwright from 'playwright' // web scraping 
import fs from 'fs' // saving data to CSV

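Remember to set the module type ("type": "module") in your package.json file; otherwise, the import syntax won't work.

Because Playwright's API is asynchronous and the await keyword works only inside async functions (or at the top level of a module), you can create an async main function and write the scraper code inside it.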

const main = async () => { 
    // write some code 
} 
main()

The next step is to launch the browser and create a new page, so let's go ahead and launch Chromium in headed mode.

const browser = await playwright.chromium.launch({ headless: false })

The browser is now open, and we're halfway there. Create a page object with the browser's newPage() method.

const page = await browser.newPage()

To scrape Vue Storefront's product details, visit the Kitchen category page with the items sorted by newest.

await page.goto('https://demo.vuestorefront.io/c/kitchen?sort=NEWEST')

Alternatively, you can have the scraper locate and click each element in turn until it reaches the target page.
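
A rough sketch of that click-through approach follows. The helper name clickThrough and the selectors in the usage comment are hypothetical; you would replace them with selectors taken from the real page.

```javascript
// Sketch: reach the target page by clicking a series of elements in order.
const clickThrough = async (page, selectors) => {
    for (const selector of selectors) {
        // wait for the element to appear before interacting with it
        await page.waitForSelector(selector)
        // click it, then continue with the next selector
        await page.click(selector)
    }
}

// Usage inside main(), with hypothetical selectors:
// await clickThrough(page, ['a.category-kitchen', 'button.sort-newest'])
```

Navigating directly with goto() is simpler when the URL is known, so this variant is mainly useful when the target page is only reachable through interaction.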

Let's create a CSV file and write its header row, with columns for the title, price and image URL we are about to scrape.

fs.writeFileSync('products.csv', 'title,price,imageUrl\n')

Select the product cards with page.$$(), then use a for-of loop to extract the title, price and image URL from each product element, as shown below:

// select the product card elements that contain the target data
const products = await page.$$('.products__grid > .sf-product-card.products__product-card')
for (const product of products) {
    // extract the title, price and image URL from each product card
    const title = await page.evaluate(e => e.querySelector('.sf-product-card__title').textContent.trim(), product)
    const price = await page.evaluate(e => e.querySelector('.sf-price__regular').textContent.trim(), product)
    const imageUrl = await page.evaluate(e => e.querySelector('.sf-image.sf-image-loaded').src, product)
    // append one row per product; the synchronous call keeps the rows in order
    fs.appendFileSync('products.csv', `${title},${price},${imageUrl}\n`)
}
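
One caveat with writing rows as plain comma-joined strings is that a title or price containing a comma would split into extra columns. A minimal sketch of a quoting helper is shown below; the name toCsvRow is an assumption for this example, and the sample values are made up.

```javascript
// Hypothetical helper: turns an array of values into one CSV line.
// Fields containing commas, quotes or line breaks are wrapped in double quotes,
// and embedded double quotes are doubled, following common CSV conventions.
const toCsvRow = fields =>
    fields
        .map(field => {
            const text = String(field ?? '')
            return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text
        })
        .join(',') + '\n'

console.log(toCsvRow(['Pan, 24 cm', '$1,299.00', 'https://example.com/pan.jpg']))
```

In the loop above, you could then write the row with toCsvRow([title, price, imageUrl]) instead of the template string.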

Close the browser, then run the script file.

await browser.close()

Here is what the complete code looks like.

// in index.js

// Import the modules: playwright (web scraping) and fs (saving data to CSV)
import playwright from 'playwright'
import fs from 'fs'

// create an asynchronous main function
const main = async () => {
    // launch a visible (headed) Chromium browser
    const browser = await playwright.chromium.launch({ headless: false })

    // create a new page object
    const page = await browser.newPage()
    // visit the target page
    await page.goto('https://demo.vuestorefront.io/c/kitchen?sort=NEWEST')
    // create the CSV file and write its header row
    fs.writeFileSync('products.csv', 'title,price,imageUrl\n')

    // select the product card elements that contain the target data
    const products = await page.$$('.products__grid > .sf-product-card.products__product-card')
    // loop through the product cards
    for (const product of products) {
        // extract the title, price and image URL from each product card
        const title = await page.evaluate(e => e.querySelector('.sf-product-card__title').textContent.trim(), product)
        const price = await page.evaluate(e => e.querySelector('.sf-price__regular').textContent.trim(), product)
        const imageUrl = await page.evaluate(e => e.querySelector('.sf-image.sf-image-loaded').src, product)
        // append one row per product; the synchronous call keeps the rows in order
        fs.appendFileSync('products.csv', `${title},${price},${imageUrl}\n`)
    }
    // close the browser once the scraping is done
    await browser.close()
}

// don't forget to run the main() function
main()

Puppeteer

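Puppeteer is an automation library for JavaScript (Node.js) that, unlike Playwright, downloads and uses Chromium by default. It focuses more on Chrome DevTools, which makes it one of the go-to libraries for web scraping.

What are Puppeteer's advantages?

Puppeteer makes it easy to get started with browser automation. It controls Chrome using the nonstandard DevTools protocol.

What are Puppeteer's disadvantages?

Puppeteer supports only JavaScript (Node.js).
At the time of writing, Firefox support is still in development, so Puppeteer supports only Chromium.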

Browser options in Puppeteer

Most of Playwright's browser options are also available in Puppeteer: headless, slowMo and devtools can all be used.

await puppeteer.launch({ headless: false, slowMo: 500, devtools: true })

Page object methods in Puppeteer

Likewise, most of Playwright's page object methods also work in Puppeteer. Here are some of them.

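Method	Description
goto()	Navigates the page to a URL.
goForward()	Goes forward to the next page in the history.
goBack()	Goes back to the previous page.
reload()	Reloads the page.
evaluate()	Runs a JavaScript function inside the page and returns its result to Node.js, which lets you select elements and manipulate the DOM. Alternatively, you can use `$eval()`, `$$eval()`, `$()` and `$$()`.
screenshot()	Takes a screenshot of the page.
setDefaultTimeout() or setDefaultNavigationTimeout()	Sets how long the browser waits for an operation before throwing an error.
keyboard.press()	Presses the key you specify.
waitForSelector()	Pauses execution until an element matching the selector appears.
waitFor()	Delays subsequent actions. Note that this method is deprecated in newer Puppeteer releases.
locator()	Creates a locator that finds elements using a selector or a combination of selectors.
click()	Clicks the element that matches the given selector.
select()	Selects an option in a select element.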
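
Because waitForSelector() throws when its timeout expires, scrapers often need to handle elements that may never appear. The sketch below wraps the call so a timeout yields null instead of an exception. The helper name waitForOptional is hypothetical, and checking error.name against 'TimeoutError' is an assumption about how the library names its timeout errors, so verify it against the version you use.

```javascript
// Sketch: wait for an element, but return null instead of throwing on timeout.
const waitForOptional = async (page, selector, timeoutMs = 5000) => {
    try {
        // resolves with a handle to the element once it appears
        return await page.waitForSelector(selector, { timeout: timeoutMs })
    } catch (error) {
        // assumption: timeout errors carry the name 'TimeoutError'
        if (error.name === 'TimeoutError') return null
        // anything else is unexpected, so pass it on
        throw error
    }
}

// Usage inside main(), with a hypothetical selector:
// const banner = await waitForOptional(page, '.cookie-banner', 2000)
// if (banner) await banner.click()
```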


Web scraping with Puppeteer

To scrape a web page with Puppeteer, import the puppeteer module for scraping and the fs module for saving the scraped data to a CSV file.

import puppeteer from 'puppeteer' // web scraping 
import fs from 'fs' // saving scraped data

Create an async function to drive the browser.

const main = async () => { 
    // write some code 
} 
main()

Now launch the browser (in headed mode here, so you can watch it) and create a new page.

const browser = await puppeteer.launch({ headless: false }) 
const page = await browser.newPage()

Using the goto() method, visit the target page before scraping the data.

await page.goto('https://demo.vuestorefront.io/c/kitchen?sort=NEWEST')

Next, create a CSV file to store the scraped data.

fs.writeFileSync('products.csv', 'title,price,imageUrl\n')

Select the product cards, then use a for-of loop to extract the product title, price and image URL before appending the data to the CSV file.

// select the product card elements that contain the target data
const products = await page.$$('.products__grid > .sf-product-card.products__product-card')
for (const product of products) {
    // extract the title, price and image URL from each product card
    const title = await page.evaluate(e => e.querySelector('.sf-product-card__title').textContent.trim(), product)
    const price = await page.evaluate(e => e.querySelector('.sf-price__regular').textContent.trim(), product)
    const imageUrl = await page.evaluate(e => e.querySelector('.sf-image.sf-image-loaded').src, product)
    // append one row per product; the synchronous call keeps the rows in order
    fs.appendFileSync('products.csv', `${title},${price},${imageUrl}\n`)
}

Finally, close the browser and run the script.

await browser.close()

The complete code looks like this:

// Import the modules: puppeteer (web scraping) and fs (saving data to CSV)
import puppeteer from 'puppeteer'
import fs from 'fs'

// create an asynchronous main function
const main = async () => {
    // launch a visible (headed) Chromium browser
    const browser = await puppeteer.launch({ headless: false })

    // create a new page object
    const page = await browser.newPage()
    // visit the target page
    await page.goto('https://demo.vuestorefront.io/c/kitchen?sort=NEWEST')
    // create the CSV file and write its header row
    fs.writeFileSync('products.csv', 'title,price,imageUrl\n')

    // select the product card elements that contain the target data
    const products = await page.$$('.products__grid > .sf-product-card.products__product-card')
    // loop through the product cards
    for (const product of products) {
        // extract the title, price and image URL from each product card
        const title = await page.evaluate(e => e.querySelector('.sf-product-card__title').textContent.trim(), product)
        const price = await page.evaluate(e => e.querySelector('.sf-price__regular').textContent.trim(), product)
        const imageUrl = await page.evaluate(e => e.querySelector('.sf-image.sf-image-loaded').src, product)
        // append one row per product; the synchronous call keeps the rows in order
        fs.appendFileSync('products.csv', `${title},${price},${imageUrl}\n`)
    }
    // close the browser once the scraping is done
    await browser.close()
}

// don't forget to run the main() function
main()

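Playwright or Puppeteer: Which Is Faster?

Comparing the performance of Puppeteer and Playwright can be tricky, but let's find out which library comes out on top.

Let's create a third script file named performance.js and run the Playwright and Puppeteer code in it, timing how long each function takes to scrape the Vue Storefront data.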

// in performance.js 
 
const playwrightPerformance = async () => { 
    // START THE TIMER 
    console.time('Playwright') 
    // Playwright scraping code 
    // END THE TIMER 
    console.timeEnd('Playwright') 
} 
 
const puppeteerPerformance = async () => { 
    // START THE TIMER 
    console.time('Puppeteer') 
    // Puppeteer scraping code 
    // END THE TIMER 
    console.timeEnd('Puppeteer') 
} 
 
// run the benchmarks one after the other so they don't compete for resources
await playwrightPerformance()
await puppeteerPerformance()
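
If you would rather collect the durations as numbers than read them from the console, a small timing wrapper is one option. This is only a sketch: timeAsync and runBenchmarks are hypothetical helper names, and the tasks you pass in would be the two scraping functions.

```javascript
// Sketch: time an async task and return the elapsed seconds.
const timeAsync = async (label, task) => {
    const start = performance.now()
    await task()
    const seconds = (performance.now() - start) / 1000
    console.log(`${label}: ${seconds.toFixed(3)}s`)
    return seconds
}

// Run the tasks strictly one after another so they never overlap.
const runBenchmarks = async tasks => {
    const results = {}
    for (const [label, task] of Object.entries(tasks)) {
        results[label] = await timeAsync(label, task)
    }
    return results
}

// Usage, assuming the two scraping functions exist:
// const results = await runBenchmarks({ Playwright: scrapeWithPlaywright, Puppeteer: scrapeWithPuppeteer })
```

Running the tasks sequentially matters here because two browsers scraping at the same time would share CPU and network, which would distort both measurements.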

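We'll insert the Playwright and Puppeteer scraping code into their respective functions, switch to headless browsing, and then run the performance.js file five times to get an average run time.

Here is the average duration for each library: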

Playwright ➡️ (7.580 + 7.372 + 6.639 + 7.411 + 7.390) / 5 = 36.392 / 5 = 7.2784 s
Puppeteer ➡️ (6.656 + 6.653 + 6.856 + 6.592 + 6.839) / 5 = 33.596 / 5 = 6.7192 s
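
The averages above can be reproduced with a few lines of JavaScript. The run times are the ones listed in this article; the mean helper is just a convenience for the example.

```javascript
// Arithmetic mean of a list of numbers.
const mean = values => values.reduce((sum, value) => sum + value, 0) / values.length

// Run times in seconds, copied from the five test runs above.
const playwrightRuns = [7.580, 7.372, 6.639, 7.411, 7.390]
const puppeteerRuns = [6.656, 6.653, 6.856, 6.592, 6.839]

console.log(`Playwright: ${mean(playwrightRuns).toFixed(4)}s`) // prints Playwright: 7.2784s
console.log(`Puppeteer: ${mean(puppeteerRuns).toFixed(4)}s`)   // prints Puppeteer: 6.7192s
```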

Puppeteer beat Playwright on speed in this test.

It's worth noting that these results come from our own tests. If you want to run your own, follow the mini guide shared above.

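Is Playwright Better Than Puppeteer?

Overall, no Puppeteer vs. Playwright comparison will give you a straight answer about which is the better choice. It depends on several factors, such as long-term library support, cross-browser support, and your specific browser automation needs.

Here are some of the notable features of Playwright and Puppeteer: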

Feature	Playwright	Puppeteer
Supported languages	Python, Java, JavaScript and C#	JavaScript
Supported browsers	Chromium, Firefox and WebKit	Chromium
Speed	Fast	Faster


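Conclusion

As you can see, Playwright and Puppeteer each have their strengths, so you should consider the specifics of your scraping project and your own needs before choosing one of the libraries.

However, a common problem in web scraping is that some websites detect bots and block headless browsing, especially when you click buttons and send many requests in quick succession. One workable solution is to introduce a timer before subsequent actions.

For example, you can program Puppeteer to mimic a human user by waiting 0.1 seconds after entering the details in a login form before clicking the button. However, the downside of multiple timers is that they slow down your browsing, and most websites can still detect them.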
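
As a minimal sketch of such a timer, the helper below pauses for a random duration instead of a fixed one. The names sleep, randomBetween and humanPause are made up for this example, and the selectors in the usage comment are hypothetical. As noted above, pauses like this are not guaranteed to avoid detection.

```javascript
// Resolve after the given number of milliseconds.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Pick a random duration between minMs (inclusive) and maxMs (exclusive).
const randomBetween = (minMs, maxMs) => minMs + Math.random() * (maxMs - minMs)

// Pause for a random, human-like interval (defaults: 100 to 400 ms).
const humanPause = (minMs = 100, maxMs = 400) => sleep(randomBetween(minMs, maxMs))

// Usage inside main(), with hypothetical selectors:
// await page.type('#email', 'user@example.com')
// await humanPause()
// await page.click('button[type=submit]')
```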
