What Is the Difference Between Playwright and Selenium?
It's easy to feel lost when choosing between Playwright and Selenium for web scraping, since both are popular open-source automation tools.
It's important to consider your scraping needs and criteria, such as supported languages, documentation, and browser support.
Let's get into the details. We'll go over their pros and cons, along with a real example of how to scrape a web page with Playwright and with Selenium.
Playwright
Playwright is an end-to-end web testing and automation library developed by Microsoft. Although the framework's main role is testing web applications, it can also be used for web scraping.
What are the advantages of Playwright?
The advantages of Playwright are:
- It supports all modern rendering engines, including Chromium, WebKit, and Firefox.
- Playwright can be used on Windows, Linux, macOS, or in CI.
- It supports TypeScript, JavaScript (Node.js), Python, .NET, and Java.
- Playwright executes faster than Selenium.
- Playwright supports auto-waiting and performs relevant checks on elements.
- You can generate selectors by inspecting the page, and generate scenarios by recording your actions.
- Playwright supports concurrent execution and can also block unnecessary resource requests.
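The request-blocking idea can be sketched as below. This is a minimal sketch: the `_FakeRoute` class is a hypothetical stand-in for Playwright's `Route` object so the filter logic can run without a browser; with a real page you would register the handler via `page.route("**/*", handle_route)` before calling `page.goto()`.

```python
# Resource types that are usually unnecessary when scraping text content.
BLOCKED_TYPES = {"image", "font", "media"}

def handle_route(route):
    # Abort requests for heavy resources; let everything else through.
    if route.request.resource_type in BLOCKED_TYPES:
        route.abort()
    else:
        route.continue_()

# With a real Playwright page, install the filter before navigating:
#   page.route("**/*", handle_route)
#   page.goto("https://www.scrapethissite.com/pages/simple/")

class _FakeRequest:
    def __init__(self, resource_type):
        self.resource_type = resource_type

class _FakeRoute:
    # Hypothetical stand-in for Playwright's Route, used only to exercise the filter.
    def __init__(self, resource_type):
        self.request = _FakeRequest(resource_type)
        self.action = None

    def abort(self):
        self.action = "abort"

    def continue_(self):
        self.action = "continue"

image_route = _FakeRoute("image")
handle_route(image_route)
print(image_route.action)  # abort

document_route = _FakeRoute("document")
handle_route(document_route)
print(document_route.action)  # continue
```

Blocking images and fonts this way can noticeably cut page-load time on media-heavy sites, since the scraper only needs the HTML.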
What are the disadvantages of Playwright?
The disadvantages of Playwright are:
- It can only handle emulators, not real devices.
- Playwright doesn't have as large a community as Selenium.
- It doesn't work with legacy browsers and devices.
Web Scraping with Playwright
Let's walk through a quick Playwright web scraping tutorial to compare Playwright's and Selenium's scraping capabilities. We'll extract the 250 table entries from the first page of Scrape This Site.
Start by importing the required package and initializing the browser instance:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # launch the browser instance and define a new context
    browser = p.chromium.launch()
    context = browser.new_context()
```
Navigate to the target web page using the page.goto() method:

```python
page = context.new_page()
page.goto("https://www.scrapethissite.com/pages/simple/")
```
Since each table entry lives in a div with the class country, locate the div elements with the page.locator() method using a CSS class selector. Also, store the number of matching elements so you can loop over them later:

```python
countries = page.locator("div.country")
n_countries = countries.count()
```
The next step is to extract the name, capital, population, and area with an extract_data() method, like this:

```python
def extract_data(entry):
    name = entry.locator("h3").inner_text().strip("\n").strip()
    capital = entry.locator("span.country-capital").inner_text()
    population = entry.locator("span.country-population").inner_text()
    area = entry.locator("span.country-area").inner_text()
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area}
```
Extract the data using the extract_data function, then close the browser instance:

```python
data = []
for i in range(n_countries):
    entry = countries.nth(i)
    sample = extract_data(entry)
    data.append(sample)

browser.close()
```
Congratulations! You've successfully scraped a web page with Playwright. Your output should look like this:
```
[
    {'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': '84000', 'area (km sq)': '468.0'},
    {'name': 'United Arab Emirates', 'capital': 'Abu Dhabi', 'population': '4975593', 'area (km sq)': '82880.0'},
    {'name': 'Afghanistan', 'capital': 'Kabul', 'population': '29121286', 'area (km sq)': '647500.0'},
    {'name': 'Antigua and Barbuda', 'capital': "St. John's", 'population': '86754', 'area (km sq)': '443.0'},
    {'name': 'Anguilla', 'capital': 'The Valley', 'population': '13254', 'area (km sq)': '102.0'},
    ...
]
```
If you got lost at any point, here's the complete Playwright code:
```python
from playwright.sync_api import sync_playwright

def extract_data(entry):
    name = entry.locator("h3").inner_text().strip("\n").strip()
    capital = entry.locator("span.country-capital").inner_text()
    population = entry.locator("span.country-population").inner_text()
    area = entry.locator("span.country-area").inner_text()
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area}

with sync_playwright() as p:
    # launch the browser instance and define a new context
    browser = p.chromium.launch()
    context = browser.new_context()

    # open a new tab and go to the website
    page = context.new_page()
    page.goto("https://www.scrapethissite.com/pages/simple/")
    page.wait_for_load_state("load")

    # get the countries
    countries = page.locator("div.country")
    n_countries = countries.count()

    # loop through the elements and scrape the data
    data = []
    for i in range(n_countries):
        entry = countries.nth(i)
        sample = extract_data(entry)
        data.append(sample)

    browser.close()
```
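Note that the numeric fields come back as strings. As a sketch of a possible post-processing step (the `clean_record` helper is hypothetical, not part of the original script, and assumes the record format shown above), you could convert them to numbers before further analysis:

```python
def clean_record(rec):
    # Convert the scraper's string fields to numeric types (hypothetical helper).
    return {
        "name": rec["name"],
        "capital": rec["capital"],
        "population": int(rec["population"]),
        "area (km sq)": float(rec["area (km sq)"]),
    }

sample = {'name': 'Andorra', 'capital': 'Andorra la Vella',
          'population': '84000', 'area (km sq)': '468.0'}
print(clean_record(sample))
# {'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': 84000, 'area (km sq)': 468.0}
```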
Selenium
Selenium is one of the most popular open-source tools for web scraping and web automation. When scraping with Selenium, you can automate browsers, interact with UI elements, and imitate user actions on web applications. Some of Selenium's core components include WebDriver, Selenium IDE, and Selenium Grid.
What are the advantages of Selenium?
The advantages of Selenium are:
- It's easy to use.
- It can automate a wide range of browsers, including IE, and even mobile browsers and mobile apps by using Appium.
- It supports many programming languages, such as Java, C#, Python, Perl, JavaScript, and Ruby.
- It runs on Windows, macOS, and Linux.
What are the disadvantages of Selenium?
The disadvantages of Selenium are:
- Compared with Playwright, Selenium requires third-party tools for parallel execution.
- There's no built-in reporting support. For example, if you need video recording, you have to use an external solution.
- Scraping data from multiple tabs in Selenium is cumbersome.
- It doesn't generate execution reports for debugging.
Web Scraping with Selenium
Just like we did for Playwright, let's build a simple web scraper with Selenium. To do that, import the necessary modules and configure the Selenium instance. Make sure headless mode is active by setting options.headless = True:
```python
# required selenium modules
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
# web driver manager: https://github.com/SergeyPirogov/webdriver_manager
# will help us automatically download the web driver binaries
# then we can use `Service` to manage the web driver's state.
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.headless = True
```
Install the web driver with WebDriverManager, then initialize the Chrome service and define the driver instance:
```python
# this returns the path of the downloaded web driver
chrome_path = ChromeDriverManager().install()
chrome_service = Service(chrome_path)
driver = webdriver.Chrome(service=chrome_service, options=options)
```
Navigate to the web page and find the div elements that store the countries:

```python
url = "https://www.scrapethissite.com/pages/simple/"
driver.get(url)

# get the data divs
countries = driver.find_elements(By.CSS_SELECTOR, "div.country")
```
Define a function to extract the data:

```python
def extract_data(row):
    name = row.find_element(By.TAG_NAME, "h3").text.strip("\n").strip()
    capital = row.find_element(By.CSS_SELECTOR, "span.country-capital").text
    population = row.find_element(By.CSS_SELECTOR, "span.country-population").text
    area = row.find_element(By.CSS_SELECTOR, "span.country-area").text
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area}
```
Apply the map function to extract the values, then quit the web driver instance:

```python
# process the extracted data
data = list(map(extract_data, countries))

driver.quit()
```
Congratulations! Here's the output after running the script:
```
[
    {'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': '84000', 'area (km sq)': '468.0'},
    {'name': 'United Arab Emirates', 'capital': 'Abu Dhabi', 'population': '4975593', 'area (km sq)': '82880.0'},
    {'name': 'Afghanistan', 'capital': 'Kabul', 'population': '29121286', 'area (km sq)': '647500.0'},
    {'name': 'Antigua and Barbuda', 'capital': "St. John's", 'population': '86754', 'area (km sq)': '443.0'},
    {'name': 'Anguilla', 'capital': 'The Valley', 'population': '13254', 'area (km sq)': '102.0'},
    ...
]
```
The complete code looks like this:
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
# web driver manager: https://github.com/SergeyPirogov/webdriver_manager
# will help us automatically download the web driver binaries
# then we can use `Service` to manage the web driver's state.
from webdriver_manager.chrome import ChromeDriverManager

def extract_data(row):
    name = row.find_element(By.TAG_NAME, "h3").text.strip("\n").strip()
    capital = row.find_element(By.CSS_SELECTOR, "span.country-capital").text
    population = row.find_element(By.CSS_SELECTOR, "span.country-population").text
    area = row.find_element(By.CSS_SELECTOR, "span.country-area").text
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area}

options = webdriver.ChromeOptions()
options.headless = True

# this returns the path of the downloaded web driver
chrome_path = ChromeDriverManager().install()

# define the chrome service and pass it to the driver instance
chrome_service = Service(chrome_path)
driver = webdriver.Chrome(service=chrome_service, options=options)

url = "https://www.scrapethissite.com/pages/simple"
driver.get(url)

# get the data divs
countries = driver.find_elements(By.CSS_SELECTOR, "div.country")

# extract the data
data = list(map(extract_data, countries))

driver.quit()
```
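To persist the scraped records, you could serialize them with the standard library's csv module. This is a sketch assuming the list-of-dicts output shown above; the `to_csv` helper is hypothetical:

```python
import csv
import io

def to_csv(rows):
    # Render the scraped dicts as CSV text; assumes all rows share the same keys.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = [
    {'name': 'Andorra', 'capital': 'Andorra la Vella',
     'population': '84000', 'area (km sq)': '468.0'},
]
print(to_csv(rows))
```

To write straight to disk instead, pass a file object opened with `newline=""` to `csv.DictWriter` rather than the `StringIO` buffer.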
Which Is Faster: Playwright or Selenium?
If we're talking about speed in the Selenium vs. Playwright debate, there's only one answer: Playwright is faster than Selenium. But by how much?
To compare the speed of Selenium and Playwright, we used the time module and slightly tweaked the scripts to include timing calculations. We added start_time = time.time() and end_time = time.time() at the top and bottom of each script, then calculated the difference with end_time - start_time.
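The timing pattern boils down to the following minimal sketch, where `time.sleep` stands in for the scraper body:

```python
import time

start_time = time.time()

# ... the scraping work goes here ...
time.sleep(0.1)  # stand-in for the scraper body

end_time = time.time()
print(f"The whole script took: {end_time - start_time:.4f}")
```

Note that this measures wall-clock time, including browser startup, so results vary between runs and machines.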
Here's the script for Playwright:
```python
import time
from playwright.sync_api import sync_playwright

def extract_data(entry):
    name = entry.locator("h3").inner_text().strip("\n").strip()
    capital = entry.locator("span.country-capital").inner_text()
    population = entry.locator("span.country-population").inner_text()
    area = entry.locator("span.country-area").inner_text()
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area}

start = time.time()

with sync_playwright() as p:
    # launch the browser instance and define a new context
    browser = p.chromium.launch()
    context = browser.new_context()

    # open a new tab and go to the website
    page = context.new_page()
    page.goto("https://www.scrapethissite.com/pages/")

    # click the link to the first page and wait while the page loads
    page.locator("a[href='/pages/simple/']").click()
    page.wait_for_load_state("load")

    # get the countries
    countries = page.locator("div.country")
    n_countries = countries.count()

    data = []
    for i in range(n_countries):
        entry = countries.nth(i)
        sample = extract_data(entry)
        data.append(sample)

    browser.close()

end = time.time()
print(f"The whole script took: {end-start:.4f}")
```
And here's the script for Selenium:
```python
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
# web driver manager: https://github.com/SergeyPirogov/webdriver_manager
# will help us automatically download the web driver binaries
# then we can use `Service` to manage the web driver's state.
from webdriver_manager.chrome import ChromeDriverManager

def extract_data(row):
    name = row.find_element(By.TAG_NAME, "h3").text.strip("\n").strip()
    capital = row.find_element(By.CSS_SELECTOR, "span.country-capital").text
    population = row.find_element(By.CSS_SELECTOR, "span.country-population").text
    area = row.find_element(By.CSS_SELECTOR, "span.country-area").text
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area}

# start the timer
start = time.time()

options = webdriver.ChromeOptions()
options.headless = True

# this returns the path of the downloaded web driver
chrome_path = ChromeDriverManager().install()

# define the chrome service and pass it to the driver instance
chrome_service = Service(chrome_path)
driver = webdriver.Chrome(service=chrome_service, options=options)

url = "https://www.scrapethissite.com/pages/"
driver.get(url)

# get the first page and click the link
first_page = driver.find_element(By.CSS_SELECTOR, "h3.page-title a")
first_page.click()

# get the data divs
countries = driver.find_elements(By.CSS_SELECTOR, "div.country")

# scrape the data using the extract_data function
data = list(map(extract_data, countries))

end = time.time()
print(f"The whole script took: {end-start:.4f}")

driver.quit()
```
We added these timers to their respective scrapers and ran the code. And there you have it! The results of our Playwright vs. Selenium speed test show that Playwright is roughly five times faster than Selenium.
Selenium vs. Playwright: Which Is Better?
Both Playwright and Selenium are excellent automation tools, capable of scraping web pages seamlessly when done right. However, picking between them can be a headache, and the best option depends on your web scraping needs, the type of data you want to scrape, browser support, and other considerations.
To recap, here are some of the main differences between Selenium and Playwright:
- Playwright doesn't support real devices, whereas Selenium can be used with real devices and remote servers.
- Playwright has built-in support for parallelization, whereas Selenium requires third-party tools.
- Playwright executes faster than Selenium.
- Selenium doesn't support features such as detailed reporting and video recording, whereas Playwright has built-in support for them.
- Selenium supports more browsers than Playwright.
- Selenium supports more programming languages.
One of the main pain points of web scrapers built on frameworks like Playwright or Selenium is scalability, since they can trigger anti-bot security measures and get blocked. One of the best ways to avoid that is to use a web scraping API, such as ZenRows, which can bypass anti-bot measures while scraping web pages.