What Is the Difference Between Playwright and Selenium?

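It's easy to get lost when choosing between Playwright and Selenium for web scraping, since both are popular open-source automation tools.

It's important to weigh your scraping needs and criteria, such as supported languages, documentation, and browser support.

Let's get into the details. We'll discuss the pros and cons of each tool and walk through a real example of scraping a web page with both Playwright and Selenium.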

Playwright

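Playwright is an end-to-end web testing and automation library developed by Microsoft. Although the framework's main purpose is testing web applications, it can also be used for web scraping.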

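What are the advantages of Playwright?

The advantages of Playwright are:

  • It supports all modern rendering engines, including Chromium, WebKit, and Firefox.
  • Playwright can be used on Windows, Linux, macOS, or in CI.
  • It supports TypeScript, JavaScript (Node.js), Python, .NET, and Java.
  • Playwright executes faster than Selenium.
  • Playwright waits automatically and runs the relevant checks on elements before acting on them.
  • You can generate selectors by inspecting a web page, and generate scenarios by recording your actions.
  • Playwright supports parallel execution and can block unnecessary resource requests.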
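
The last point, blocking unnecessary resource requests, can be sketched with Playwright's request routing. This is a minimal sketch rather than a definitive setup: the set of blocked types and the names BLOCKED_TYPES, handle_route, and open_page_without_media are assumptions made for illustration.

```python
# Resource types we assume the scraper does not need (an illustrative choice).
BLOCKED_TYPES = {"image", "media", "font"}


def handle_route(route):
    """Abort requests for blocked resource types and let everything else through."""
    if route.request.resource_type in BLOCKED_TYPES:
        route.abort()
    else:
        route.continue_()


def open_page_without_media(url):
    """Load `url` in headless Chromium with the filter installed and return the page title."""
    # Imported here so the filter above can be exercised without launching a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Send every request through the filter before navigating.
        page.route("**/*", handle_route)
        page.goto(url)
        title = page.title()
        browser.close()
    return title
```

Calling open_page_without_media("https://www.scrapethissite.com/pages/simple/") should load the page without downloading images, media, or fonts. How much time this saves depends on the page.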

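What are the disadvantages of Playwright?

The disadvantages of Playwright are:

  • It only works with emulators, not real devices.
  • Compared to Selenium, Playwright has a smaller community.
  • It doesn't work with legacy browsers and devices.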

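Web scraping with Playwright

Let's go through a quick Playwright web scraping tutorial so we can compare the scraping capabilities of Playwright and Selenium. We'll extract the 250 table entries from the first page of Scrape This Site.

First, import the required package and initialize a browser instance: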

from playwright.sync_api import sync_playwright 
 
with sync_playwright() as p: 
    # launch the browser instance and define a new context 
    browser = p.chromium.launch() 
    context = browser.new_context()

Open a new page and navigate to the target web page with the page.goto() method:

page = context.new_page() 
page.goto("https://www.scrapethissite.com/pages/simple/")

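Each table entry sits in a div with the class country, so locate those div elements with the page.locator() method and a CSS class selector. Also store the number of matching elements so you can loop over them later: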

countries = page.locator("div.country") 
n_countries = countries.count()

The next step is to define an extract_data() function that extracts the name, capital, population, and area of each country, like this:

def extract_data(entry): 
    # strip() removes the whitespace and newlines around the country name
    name = entry.locator("h3").inner_text().strip() 
    capital = entry.locator("span.country-capital").inner_text() 
    population = entry.locator("span.country-population").inner_text() 
    area = entry.locator("span.country-area").inner_text() 
 
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area}

Extract the data with the extract_data function, then close the browser instance:

data = [] 
 
for i in range(n_countries): 
    entry = countries.nth(i) 
    sample = extract_data(entry) 
    data.append(sample) 
 
browser.close()

Congratulations! You have successfully scraped a web page with Playwright. Your output should look like this:

[ 
    {'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': '84000', 'area (km sq)': '468.0'}, 
    {'name': 'United Arab Emirates', 'capital': 'Abu Dhabi', 'population': '4975593', 'area (km sq)': '82880.0'}, 
    {'name': 'Afghanistan', 'capital': 'Kabul', 'population': '29121286', 'area (km sq)': '647500.0'}, 
    {'name': 'Antigua and Barbuda', 'capital': "St. John's", 'population': '86754', 'area (km sq)': '443.0'}, 
    {'name': 'Anguilla', 'capital': 'The Valley', 'population': '13254', 'area (km sq)': '102.0'}, 
    ... 
]
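
Every value in this output is a string. If you want to sort or aggregate the records, one option is to convert the numeric fields and save the result as CSV. The following is a minimal sketch under stated assumptions: the helper names to_typed and to_csv are hypothetical, and it assumes every population parses as an integer and every area as a float, which may not hold for every row on the live page.

```python
import csv
import io

FIELDS = ["name", "capital", "population", "area (km sq)"]


def to_typed(record):
    """Return a copy of a scraped record with the numeric fields converted."""
    return {
        "name": record["name"],
        "capital": record["capital"],
        "population": int(record["population"]),
        "area (km sq)": float(record["area (km sq)"]),
    }


def to_csv(records):
    """Serialize a list of records to CSV text, with a header row first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Two rows copied from the sample output above.
sample = [
    {"name": "Andorra", "capital": "Andorra la Vella", "population": "84000", "area (km sq)": "468.0"},
    {"name": "Anguilla", "capital": "The Valley", "population": "13254", "area (km sq)": "102.0"},
]
typed = [to_typed(record) for record in sample]
csv_text = to_csv(typed)
```

In a real run you would pass the full data list instead of sample and write csv_text to a file.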

If you got lost at any point, here is the complete Playwright code:

from playwright.sync_api import sync_playwright 
 
def extract_data(entry): 
    # strip() removes the whitespace and newlines around the country name
    name = entry.locator("h3").inner_text().strip() 
    capital = entry.locator("span.country-capital").inner_text() 
    population = entry.locator("span.country-population").inner_text() 
    area = entry.locator("span.country-area").inner_text() 
 
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area} 
 
with sync_playwright() as p: 
    # launch the browser instance and define a new context 
    browser = p.chromium.launch() 
    context = browser.new_context() 
    # open a new tab and go to the website 
    page = context.new_page() 
    page.goto("https://www.scrapethissite.com/pages/simple/") 
    page.wait_for_load_state("load") 
    # get the countries 
    countries = page.locator("div.country") 
    n_countries = countries.count() 
 
    # loop through the elements and scrape the data 
    data = [] 
 
    for i in range(n_countries): 
        entry = countries.nth(i) 
        sample = extract_data(entry) 
        data.append(sample) 
 
    # close the browser while the Playwright session is still open
    browser.close()

Selenium

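Selenium is one of the most popular open-source tools for web scraping and web automation. When scraping with Selenium, you can automate a browser, interact with UI elements, and imitate user actions on web applications. Some of Selenium's core components are WebDriver, Selenium IDE, and Selenium Grid.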

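What are the advantages of Selenium?

The advantages of Selenium are:

  • It's easy to use.
  • Together with Appium, it can automate a wide range of browsers, including IE and mobile browsers, and even mobile apps.
  • It supports many programming languages, such as Java, C#, Python, Perl, JavaScript, and Ruby.
  • It runs on Windows, macOS, and Linux.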

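What are the disadvantages of Selenium?

The disadvantages of Selenium are:

  • Unlike Playwright, Selenium needs third-party tools for parallel execution.
  • It has no built-in reporting support. For example, if you need to record video, you have to use an external solution.
  • Scraping data from multiple tabs is cumbersome in Selenium.
  • It doesn't generate execution reports for debugging.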

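Web scraping with Selenium

Just as we did for Playwright, let's build a simple web scraper with Selenium. To do so, import the necessary modules and configure the Selenium instance. Make sure headless mode is enabled in the Chrome options: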

# required selenium modules 
from selenium import webdriver 
from selenium.webdriver.chrome.service import Service 
from selenium.webdriver.common.by import By 
# web driver manager: https://github.com/SergeyPirogov/webdriver_manager 
# will help us automatically download the web driver binaries 
# then we can use `Service` to manage the web driver's state. 
from webdriver_manager.chrome import ChromeDriverManager 
 
options = webdriver.ChromeOptions() 
# run Chrome without a visible window; the old `options.headless`
# attribute is no longer honored by recent Selenium releases
options.add_argument("--headless=new")

Install the web driver with WebDriverManager, then initialize the Chrome service and define the driver instance:

# this returns the path web driver downloaded 
chrome_path = ChromeDriverManager().install() 
chrome_service = Service(chrome_path) 
driver = webdriver.Chrome(service=chrome_service, options=options)

Navigate to the web page and find the div elements that hold the countries:

url = "https://www.scrapethissite.com/pages/simple/" 
driver.get(url) 
 
# get the data divs 
countries = driver.find_elements(By.CSS_SELECTOR, "div.country")

Define a function to extract the data:

def extract_data(row): 
    # strip() removes the whitespace and newlines around the country name
    name = row.find_element(By.TAG_NAME, "h3").text.strip() 
    capital = row.find_element(By.CSS_SELECTOR, "span.country-capital").text 
    population = row.find_element(By.CSS_SELECTOR, "span.country-population").text 
    area = row.find_element(By.CSS_SELECTOR, "span.country-area").text 
 
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area}

Apply the function to every element with map to extract the values, then quit the web driver instance:

# process the extracted data 
data = list(map(extract_data, countries)) 
 
driver.quit()

Congratulations! Here is the output after running the script:

[ 
    {'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': '84000', 'area (km sq)': '468.0'}, 
    {'name': 'United Arab Emirates', 'capital': 'Abu Dhabi', 'population': '4975593', 'area (km sq)': '82880.0'}, 
    {'name': 'Afghanistan', 'capital': 'Kabul', 'population': '29121286', 'area (km sq)': '647500.0'}, 
    {'name': 'Antigua and Barbuda', 'capital': "St. John's", 'population': '86754', 'area (km sq)': '443.0'}, 
    {'name': 'Anguilla', 'capital': 'The Valley', 'population': '13254', 'area (km sq)': '102.0'}, 
    ... 
]

The complete code looks like this:

from selenium import webdriver 
from selenium.webdriver.chrome.service import Service 
from selenium.webdriver.common.by import By 
# web driver manager: https://github.com/SergeyPirogov/webdriver_manager 
# will help us automatically download the web driver binaries 
# then we can use `Service` to manage the web driver's state. 
from webdriver_manager.chrome import ChromeDriverManager 
 
def extract_data(row): 
    # strip() removes the whitespace and newlines around the country name
    name = row.find_element(By.TAG_NAME, "h3").text.strip() 
    capital = row.find_element(By.CSS_SELECTOR, "span.country-capital").text 
    population = row.find_element(By.CSS_SELECTOR, "span.country-population").text 
    area = row.find_element(By.CSS_SELECTOR, "span.country-area").text 
 
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area} 
 
options = webdriver.ChromeOptions() 
# run Chrome without a visible window
options.add_argument("--headless=new") 
# this returns the path web driver downloaded 
chrome_path = ChromeDriverManager().install() 
# define the chrome service and pass it to the driver instance 
chrome_service = Service(chrome_path) 
driver = webdriver.Chrome(service=chrome_service, options=options) 
 
url = "https://www.scrapethissite.com/pages/simple/" 
 
driver.get(url) 
# get the data divs 
countries = driver.find_elements(By.CSS_SELECTOR, "div.country") 
 
# extract the data 
data = list(map(extract_data, countries)) 
 
driver.quit()

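Which is faster: Playwright or Selenium?

When it comes to a speed comparison between Selenium and Playwright, there is only one answer: Playwright is faster than Selenium. But by how much?

To compare the two, we used the time module and adjusted the scripts slightly to include a timing calculation. We added start = time.time() near the top and end = time.time() near the bottom of each script, then computed the difference with end - start.

Here is the script for Playwright: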

import time 
from playwright.sync_api import sync_playwright 
 
def extract_data(entry): 
    # strip() removes the whitespace and newlines around the country name
    name = entry.locator("h3").inner_text().strip() 
    capital = entry.locator("span.country-capital").inner_text() 
    population = entry.locator("span.country-population").inner_text() 
    area = entry.locator("span.country-area").inner_text() 
 
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area} 
 
start = time.time() 
with sync_playwright() as p: 
    # launch the browser instance and define a new context 
    browser = p.chromium.launch() 
    context = browser.new_context() 
    # open a new tab and go to the website 
    page = context.new_page() 
    page.goto("https://www.scrapethissite.com/pages/") 
    # click to the first page and wait while page loads 
    page.locator("a[href='/pages/simple/']").click() 
    page.wait_for_load_state("load") 
    # get the countries 
    countries = page.locator("div.country") 
    n_countries = countries.count() 
 
    data = [] 
 
    for i in range(n_countries): 
        entry = countries.nth(i) 
        sample = extract_data(entry) 
        data.append(sample) 
 
    # close the browser while the Playwright session is still open
    browser.close() 
end = time.time() 
 
print(f"The whole script took: {end-start:.4f}")

Here is the script for Selenium:

import time 
from selenium import webdriver 
from selenium.webdriver.chrome.service import Service 
from selenium.webdriver.common.by import By 
# web driver manager: https://github.com/SergeyPirogov/webdriver_manager 
# will help us automatically download the web driver binaries 
# then we can use `Service` to manage the web driver's state. 
from webdriver_manager.chrome import ChromeDriverManager 
 
def extract_data(row): 
    # strip() removes the whitespace and newlines around the country name
    name = row.find_element(By.TAG_NAME, "h3").text.strip() 
    capital = row.find_element(By.CSS_SELECTOR, "span.country-capital").text 
    population = row.find_element(By.CSS_SELECTOR, "span.country-population").text 
    area = row.find_element(By.CSS_SELECTOR, "span.country-area").text 
 
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area} 
 
# start the timer 
start = time.time() 
 
options = webdriver.ChromeOptions() 
# run Chrome without a visible window
options.add_argument("--headless=new") 
# this returns the path web driver downloaded 
chrome_path = ChromeDriverManager().install() 
# define the chrome service and pass it to the driver instance 
chrome_service = Service(chrome_path) 
driver = webdriver.Chrome(service=chrome_service, options=options) 
 
url = "https://www.scrapethissite.com/pages/" 
 
driver.get(url) 
# get the first page and click to the link 
first_page = driver.find_element(By.CSS_SELECTOR, "h3.page-title a") 
first_page.click() 
# get the data divs 
countries = driver.find_elements(By.CSS_SELECTOR, "div.country") 
 
# scrape the data using extract_data function 
data = list(map(extract_data, countries)) 
 
end = time.time() 
 
print(f"The whole script took: {end-start:.4f}") 
 
driver.quit()

We ran each script with its own scraper. There you have it! In our speed test, Playwright was about 5 times faster than Selenium.
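
A single timed run can be skewed by network latency or a one-off driver download, so it can help to repeat each measurement and compare medians. The helper below is a minimal sketch of that idea, not the method used for the result above. The names time_runs, summarize, and speedup are hypothetical, and it assumes each scraper has been wrapped in a function that takes no arguments.

```python
import statistics
import time


def time_runs(func, repeats=5):
    """Call `func` `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return the median and the fastest run from a list of durations."""
    return {"median": statistics.median(durations), "min": min(durations)}


def speedup(slow_durations, fast_durations):
    """Return how many times faster the second set of runs is, comparing medians."""
    return statistics.median(slow_durations) / statistics.median(fast_durations)
```

With two hypothetical wrappers run_selenium and run_playwright, speedup(time_runs(run_selenium), time_runs(run_playwright)) gives the ratio between the two median run times.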

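Selenium vs Playwright: which is better?

Playwright and Selenium are both excellent automation tools that can scrape web pages seamlessly when used correctly. Choosing the right one can still be a headache, so the best choice depends on your web scraping needs, the type of data you want to scrape, browser support, and other considerations.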

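To recap, here are some of the main differences between Selenium and Playwright:

  • Playwright doesn't support real devices, while Selenium can be used with real devices and remote servers.
  • Playwright has built-in parallelization support, while Selenium needs third-party tools.
  • Playwright executes faster than Selenium.
  • Selenium doesn't support features such as detailed reporting and video recording, while Playwright has built-in support for them.
  • Selenium supports more browsers than Playwright.
  • Selenium supports more programming languages.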

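Scalability is one of the main headaches with web scrapers built on frameworks such as Playwright or Selenium, because they can trigger anti-bot security measures and get blocked. One of the best ways to avoid this is to use a web scraping API such as ZenRows, which can get around anti-bot measures while scraping web pages.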
