What Is the Difference Between Playwright and Selenium?
It's easy to feel lost when choosing between Playwright and Selenium for web scraping, since both are popular open-source automation tools.
It's important to consider your scraping needs and criteria, such as supported languages, documentation, and browser support.
Let's get into the details. We'll go over their pros and cons, along with a real example of how to scrape a web page with Playwright and with Selenium.
Playwright
Playwright is an end-to-end web testing and automation library developed by Microsoft. Although the framework's main role is testing web applications, it can also be used for web scraping.
What are the advantages of Playwright?
The advantages of Playwright are:
- It supports all modern rendering engines, including Chromium, WebKit, and Firefox.
- Playwright can be used on Windows, Linux, macOS, or in CI.
- It supports TypeScript, JavaScript (Node.js), Python, .NET, and Java.
- Playwright executes faster than Selenium.
- Playwright supports auto-waiting and performs relevant checks on elements.
- You can generate selectors by inspecting the page, and generate scenarios by recording your actions.
- Playwright supports concurrent execution and can also block unnecessary resource requests.
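The request-blocking idea can be sketched as below. This is a minimal sketch: the `_FakeRoute` class is a hypothetical stand-in for Playwright's `Route` object so the filter logic can run without a browser; with a real page you would register the handler via `page.route("**/*", handle_route)` before calling `page.goto()`.

```python
# Resource types that are usually unnecessary when scraping text content.
BLOCKED_TYPES = {"image", "font", "media"}

def handle_route(route):
    # Abort requests for heavy resources; let everything else through.
    if route.request.resource_type in BLOCKED_TYPES:
        route.abort()
    else:
        route.continue_()

# With a real Playwright page, install the filter before navigating:
#   page.route("**/*", handle_route)
#   page.goto("https://www.scrapethissite.com/pages/simple/")

class _FakeRequest:
    def __init__(self, resource_type):
        self.resource_type = resource_type

class _FakeRoute:
    # Hypothetical stand-in for Playwright's Route, used only to exercise the filter.
    def __init__(self, resource_type):
        self.request = _FakeRequest(resource_type)
        self.action = None

    def abort(self):
        self.action = "abort"

    def continue_(self):
        self.action = "continue"

image_route = _FakeRoute("image")
handle_route(image_route)
print(image_route.action)  # abort

document_route = _FakeRoute("document")
handle_route(document_route)
print(document_route.action)  # continue
```

Blocking images and fonts this way can noticeably cut page-load time on media-heavy sites, since the scraper only needs the HTML.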
What are the disadvantages of Playwright?
The disadvantages of Playwright are:
- It can only handle emulators, not real devices.
- Playwright doesn't have as large a community as Selenium.
- It doesn't work with legacy browsers and devices.
Web Scraping with Playwright
Let's walk through a quick Playwright web scraping tutorial to compare Playwright's and Selenium's scraping capabilities. We'll extract the 250 table entries from the first page of Scrape This Site.
Start by importing the required package and initializing the browser instance:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # launch the browser instance and define a new context
    browser = p.chromium.launch()
    context = browser.new_context()
```
Navigate to the target web page using the page.goto() method:

```python
page = context.new_page()
page.goto("https://www.scrapethissite.com/pages/simple/")
```
Since each table entry lives in a div with the class country, locate the div elements with the page.locator() method using a CSS class selector. Also, store the number of matching elements so you can loop over them later:

```python
countries = page.locator("div.country")
n_countries = countries.count()
```
The next step is to extract the name, capital, population, and area with an extract_data() method, like this:

```python
def extract_data(entry):
    name = entry.locator("h3").inner_text().strip("\n").strip()
    capital = entry.locator("span.country-capital").inner_text()
    population = entry.locator("span.country-population").inner_text()
    area = entry.locator("span.country-area").inner_text()
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area}
```
Extract the data using the extract_data function, then close the browser instance:

```python
data = []
for i in range(n_countries):
    entry = countries.nth(i)
    sample = extract_data(entry)
    data.append(sample)

browser.close()
```
Congratulations! You've successfully scraped a web page with Playwright. Your output should look like this:
```
[
    {'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': '84000', 'area (km sq)': '468.0'},
    {'name': 'United Arab Emirates', 'capital': 'Abu Dhabi', 'population': '4975593', 'area (km sq)': '82880.0'},
    {'name': 'Afghanistan', 'capital': 'Kabul', 'population': '29121286', 'area (km sq)': '647500.0'},
    {'name': 'Antigua and Barbuda', 'capital': "St. John's", 'population': '86754', 'area (km sq)': '443.0'},
    {'name': 'Anguilla', 'capital': 'The Valley', 'population': '13254', 'area (km sq)': '102.0'},
    ...
]
```
If you got lost at any point, here's the complete Playwright code:
```python
from playwright.sync_api import sync_playwright

def extract_data(entry):
    name = entry.locator("h3").inner_text().strip("\n").strip()
    capital = entry.locator("span.country-capital").inner_text()
    population = entry.locator("span.country-population").inner_text()
    area = entry.locator("span.country-area").inner_text()
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area}

with sync_playwright() as p:
    # launch the browser instance and define a new context
    browser = p.chromium.launch()
    context = browser.new_context()

    # open a new tab and go to the website
    page = context.new_page()
    page.goto("https://www.scrapethissite.com/pages/simple/")
    page.wait_for_load_state("load")

    # get the countries
    countries = page.locator("div.country")
    n_countries = countries.count()

    # loop through the elements and scrape the data
    data = []
    for i in range(n_countries):
        entry = countries.nth(i)
        sample = extract_data(entry)
        data.append(sample)

    browser.close()
```
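Note that the numeric fields come back as strings. As a sketch of a possible post-processing step (the `clean_record` helper is hypothetical, not part of the original script, and assumes the record format shown above), you could convert them to numbers before further analysis:

```python
def clean_record(rec):
    # Convert the scraper's string fields to numeric types (hypothetical helper).
    return {
        "name": rec["name"],
        "capital": rec["capital"],
        "population": int(rec["population"]),
        "area (km sq)": float(rec["area (km sq)"]),
    }

sample = {'name': 'Andorra', 'capital': 'Andorra la Vella',
          'population': '84000', 'area (km sq)': '468.0'}
print(clean_record(sample))
# {'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': 84000, 'area (km sq)': 468.0}
```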
Selenium
Selenium is one of the most popular open-source tools for web scraping and web automation. When scraping with Selenium, you can automate browsers, interact with UI elements, and imitate user actions on web applications. Some of Selenium's core components include WebDriver, Selenium IDE, and Selenium Grid.
What are the advantages of Selenium?
The advantages of Selenium are:
- It's easy to use.
- It can automate a wide range of browsers, including IE, and even mobile browsers and mobile apps by using Appium.
- It supports many programming languages, such as Java, C#, Python, Perl, JavaScript, and Ruby.
- It runs on Windows, macOS, and Linux.
What are the disadvantages of Selenium?
The disadvantages of Selenium are:
- Compared with Playwright, Selenium requires third-party tools for parallel execution.
- There's no built-in reporting support. For example, if you need video recording, you have to use an external solution.
- Scraping data from multiple tabs in Selenium is cumbersome.
- It doesn't generate execution reports for debugging.
Web Scraping with Selenium
Just like we did for Playwright, let's build a simple web scraper with Selenium. To do that, import the necessary modules and configure the Selenium instance. Make sure headless mode is active by setting options.headless = True:
```python
# required selenium modules
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
# web driver manager: https://github.com/SergeyPirogov/webdriver_manager
# will help us automatically download the web driver binaries
# then we can use `Service` to manage the web driver's state.
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.headless = True
```
Install the web driver with WebDriverManager, then initialize the Chrome service and define the driver instance:
```python
# this returns the path of the downloaded web driver
chrome_path = ChromeDriverManager().install()
chrome_service = Service(chrome_path)
driver = webdriver.Chrome(service=chrome_service, options=options)
```
Navigate to the web page and find the div elements that store the countries:

```python
url = "https://www.scrapethissite.com/pages/simple/"
driver.get(url)

# get the data divs
countries = driver.find_elements(By.CSS_SELECTOR, "div.country")
```
Define a function to extract the data:

```python
def extract_data(row):
    name = row.find_element(By.TAG_NAME, "h3").text.strip("\n").strip()
    capital = row.find_element(By.CSS_SELECTOR, "span.country-capital").text
    population = row.find_element(By.CSS_SELECTOR, "span.country-population").text
    area = row.find_element(By.CSS_SELECTOR, "span.country-area").text
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area}
```
Apply the map function to extract the values, then quit the web driver instance:

```python
# process the extracted data
data = list(map(extract_data, countries))

driver.quit()
```
Congratulations! Here's the output after running the script:
```
[
    {'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': '84000', 'area (km sq)': '468.0'},
    {'name': 'United Arab Emirates', 'capital': 'Abu Dhabi', 'population': '4975593', 'area (km sq)': '82880.0'},
    {'name': 'Afghanistan', 'capital': 'Kabul', 'population': '29121286', 'area (km sq)': '647500.0'},
    {'name': 'Antigua and Barbuda', 'capital': "St. John's", 'population': '86754', 'area (km sq)': '443.0'},
    {'name': 'Anguilla', 'capital': 'The Valley', 'population': '13254', 'area (km sq)': '102.0'},
    ...
]
```
The complete code looks like this:
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
# web driver manager: https://github.com/SergeyPirogov/webdriver_manager
# will help us automatically download the web driver binaries
# then we can use `Service` to manage the web driver's state.
from webdriver_manager.chrome import ChromeDriverManager

def extract_data(row):
    name = row.find_element(By.TAG_NAME, "h3").text.strip("\n").strip()
    capital = row.find_element(By.CSS_SELECTOR, "span.country-capital").text
    population = row.find_element(By.CSS_SELECTOR, "span.country-population").text
    area = row.find_element(By.CSS_SELECTOR, "span.country-area").text
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area}

options = webdriver.ChromeOptions()
options.headless = True

# this returns the path of the downloaded web driver
chrome_path = ChromeDriverManager().install()

# define the chrome service and pass it to the driver instance
chrome_service = Service(chrome_path)
driver = webdriver.Chrome(service=chrome_service, options=options)

url = "https://www.scrapethissite.com/pages/simple"
driver.get(url)

# get the data divs
countries = driver.find_elements(By.CSS_SELECTOR, "div.country")

# extract the data
data = list(map(extract_data, countries))

driver.quit()
```
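To persist the scraped records, you could serialize them with the standard library's csv module. This is a sketch assuming the list-of-dicts output shown above; the `to_csv` helper is hypothetical:

```python
import csv
import io

def to_csv(rows):
    # Render the scraped dicts as CSV text; assumes all rows share the same keys.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = [
    {'name': 'Andorra', 'capital': 'Andorra la Vella',
     'population': '84000', 'area (km sq)': '468.0'},
]
print(to_csv(rows))
```

To write straight to disk instead, pass a file object opened with `newline=""` to `csv.DictWriter` rather than the `StringIO` buffer.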
Which Is Faster: Playwright or Selenium?
If we're talking about speed in the Selenium vs. Playwright debate, there's only one answer: Playwright is faster than Selenium. But by how much?
To compare the speed of Selenium and Playwright, we used the time module and slightly tweaked the scripts to include timing calculations. We added start_time = time.time() and end_time = time.time() at the top and bottom of each script, then calculated the difference with end_time - start_time.
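The timing pattern boils down to the following minimal sketch, where `time.sleep` stands in for the scraper body:

```python
import time

start_time = time.time()

# ... the scraping work goes here ...
time.sleep(0.1)  # stand-in for the scraper body

end_time = time.time()
print(f"The whole script took: {end_time - start_time:.4f}")
```

Note that this measures wall-clock time, including browser startup, so results vary between runs and machines.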
Here's the script for Playwright:
```python
import time
from playwright.sync_api import sync_playwright

def extract_data(entry):
    name = entry.locator("h3").inner_text().strip("\n").strip()
    capital = entry.locator("span.country-capital").inner_text()
    population = entry.locator("span.country-population").inner_text()
    area = entry.locator("span.country-area").inner_text()
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area}

start = time.time()

with sync_playwright() as p:
    # launch the browser instance and define a new context
    browser = p.chromium.launch()
    context = browser.new_context()

    # open a new tab and go to the website
    page = context.new_page()
    page.goto("https://www.scrapethissite.com/pages/")

    # click the link to the first page and wait while the page loads
    page.locator("a[href='/pages/simple/']").click()
    page.wait_for_load_state("load")

    # get the countries
    countries = page.locator("div.country")
    n_countries = countries.count()

    data = []
    for i in range(n_countries):
        entry = countries.nth(i)
        sample = extract_data(entry)
        data.append(sample)

    browser.close()

end = time.time()
print(f"The whole script took: {end-start:.4f}")
```
And here's the script for Selenium:
```python
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
# web driver manager: https://github.com/SergeyPirogov/webdriver_manager
# will help us automatically download the web driver binaries
# then we can use `Service` to manage the web driver's state.
from webdriver_manager.chrome import ChromeDriverManager

def extract_data(row):
    name = row.find_element(By.TAG_NAME, "h3").text.strip("\n").strip()
    capital = row.find_element(By.CSS_SELECTOR, "span.country-capital").text
    population = row.find_element(By.CSS_SELECTOR, "span.country-population").text
    area = row.find_element(By.CSS_SELECTOR, "span.country-area").text
    return {"name": name, "capital": capital, "population": population, "area (km sq)": area}

# start the timer
start = time.time()

options = webdriver.ChromeOptions()
options.headless = True

# this returns the path of the downloaded web driver
chrome_path = ChromeDriverManager().install()

# define the chrome service and pass it to the driver instance
chrome_service = Service(chrome_path)
driver = webdriver.Chrome(service=chrome_service, options=options)

url = "https://www.scrapethissite.com/pages/"
driver.get(url)

# get the first page and click the link
first_page = driver.find_element(By.CSS_SELECTOR, "h3.page-title a")
first_page.click()

# get the data divs
countries = driver.find_elements(By.CSS_SELECTOR, "div.country")

# scrape the data using the extract_data function
data = list(map(extract_data, countries))

end = time.time()
print(f"The whole script took: {end-start:.4f}")

driver.quit()
```
We added these timers to their respective scrapers and ran the code. And there you have it! The results of our Playwright vs. Selenium speed test show that Playwright is roughly five times faster than Selenium.
Selenium vs. Playwright: Which Is Better?
Both Playwright and Selenium are excellent automation tools, capable of scraping web pages seamlessly when done right. However, picking between them can be a headache, and the best option depends on your web scraping needs, the type of data you want to scrape, browser support, and other considerations.
To recap, here are some of the main differences between Selenium and Playwright:
- Playwright doesn't support real devices, whereas Selenium can be used with real devices and remote servers.
- Playwright has built-in support for parallelization, whereas Selenium requires third-party tools.
- Playwright executes faster than Selenium.
- Selenium doesn't support features such as detailed reporting and video recording, whereas Playwright has built-in support for them.
- Selenium supports more browsers than Playwright.
- Selenium supports more programming languages.
One of the main pain points of web scrapers built on frameworks like Playwright or Selenium is scalability, since they can trigger anti-bot security measures and get blocked. One of the best ways to avoid that is to use a web scraping API, such as ZenRows, which can bypass anti-bot measures while scraping web pages.