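Selenium vs BeautifulSoup: Which Is Better for Web Scraping
Choosing between Selenium and BeautifulSoup for web scraping isn't rocket science. Both are excellent libraries, but there are some key differences to consider when making this decision, such as programming language compatibility, browser support, and performance.
The table below highlights the main differences between Selenium and BeautifulSoup: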
BeautifulSoup
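BeautifulSoup is a Python library for parsing HTML and XML documents and navigating the resulting tree of structured data. It provides a simple, easy-to-use API for extracting information from a web page's HTML or XML code.
The library parses and navigates pages, making it easy to find and extract the content you need. BeautifulSoup can pull text, links, images, and other elements from a web page.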
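As a minimal sketch of that kind of extraction, the snippet below parses a small, made-up HTML string (the markup, URLs, and variable names are assumptions for illustration, not taken from a real site) and pulls out a heading, the link targets, and the image sources:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = """
<html><body>
  <h1>Sample page</h1>
  <p>First paragraph.</p>
  <a href="/about">About</a>
  <a href="https://example.com/contact">Contact</a>
  <img src="/static/logo.png" alt="Logo">
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of the first <h1> element.
heading = soup.find("h1").get_text(strip=True)
# href of every <a> tag that has one.
links = [a["href"] for a in soup.find_all("a", href=True)]
# src of every <img> tag that has one.
images = [img["src"] for img in soup.find_all("img", src=True)]

print(heading)
print(links)
print(images)
```

Passing href=True or src=True skips tags that lack the attribute, so the list comprehensions don't raise a KeyError on incomplete markup.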
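What Are the Advantages of BeautifulSoup?
- It's faster.
- It's beginner-friendly and easier to set up.
- It works independently of browsers.
- It takes less time to run.
- It can parse HTML and XML documents.
- It's easier to debug.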
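What Are the Disadvantages of BeautifulSoup?
BeautifulSoup's disadvantages are:
- It can't interact with web pages the way a human user would.
- It only parses data. You therefore need another module, such as requests or httpx, to download the page in the first place.
- It only supports Python.
- You need a different module to scrape JavaScript-rendered web pages, because BeautifulSoup only lets you navigate HTML or XML files.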
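To make the last point concrete, here is a small sketch under assumed markup: a hypothetical page whose price is filled in by a script after load. BeautifulSoup only sees the HTML as delivered and never executes the script, so the element comes back empty.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the price is injected by JavaScript in the browser.
html = """
<html><body>
  <div id="price"></div>
  <script>
    document.getElementById("price").textContent = "$19.99";
  </script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
price_div = soup.find("div", id="price")

# The script was parsed as text but never run, so the div has no content.
price_text = price_div.get_text(strip=True)
print(repr(price_text))
```

A browser-driven tool would execute the script first and then expose the filled-in element.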
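When to Use BeautifulSoup
BeautifulSoup is best suited to web scraping tasks that involve parsing and extracting information from static HTML pages and XML documents. For example, if you need to scrape data from a website with a simple structure, such as a blog or an online store, BeautifulSoup can easily extract the information you need by parsing the HTML code.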
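For a static page like that, the extraction can be sketched as follows. The product markup and class names are invented for illustration; a real store will use its own structure.

```python
from bs4 import BeautifulSoup

# Hypothetical static product listing.
html = """
<ul class="products">
  <li class="product"><span class="name">Mug</span><span class="price">$8.50</span></li>
  <li class="product"><span class="name">Notebook</span><span class="price">$4.00</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

products = []
for item in soup.select("li.product"):
    # select_one returns the first element matching the CSS selector.
    name = item.select_one("span.name").get_text(strip=True)
    price_text = item.select_one("span.price").get_text(strip=True)
    products.append({"name": name, "price": float(price_text.lstrip("$"))})

print(products)
```

The select and select_one methods take CSS selectors, which is often more compact than chaining find calls when the page structure is regular.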
If you want to scrape dynamic content, Selenium is the better choice.
Web Scraping Example Using BeautifulSoup
Let's go through a quick scraping tutorial to get more insight into how BeautifulSoup and Selenium compare in performance.
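Since BeautifulSoup only provides a way to navigate the data, we'll use another module to download the website's data. Let's use requests and scrape a paragraph from a Wikipedia article:
Figure: A random Wikipedia page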
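First, inspect the page elements to find the introduction. You'll find it in the second p tag under the div with the class mw-parser-output:
Figure: Inspecting the Wikipedia page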
All that's left is to send a GET request to the website and define a BeautifulSoup object to get the elements. Start by importing the necessary tools:
    # load the required packages
    from bs4 import BeautifulSoup
    # we need a module to connect to websites; you can also use the built-in urllib module
    import requests

    url = "https://en.wikipedia.org/wiki/CSS_Baltic"

    # get the website data
    response = requests.get(url)
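After that, define the object and parse the HTML. Then, extract the data using the find and find_all methods.
find returns the first occurrence of the element. find_all returns all the elements found.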
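The difference between the two methods can be seen on a tiny, made-up document (the markup below is an assumption for illustration only):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three paragraphs and no table.
html = """
<div class="mw-parser-output">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <p>Third paragraph.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")           # first matching element only
all_p = soup.find_all("p")         # list-like collection of every match
missing = soup.find("table")       # None when nothing matches
no_tables = soup.find_all("table") # empty collection when nothing matches

print(first_p.text)
print(len(all_p))
print(all_p[1].text)
print(missing)
```

Because find returns None on a miss, it is worth checking the result before reading attributes from it; find_all simply yields an empty result.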
    # parse response text using html.parser
    soup = BeautifulSoup(response.text, "html.parser")

    # get the main div element
    main_div = soup.find("div", {"class": "mw-body-content mw-content-ltr"})

    # extract the content div
    content_div = main_div.find("div", {"class": "mw-parser-output"})

    # the second p tag inside the content div is the introduction
    second_p = content_div.find_all("p")[1]

    # print out the extracted data
    print(second_p.text)
Here's the complete BeautifulSoup code:
    # load the required packages
    from bs4 import BeautifulSoup
    # we need a module to connect to websites; you can also use the built-in urllib module
    import requests

    url = "https://en.wikipedia.org/wiki/CSS_Baltic"

    # get the website data
    response = requests.get(url)

    # parse response text using html.parser
    soup = BeautifulSoup(response.text, "html.parser")

    # get the main div element
    main_div = soup.find("div", {"class": "mw-body-content mw-content-ltr"})

    # extract the content div
    content_div = main_div.find("div", {"class": "mw-parser-output"})

    # the second p tag inside the content div is the introduction
    second_p = content_div.find_all("p")[1]

    # print out the extracted data
    print(second_p.text)
After running the script, the output should look like this:
CSS[a] Baltic was an ironclad warship that served in the Confederate States Navy during the American Civil War. A towboat before the war, she was purchased by the state of Alabama in December 1861 for conversion into an ironclad. After being transferred to the Confederate Navy in May 1862 as an ironclad, she served on Mobile Bay off the Gulf of Mexico. Baltic's condition in Confederate service was such that naval historian William N. Still Jr. has described her as "a nondescript vessel in many ways".[3] Over the next two years, parts of the ship's wooden structure were affected by wood rot. Her armor was removed to be put onto the ironclad CSS Nashville in 1864. By that August, Baltic had been decommissioned. Near the end of the war, she was taken up the Tombigbee River, where she was captured by Union forces on May 10, 1865. An inspection of Baltic the next month found that her upper hull and deck were rotten and that her boilers were unsafe. She was sold on December 31, and was likely broken up in 1866.
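That's it!
Although BeautifulSoup can only scrape static web pages, you can also extract dynamic data by combining it with a different library. To learn how, see the guide on using the ZenRows API with Python Requests and BeautifulSoup.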
Selenium
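Selenium is an open-source browser automation tool that is often used for web scraping. It has been around for over a decade, and its main components are Selenium IDE (for recording actions before automating them), Selenium WebDriver (for executing commands in the browser), and Selenium Grid (for parallel execution).
Selenium can also handle dynamic web pages, which are difficult to scrape with BeautifulSoup.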
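What Are the Advantages of Selenium?
Selenium's advantages are:
- It's easy to use.
- It supports multiple programming languages, such as JavaScript, Ruby, Python, and C#.
- It can automate Firefox, Edge, Safari, and even a custom QtWebKit browser.
- Selenium can interact with a web page's JavaScript code, execute XHR requests, and wait for elements to load before scraping the data. In other words, you can easily scrape dynamic web pages, whereas with BeautifulSoup it is more challenging to detect and interact with pages whose content changes after they load.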
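What Are the Disadvantages of Selenium?
Selenium's disadvantages are:
- Setting up Selenium is complex.
- It uses more resources than BeautifulSoup.
- It can become slow as you start scaling up your application.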
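When to Use Selenium
A key difference between Selenium and BeautifulSoup is the type of data they can scrape. Selenium is ideal for scraping websites that require interaction with the page, such as filling out forms, clicking buttons, or navigating between pages. For example, if you need to scrape data from a website that requires a login, Selenium can automate the login process and navigate through the pages to scrape the data.
In addition, Selenium is an excellent library for scraping JS-rendered web pages.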
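Web Scraping Example Using Selenium
Let's go through a tutorial on web scraping with Selenium using the same web page. We'll also navigate to the article with Selenium to highlight dynamic content scraping.
Start by importing the required packages: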
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    # web driver manager: https://github.com/SergeyPirogov/webdriver_manager
    # will help us automatically download the web driver binaries
    # then we can use `Service` to manage the web driver's state.
    from webdriver_manager.chrome import ChromeDriverManager

    # we will need these to wait for dynamic content to load
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
Then, configure the Selenium instance using WebDriver's options object:
    # we can configure selenium using webdriver's options object
    options = webdriver.ChromeOptions()
    # run Chrome in headless mode; recent Selenium 4 releases no longer
    # support the older `options.headless = True` attribute
    options.add_argument("--headless=new")

    # this returns the path of the web driver that was downloaded
    chrome_path = ChromeDriverManager().install()

    # define the chrome service and pass it to the driver instance
    chrome_service = Service(chrome_path)
    driver = webdriver.Chrome(service=chrome_service, options=options)
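In this case, we'll go to Wikipedia's main page and use the search bar. This will show Selenium's ability to interact with the page and scrape dynamic content. When we inspect the element, we can see that the search bar is an input element with the class vector-search-box-input.
Figure: The search box on the Wikipedia page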
You need to click the input and then type the query. For this, Selenium provides the send_keys function, which fills in the form for us.
    # open Wikipedia's main page
    url = "https://en.wikipedia.org/wiki/Main_Page"
    driver.get(url)

    # find the search box
    search_box = driver.find_element(By.CSS_SELECTOR, "input.vector-search-box-input")

    # click the search box
    search_box.click()

    # search for the article
    search_box.send_keys("CSS Baltic")
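At this point Selenium is connected to the web page and interacting with it: it has found the search bar and typed in the article title. You'll find that the search results are stored as a tags with the class mw-searchSuggest-link:
Figure: The title selector on the Wikipedia page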
Click the first result to extract the data, using WebDriverWait() to wait for the web page to load:
    try:
        # wait up to 10 seconds for the content to load.
        search_suggestions = WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a.mw-searchSuggest-link"))
        )

        # click the first suggestion
        search_suggestions[0].click()

        # extract the data using the same selectors as in BeautifulSoup.
        main_div = driver.find_element(By.CSS_SELECTOR, "div.mw-body-content")
        content_div = main_div.find_element(By.CSS_SELECTOR, "div.mw-parser-output")
        paragraphs = content_div.find_elements(By.TAG_NAME, "p")

        # we need the second paragraph
        intro = paragraphs[1].text
        print(intro)
    except Exception as error:
        print(error)
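The idea behind WebDriverWait(...).until(...) is to poll a condition repeatedly until it succeeds or a timeout expires. The snippet below is a simplified pure-Python sketch of that polling pattern, not Selenium's actual implementation; wait_until and suggestions_loaded are hypothetical names, and the "suggestions" are simulated with a timer rather than a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical stand-in for dynamic content: results "appear" after 0.2 seconds.
start = time.monotonic()


def suggestions_loaded():
    if time.monotonic() - start >= 0.2:
        return ["CSS Baltic"]
    return []


suggestions = wait_until(suggestions_loaded, timeout=2.0, poll_interval=0.05)
print(suggestions[0])
```

Returning the condition's result, rather than just True, mirrors how the tutorial code receives the list of located elements from until(...) and indexes into it.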
Here's what the code looks like after combining all the blocks:
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    # web driver manager: https://github.com/SergeyPirogov/webdriver_manager
    # will help us automatically download the web driver binaries
    # then we can use `Service` to manage the web driver's state.
    from webdriver_manager.chrome import ChromeDriverManager

    # we will need these to wait for dynamic content to load
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # we can configure selenium using webdriver's options object
    options = webdriver.ChromeOptions()
    # run Chrome in headless mode; recent Selenium 4 releases no longer
    # support the older `options.headless = True` attribute
    options.add_argument("--headless=new")

    # this returns the path of the web driver that was downloaded
    chrome_path = ChromeDriverManager().install()

    # define the chrome service and pass it to the driver instance
    chrome_service = Service(chrome_path)
    driver = webdriver.Chrome(service=chrome_service, options=options)

    url = "https://en.wikipedia.org/wiki/Main_Page"
    driver.get(url)

    # find the search box
    search_box = driver.find_element(By.CSS_SELECTOR, "input.vector-search-box-input")

    # click the search box
    search_box.click()

    # search for the article
    search_box.send_keys("CSS Baltic")

    try:
        # wait up to 10 seconds for the content to load.
        search_suggestions = WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a.mw-searchSuggest-link"))
        )

        # click the first suggestion
        search_suggestions[0].click()

        # extract the data using the same selectors as in BeautifulSoup.
        main_div = driver.find_element(By.CSS_SELECTOR, "div.mw-body-content")
        content_div = main_div.find_element(By.CSS_SELECTOR, "div.mw-parser-output")
        paragraphs = content_div.find_elements(By.TAG_NAME, "p")

        # we need the second paragraph
        intro = paragraphs[1].text
        print(intro)
    except Exception as error:
        print(error)

    driver.quit()
Take a look at the output:
CSS[a] Baltic was an ironclad warship that served in the Confederate States Navy during the American Civil War. A towboat before the war, she was purchased by the state of Alabama in December 1861 for conversion into an ironclad. After being transferred to the Confederate Navy in May 1862 as an ironclad, she served on Mobile Bay off the Gulf of Mexico. Baltic's condition in Confederate service was such that naval historian William N. Still Jr. has described her as "a nondescript vessel in many ways".[3] Over the next two years, parts of the ship's wooden structure were affected by wood rot. Her armor was removed to be put onto the ironclad CSS Nashville in 1864. By that August, Baltic had been decommissioned. Near the end of the war, she was taken up the Tombigbee River, where she was captured by Union forces on May 10, 1865. An inspection of Baltic the next month found that her upper hull and deck were rotten and that her boilers were unsafe. She was sold on December 31, and was likely broken up in 1866.
Key Differences: Selenium vs BeautifulSoup
Let's analyze the key considerations for deciding between the two libraries:
- Functionality.
- Speed.
- Ease of use.
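Functionality
Selenium is a web browser automation tool that can interact with web pages like a human user, whereas BeautifulSoup is a library for parsing HTML and XML documents. This means Selenium has more functionality, since it can automate browser actions such as clicking buttons, filling out forms, and navigating between pages. BeautifulSoup is more limited and is mainly used for parsing and extracting data.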
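Speed
Which library is faster, BeautifulSoup or Selenium? You're not the first to ask, and the answer is clear: BeautifulSoup is faster than Selenium because it doesn't need an actual browser instance.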
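To compare the speed of Selenium and BeautifulSoup, we used ScrapeThisSite, ran the scripts shown above 1,000 times, and plotted the results as a bar chart.
We used Python's os module to run the scripts and the time module to calculate the time differences. We set t0 = time.time(), t1 = time.time(), and t2 = time.time() between the os commands and stored the differences. The results were saved in a Pandas DataFrame.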
    import os
    import time

    import matplotlib.pyplot as plt
    import pandas as pd

    # collected run times in seconds for each library
    d = {"selenium": [], "bs4": []}
    N = 1000

    for i in range(N):
        print("-" * 20, f"Experiment {i+1}", "-" * 20)
        t0 = time.time()
        os.system("python3 'beautifulsoup_script.py'")
        t1 = time.time()
        os.system("python3 'selenium_script.py'")
        t2 = time.time()
        # t1 - t0 is the BeautifulSoup run, t2 - t1 is the Selenium run
        d["selenium"].append(t2 - t1)
        d["bs4"].append(t1 - t0)

    # store the results in a DataFrame and save them to disk
    df = pd.DataFrame(d)
    df.to_csv("data.csv", index=False)
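As an alternative sketch of the same timing idea, the standard library's subprocess.run and time.perf_counter can be used in place of os.system and time.time. This is not the benchmark the article ran; time_command is a hypothetical helper, and the command being timed is a do-nothing placeholder so the snippet runs anywhere.

```python
import subprocess
import sys
import time


def time_command(args):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the command exits with an error.
    subprocess.run(args, check=True)
    return time.perf_counter() - start


# Placeholder command; for the real comparison, swap in something like
# [sys.executable, "beautifulsoup_script.py"].
placeholder = [sys.executable, "-c", "pass"]

durations = [time_command(placeholder) for _ in range(3)]
print(durations)
```

time.perf_counter is a monotonic, high-resolution clock intended for measuring intervals, and check=True makes a failing script stop the run instead of silently contributing a misleading timing.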
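Here are the Selenium vs BeautifulSoup test results after running the code:
Figure: Selenium vs bs4
The results show that BeautifulSoup is about 70% faster than Selenium. So on this particular criterion, BeautifulSoup is the better choice.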
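The article does not say how the 70% figure was derived. Two common readings of "X% faster" are the share of run time saved and the speedup ratio; the sketch below computes both from made-up timings (these numbers are assumptions chosen for illustration, not the article's measurements).

```python
from statistics import mean

# Hypothetical timings in seconds, NOT the measured results.
bs4_times = [0.9, 1.0, 1.1]
selenium_times = [3.2, 3.4, 3.3]

bs4_mean = mean(bs4_times)
selenium_mean = mean(selenium_times)

# Share of Selenium's run time that BeautifulSoup saves.
time_saved_pct = (selenium_mean - bs4_mean) / selenium_mean * 100
# How many times faster BeautifulSoup finishes.
speedup = selenium_mean / bs4_mean

print(f"time saved: {time_saved_pct:.1f}%")
print(f"speedup: {speedup:.2f}x")
```

With real data, the same two formulas can be applied to the column means of the saved data.csv file.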
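Ease of Use
BeautifulSoup is easier to use than Selenium. It has a simple API that is easy for beginners to understand. Selenium, on the other hand, can be more complex to set up and use, since it requires an understanding of concepts such as web drivers and browser automation.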
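Which Is Better: Selenium vs BeautifulSoup
There's no direct answer as to whether Selenium or BeautifulSoup is better for scraping, since it depends on factors like your web scraping needs, long-term library support, and cross-browser support. BeautifulSoup is fast, but compared with Selenium it supports fewer programming languages and can only scrape static web pages.
BeautifulSoup and Selenium are undoubtedly excellent libraries for scraping, but they can become a headache when scraping at scale or scraping popular websites, because protection measures may detect your bot. The best way to avoid this is to use a web scraping API like ZenRows.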