How to Run Headless Firefox Programmatically with Selenium in Python

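Firefox remains one of the most popular web browsers in 2023, and it ships with a useful tool for web scraping: headless mode.

In this tutorial, we'll cover when to use headless Firefox and how to run it with Selenium in Python.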

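What Is Headless Firefox?

Headless Firefox essentially means we don't use the browser in the usual way. Instead, we drive it with a webdriver tool, without a user interface. That is the difference between running Firefox headless and running it normally.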

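What Is a Headless Browser? Let's Talk About the Benefits

A web browser client without a graphical user interface (GUI) is called a headless browser. It is meant to be driven by scripts or bots.

Most scrapers need to operate a web browser, since you often have to scroll, fill in forms, and perform similar actions. Running headless also saves machine resources, especially on large jobs.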
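As an illustration of the kind of browser action mentioned above, here is a minimal sketch of a scrolling helper. The function name `scroll_to_bottom` and its `pause` and `max_rounds` parameters are assumptions made for this example; the only thing it expects from `driver` is Selenium's `execute_script()` method.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll until the page height stops growing or max_rounds is reached.

    `driver` is assumed to be a Selenium WebDriver, or any object that
    exposes a compatible execute_script() method.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Jump to the current bottom of the page.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give lazily loaded content a moment to appear.
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was loaded, so we are at the real bottom.
            break
        last_height = new_height
    return last_height
```

A fixed pause is the simplest choice here; on pages that load content slowly, an explicit wait on a specific element would likely be more reliable.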

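Can Firefox Run Headless? How?

Yes, it can. Several browser automation tools that support Firefox, such as Selenium, provide a headless Firefox webdriver for establishing a connection to the browser.

Chrome is the browser most commonly run headless, because it has a larger ecosystem and more automation tools, such as undetected_chromedriver and puppeteer-extra-plugin-stealth. Still, Firefox is a common alternative to Chrome for headless browsing, since many tools in Python and other languages serve the same purpose.

The most popular tools for running Firefox headless are Selenium and Playwright. Puppeteer is also worth mentioning because it is widely adopted in other languages, but it offers only experimental support for Firefox.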

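How Do You Start Firefox Headless?

The prerequisites for running Firefox headless are Python and Selenium (the library we'll use in this tutorial). You also need to make sure the Firefox browser is installed on your local machine.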

pip install selenium
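To confirm the prerequisites before going further, a quick check along these lines may help. It is a sketch, not part of Selenium: the function name `check_prerequisites` is made up for this example, and it assumes the Firefox executable is called `firefox` and is on your PATH, which is often not the case on Windows or macOS even when Firefox is installed.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report whether Selenium is importable and a Firefox binary is on PATH."""
    return {
        # find_spec() returns None when the package cannot be found.
        "selenium": importlib.util.find_spec("selenium") is not None,
        # Assumption: the executable is named "firefox" and is on PATH.
        "firefox": shutil.which("firefox") is not None,
    }


status = check_prerequisites()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

A "missing" result for Firefox only means the binary was not found on PATH, so treat it as a hint rather than proof that the browser is absent.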

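Once that's done, open the code editor of your choice. We'll use ScrapeMe as the target URL and write the script you'll see next, but first let's go over what happens:

Selenium's default Firefox webdriver loads the browser, and the Options interface lets us run it in the background by passing headless as an argument. After the target website loads, the script prints the current URL and title.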

from selenium import webdriver 
from selenium.webdriver.firefox.options import Options 
 
# the target website 
url = "https://scrapeme.live/shop/" 
 
# the interface for turning on headless mode 
options = Options() 
options.add_argument("-headless") 
 
# starting headless Firefox through Selenium's Firefox webdriver
with webdriver.Firefox(options=options) as driver: 
    # opening the target website in the browser 
    driver.get(url) 
 
    #printing the target website url and title 
    print(driver.current_url) # https://scrapeme.live/shop/ 
    print(driver.title) # Products - ScrapeMe

To run Firefox in normal mode, simply comment out or remove the headless option, as shown below:

# ... 
url = "https://scrapeme.live/shop/" 
 
# options = Options() 
# options.add_argument("-headless") 
 
with webdriver.Firefox() as driver: 
# ...
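Instead of editing the script each time, the headless switch could be driven by an environment variable. The sketch below assumes a variable named `HEADLESS`, which is a name chosen for this example and not something Selenium or Firefox reads; the helper names `headless_enabled`, `build_firefox_options`, and `main` are likewise hypothetical.

```python
import os


def headless_enabled(env=None):
    """Return True unless the (hypothetical) HEADLESS variable is falsy."""
    env = os.environ if env is None else env
    value = env.get("HEADLESS", "1").strip().lower()
    return value not in {"0", "false", "no", "off"}


def build_firefox_options(headless):
    """Build Firefox options, adding the headless argument when requested."""
    # Imported lazily so headless_enabled() works without Selenium installed.
    from selenium.webdriver.firefox.options import Options

    options = Options()
    if headless:
        options.add_argument("-headless")
    return options


def main():
    """Open the target page and print its title."""
    from selenium import webdriver

    options = build_firefox_options(headless_enabled())
    with webdriver.Firefox(options=options) as driver:
        driver.get("https://scrapeme.live/shop/")
        print(driver.title)
```

With this sketch, calling `main()` with `HEADLESS=0` set in the environment would open a visible Firefox window, while leaving the variable unset keeps the browser headless.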

In our case, we're running Firefox headless!

The scrapeme.live shop page

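What Do I Get from Running Firefox Headless?

Running a browser headless should give you the same results as running it in normal mode with a GUI.

The exception is when you are running cross-browser tests on a website and expect it to behave differently in different browser environments from the start. Most websites, however, are optimized to behave the same way in all major browsers to ensure a good user experience.

Suppose we're interested in the product information from the ScrapeMe shop, specifically the product names and prices listed on the page.

On inspection, we find three page elements that let us extract this data:

Inspecting the ScrapeMe shop products

First, the parent element that contains the product name and price: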

<a href="https://scrapeme.live/shop/" class="woocommerce-LoopProduct-link woocommerce-loop-product__link"> ... </a>

The product name is an h2 element:

<h2 class="woocommerce-loop-product__title"> ... </h2>

The product price is a span element:

<span class="woocommerce-Price-amount amount"><span class="woocommerce-Price-currencySymbol">£</span> ... </span>
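How these three elements relate can be tried offline before involving a browser. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath, on a hypothetical, simplified fragment modeled on the elements above. The sample markup is an assumption for illustration and not a copy of the real page; the name and price values are taken from the first product on the shop page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modeled on the three elements shown above.
SAMPLE = """
<ul>
  <li>
    <a href="https://scrapeme.live/shop/" class="woocommerce-LoopProduct-link woocommerce-loop-product__link">
      <h2 class="woocommerce-loop-product__title">Bulbasaur</h2>
      <span class="woocommerce-Price-amount amount"><span class="woocommerce-Price-currencySymbol">£</span>63.00</span>
    </a>
  </li>
</ul>
""".strip()

root = ET.fromstring(SAMPLE)
link_path = ".//a[@class='woocommerce-LoopProduct-link woocommerce-loop-product__link']"

products = []
for link in root.findall(link_path):
    # The leading "." keeps the search inside the current parent element.
    name = link.find(".//h2")
    price = link.find(".//span")
    products.append({
        "name": name.text,
        # itertext() joins the currency symbol with the amount that follows it.
        "price": "".join(price.itertext()),
    })

print(products)
```

The relative `.//` prefix is the important detail: it searches within each parent link instead of the whole document, which is what keeps every name paired with its own price.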

To extract the information from the page, we'll select the elements with XPath in Selenium:

from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.firefox.options import Options 
 
# the target website 
url = "https://scrapeme.live/shop/" 
 
# the interface for turning on headless mode 
options = Options() 
options.add_argument("-headless") 
 
# starting headless Firefox through Selenium's Firefox webdriver
with webdriver.Firefox(options=options) as driver: 
    # opening the target website in the browser 
    driver.get(url) 
 
    print("Page URL:", driver.current_url) 
    print("Page Title:", driver.title) 
 
    # using Selenium's find_elements() API to find the parent element 
    pokemon_list = driver.find_elements(By.XPATH, "//a[@class='woocommerce-LoopProduct-link woocommerce-loop-product__link']") 
 
    # using Selenium's find_element() API to locate each of the child elements
    for pokemon in pokemon_list: 
        pokemon_name = pokemon.find_element(By.XPATH, ".//h2") 
        pokemon_price = pokemon.find_element(By.XPATH, ".//span") 
 
        # parsing the extracted data into a python dictionary 
        clones = { 
            "name": pokemon_name.text, 
            "price": pokemon_price.text 
        } 
 
        print(clones)

Here is the data we scraped:

Page URL: https://scrapeme.live/shop/ 
Page Title: Products - ScrapeMe 
 
{'name': 'Bulbasaur', 'price': '£63.00'} 
{'name': 'Ivysaur', 'price': '£87.00'} 
{'name': 'Venusaur', 'price': '£105.00'} 
{'name': 'Charmander', 'price': '£48.00'} 
{'name': 'Charmeleon', 'price': '£165.00'} 
{'name': 'Charizard', 'price': '£156.00'} 
{'name': 'Squirtle', 'price': '£130.00'} 
{'name': 'Wartortle', 'price': '£123.00'} 
{'name': 'Blastoise', 'price': '£76.00'} 
{'name': 'Caterpie', 'price': '£73.00'} 
{'name': 'Metapod', 'price': '£148.00'} 
{'name': 'Butterfree', 'price': '£162.00'} 
{'name': 'Weedle', 'price': '£25.00'} 
{'name': 'Kakuna', 'price': '£148.00'} 
{'name': 'Beedrill', 'price': '£168.00'} 
{'name': 'Pidgey', 'price': '£159.00'}
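The prices above are strings that include the currency symbol. If numeric values are needed later, a post-processing step along these lines could be added. It is a sketch under stated assumptions: the helper names `parse_price` and `to_csv` are made up for this example, and the three rows are copied from the output above.

```python
import csv
import io
from decimal import Decimal

# First three rows copied from the scraped output above.
rows = [
    {"name": "Bulbasaur", "price": "£63.00"},
    {"name": "Ivysaur", "price": "£87.00"},
    {"name": "Venusaur", "price": "£105.00"},
]


def parse_price(text):
    """Strip the pound sign and any thousands separators, return a Decimal."""
    return Decimal(text.replace("£", "").replace(",", "").strip())


def to_csv(records):
    """Serialize the records to CSV text with numeric prices."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
    writer.writeheader()
    for record in records:
        writer.writerow({
            "name": record["name"],
            "price": parse_price(record["price"]),
        })
    return buffer.getvalue()


csv_text = to_csv(rows)
print(csv_text)
```

`Decimal` is used instead of `float` so that amounts such as 63.00 keep their exact two-decimal form when written out.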

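On many websites, you will have to deal with anti-bot protection. For some practical techniques, check out our guide on how to avoid detection when using a headless browser.

Conclusion

Running Firefox headless with Python and Selenium is one of the best options for automating browser tasks in web scraping. We also learned how to use it to extract data. Many developers also turn to web scraping APIs to get around various protections and save resources.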
