如何用R语言进行数据爬取：网页数据爬取

R语言的网页抓取是人们用来从网站提取数据的最流行方法之一。在此分步教程中，您将学习如何借助 rvest 和 RSelenium 等库在 R 中进行网页抓取。

R 适合网络抓取吗？

R 是一种专为数据科学设计的高级编程语言，具有许多面向数据的库来支持您的网络抓取目标。

设置环境

以下是我们将在本教程中使用的工具：

R 4+：任何版本的 R 大于或等于 4 都可以。在此 R 网络抓取教程中，我们使用了 R 4.2.2 版本，因为它是撰写本文时的最新版本。
R IDE：安装并启用了R Language for IntelliJ插件的PyCharm是构建 R 网络抓取工具的绝佳选择。同样，带有REditorSupport扩展的免费Visual Studio Code IDE也可以。

如果您没有安装这些工具，请单击上面的链接并下载它们。使用官方指南安装它们，您将拥有遵循此分步网络抓取 R 指南所需的一切。

在 PyCharm 中设置 R 项目

环境搭建完成后，在PyCharm中初始化一个R web scraping项目，分三步进行：

启动 PyCharm 并单击“文件 > 新建 > 项目…”菜单选项。
在“新建项目”弹出窗口的左侧栏中，选择“R 项目”。
为您的 R 项目命名，并确保选择一个 R 解释器，如下图所示：

单击“创建”并等待 PyCharm 初始化您的 R web 项目。您现在应该可以访问以下项目：

main.RPyCharm 将为您创建一个空白文件。请注意 PyCharm 提供的 R 工具部分和 R 控制台选项卡，因为它们分别允许您访问 R shell 和包管理。

我们已经掌握了基础知识，所以让我们深入了解如何在 R 中构建数据抓取器的细节。

如何使用R语言抓取网站

对于这个 R 数据抓取教程，我们将使用ScrapeMe作为我们的目标站点。

请注意，ScrapeMe 仅包含一个分页的 Pokemon 启发元素列表。您将学习如何构建的 R 抓取工具将能够抓取整个网站并提取所有产品数据。

第 1 步：安装 rvest

rvest是一个 R 库，可帮助您通过其高级 R 网络抓取 API 从网页中抓取数据。这使您能够下载 HTML 文档、解析它、选择 HTML 元素并从中提取数据。

要安装 rvest，请在 PyCharm 中打开 R 控制台并启动以下命令：

install.packages("rvest")

等待安装过程完成。这可能需要几秒钟，尤其是当 R 还需要安装其依赖项时。安装后，将其加载到您的main.R文件中：

library(rvest)

如果 PyCharm 没有报告错误，则安装过程成功结束。

第 2 步：检索 HTML 页面

使用带有 rvest 的远程 URL 使用单行代码下载 HTML 文档：

# retrieving the target web page 
document <- read_html("https://scrapeme.live/shop")

该read_html()函数使用作为参数传递的 URL 检索下载的 HTML，然后对其进行解析并将生成的数据结构分配给变量document。

第 3 步：识别并选择最重要的 HTML 元素

此 R 数据抓取教程的目标是提取所有产品数据，因此产品 HTML 节点是最重要的元素。要选择它们，请右键单击产品 HTML 元素并选择“检查”选项。这将启动以下 DevTools 弹出窗口：

从 HTML 代码中注意到li.productHTML 元素包括：

一个a存储产品 URL 的。
包含img产品图片的。
h2保留产品名称的A。
存储span产品价格的。

选择 rvest 中目标网页中包含的所有 HTML 元素：

# selecting the list of product HTML elements 
html_products <- document %>% html_elements("li.product")

这将使用R 管道运算 html_elements()符执行 rvest 函数。具体来说，返回应用CSS 选择器或XPath 表达式找到的 HTML 元素列表。document%>%html_elements()

给定一个 HTML 产品，选择所有四个感兴趣的 HTML 节点：

# selecting the "a" HTML element storing the product URL 
a_element <- html_product %>% html_element("a") 
# selecting the "img" HTML element storing the product image 
img_element <- html_product %>% html_element("img") 
# selecting the "h2" HTML element storing the product name 
h2_element <- html_product %>% html_element("h2") 
# selecting the "span" HTML element storing the product price 
span_element <- html_product %>% html_element("span")

第 4 步：从 HTML 元素中提取数据

R 在将元素附加到列表时效率低下，因此您应该避免迭代每个 HTML 产品。相反，使用 rvest 因为它允许您通过单个操作从 HTML 元素列表中提取数据：

# extracting data from the list of products and storing the scraped data into 4 lists 
product_urls <- html_products %>% 
    html_element("a") %>% 
    html_attr("href") 
product_images <- html_products %>% 
    html_element("img") %>% 
    html_attr("src") 
product_names <- html_products %>% 
    html_element("h2") %>% 
    html_text2() 
product_prices <- html_products %>% 
    html_element("span") %>% 
    html_text2()

rvest 将队列语句的最后一个函数应用于使用html_element()from选择的每个 HTML 元素html_products。html_attr()返回存储在单个属性中的字符串。类似地，html_text2()返回 HTML 元素中在浏览器中显示的文本。

所以product_urls、product_images和将为产品元素的 4 个属性中的每一个存储抓取的数据product_names。product_prices提取的数据存储在 4 个列表中，现在让我们将它们转换为更易于管理的dataframe：

# converting the lists containg the scraped data into a dataframe 
products <- data.frame( 
    product_urls, 
    product_images, 
    product_names, 
    product_prices 
)

请注意，Rdata.frame()函数将所有抓取的数据聚合到products变量中。

恭喜！您刚刚学习了如何使用 rvest 在 R 中抓取网页。继续并dataframe在下面的步骤中将其提取到 CSV 文件。

第 5 步：将抓取的数据导出为 CSV

在将变量转换products为 CSV 格式之前，使用更改其列名names()。它允许您更改与每个组件关联的名称dataframe，以便导出的 CSV 文件更易于阅读。

# changing the column names of the data frame before exporting it into CSV 
names(products) <- c("url", "image", "name", "price")

使用方法将对象导出dataframe到 CSV 文件write.csv()，这会指示您的 R 网络爬虫生成products.csv包含已抓取数据的文件。

# export the data frame containing the scraped data to a CSV file 
write.csv(products, file = "./products.csv", fileEncoding = "UTF-8")

如果您在 PyCharm 中运行脚本，您将products.csv在根文件夹中找到一个文件。它将包含您的输出：

第 6 步：将所有内容放在一起

这是最终的 R 数据抓取工具的样子：

library(rvest) 
 
httr::set_config(httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36")) 
 
# initializing the lists that will store the scraped data 
product_urls <- list() 
product_images <- list() 
product_names <- list() 
product_prices <- list() 
 
# initializing the list of pages to scrape with the first pagination links 
pages_to_scrape <- list("https://scrapeme.live/shop/page/1/") 
 
# initializing the list of pages discovered 
pages_discovered <- pages_to_scrape 
 
# current iteration 
i <- 1 
# max pages to scrape 
limit <- 1 
 
# until there is still a page to scrape 
while (length(pages_to_scrape) != 0 && i <= limit) { 
    # getting the current page to scrape 
    page_to_scrape <- pages_to_scrape[[1]] 
 
    # removing the page to scrape from the list 
    pages_to_scrape <- pages_to_scrape[-1] 
 
    # retrieving the current page to scrape 
    document <- read_html(page_to_scrape, 
        user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36") 
 
    # extracting the list of pagination links 
    new_pagination_links <- document %>% 
        html_elements("a.page-numbers") %>% 
        html_attr("href") 
 
    # iterating over the list of pagination links 
    for (new_pagination_link in new_pagination_links) { 
        # if the web page discovered is new and should be scraped 
        if (!(new_pagination_link %in% pages_discovered) && !(new_pagination_link %in% page_to_scrape)) { 
            pages_to_scrape <- append(new_pagination_link, pages_to_scrape) 
        } 
 
        # discovering new pages 
        pages_discovered <- append(new_pagination_link, pages_discovered) 
    } 
 
    # removing duplicates from pages_discovered 
    pages_discovered <- pages_discovered[!duplicated(pages_discovered)] 
 
    # selecting the list of product HTML elements 
    html_products <- document %>% html_elements("li.product") 
 
    # appending the new results to the lists of scraped data 
    product_urls <- c( 
        product_urls, 
        html_products %>% 
            html_element("a") %>% 
            html_attr("href") 
    ) 
    product_images <- c( 
        product_images, 
        html_products %>% 
            html_element("img") %>% 
            html_attr("src")) 
    product_names <- c( 
        product_names, 
        html_products %>% 
            html_element("h2") %>% 
            html_text2() 
    ) 
    product_prices <- c( 
        product_prices, 
        html_products %>% 
            html_element("span") %>% 
            html_text2() 
    ) 
 
    # incrementing the iteration counter 
    i <- i + 1 
} 
 
# converting the lists containg the scraped data into a data.frame 
products <- data.frame( 
    unlist(product_urls), 
    unlist(product_images), 
    unlist(product_names), 
    unlist(product_prices) 
) 
 
# changing the column names of the data frame before exporting it into CSV 
names(products) <- c("url", "image", "name", "price") 
 
# export the data frame containing the scraped data to a CSV file 
write.csv(products, file = "./products.csv", fileEncoding = "UTF-8", row.names = FALSE)

极好的！在几行代码中，您在 R 中构建了一个数据网络抓取工具！

R 中网页抓取的高级技术

您刚刚学习了 R 中网络抓取的基础知识。是时候深入研究更高级的技术了。

避免块

请注意，您的目标网站可能会实施多种反抓取技术。大多数网站禁止基于其 HTTP 标头的请求。例如，他们倾向于阻止没有有效User-Agent标头的 HTTP 请求。我们推荐ZenRows来解决所有这些挑战。

如官方文档中所述，rvesthttr在后台使用并User-Agent通过更改配置为 rvest设置一个全局httr变量，如下所示：

httr::set_config(httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"))

httr::set_config()此行为带有和的基于 URL 的请求设置 HTTP 用户代理httr::user_agent()。

请记住，这只是避免被反抓取解决方案阻止的基本方法。详细了解防刮技术。

R 中的网络爬行

目标网站由多个网页组成。因此，要抓取整个网站并检索所有数据，您必须提取所有分页链接的列表。这基本上就是网络爬虫的意义所在。

首先，右键单击页码 HTML 元素并选择“检查”。

devtools_inspect_scrapeme_live — 选择“检查”选项以打开 DevTools 窗口

您的浏览器将打开以下 DevTools 窗口：

选择页码 HTML 元素后的 DevTools 窗口

通过检查分页 HTML 元素，您会注意到它们都共享同一个page-numbers类。因此，使用以下方法检索所有分页链接html_elements()：

pagination_links <- document %>% 
    html_elements("a.page-numbers") %>% 
    html_attr("href")

目标是访问所有网页，与此同时，您可能希望以编程方式停止 R 爬虫。最好的方法是引入一个limit变量：

library(rvest) 
 
httr::set_config(httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36")) 
 
# initializing the lists that will store the scraped data 
product_urls <- list() 
product_images <- list() 
product_names <- list() 
product_prices <- list() 
 
# initializing the list of pages to scrape with the first pagination links 
pages_to_scrape <- list("https://scrapeme.live/shop/page/1/") 
 
# initializing the list of pages discovered 
pages_discovered <- pages_to_scrape 
 
# current iteration 
i <- 1 
# max pages to scrape 
limit <- 5 
 
# until there is still a page to scrape 
while (length(pages_to_scrape) != 0 && i <= limit) { 
    # getting the current page to scrape 
    page_to_scrape <- pages_to_scrape[[1]] 
 
    # removing the page to scrape from the list 
    pages_to_scrape <- pages_to_scrape[-1] 
 
    # retrieving the current page to scrape 
    document <- read_html(page_to_scrape) 
 
    # extracting the list of pagination links 
    new_pagination_links <- document %>% 
        html_elements("a.page-numbers") %>% 
        html_attr("href") 
 
    # iterating over the list of pagination links 
    for (new_pagination_link in new_pagination_links) { 
        # if the web page discovered is new and should be scraped 
        if (!(new_pagination_link %in% pages_discovered) && !(new_pagination_link %in% page_to_scrape)) { 
            pages_to_scrape <- append(new_pagination_link, pages_to_scrape) 
        } 
 
        # discovering new pages 
        pages_discovered <- append(new_pagination_link, pages_discovered) 
    } 
 
    # removing duplicates from pages_discovered 
    pages_discovered <- pages_discovered[!duplicated(pages_discovered)] 
 
    # scraping logic... 
 
    # incrementing the iteration counter 
    i <- i + 1 
} 
 
# export logic...

R 网页抓取脚本进程抓取网页，寻找新的分页链接并填充抓取队列。目标网站有 48 个页面，所以分配 48 来limit抓取整个网站。在循环结束时while，pages_discovered将存储所有 48 个分页 URL。

在 PyCharm 中运行 R 网络抓取工具，它将生成一个products.csv文件。这将包含所有 755 种口袋妖怪风格的产品。

恭喜！你刚刚使用 R 爬取了 ScrapeMe 上的所有产品数据！

R 中的并行 Web 抓取

如果您的目标站点有很多页面或服务器响应缓慢，则 R 中的数据抓取可能需要很长时间。幸运的是，R 支持并行化！让我们学习如何使用 R 进行并行网络抓取。

如果您对此不熟悉，并行抓取就是同时抓取多个页面。这意味着同时从多个网页下载、解析和提取数据，从而加快抓取过程。

首先，parallel在您的 R 数据抓取工具中导入包：

library(parallel)

现在，使用此 R 库公开的实用函数来执行并行计算。

现在，您有了要抓取的网页列表：

pages_to_scrape <- list( 
    "https://scrapeme.live/shop/page/1/", 
    "https://scrapeme.live/shop/page/2/", 
    "https://scrapeme.live/shop/page/3/", 
    # ... 
    "https://scrapeme.live/shop/page/48/" 
)

使用 R 构建一个并行网络抓取器，如下所示：

scrape_page <- function(page_url) { 
    # loading the rvest library 
    library(rvest) 
 
    # setting the user agent 
    httr::set_config(httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36")) 
 
    # retrieving the current page to scrape 
    document <- read_html(page_url) 
 
    html_products <- document %>% html_elements("li.product") 
 
    product_urls <- html_products %>% 
        html_element("a") %>% 
        html_attr("href") 
 
    product_images <- html_products %>% 
        html_element("img") %>% 
        html_attr("src") 
    product_names <- 
        html_products %>% 
            html_element("h2") %>% 
            html_text2() 
 
    product_prices <- 
        html_products %>% 
            html_element("span") %>% 
            html_text2() 
 
    products <- data.frame( 
        unlist(product_urls), 
        unlist(product_images), 
        unlist(product_names), 
        unlist(product_prices) 
    ) 
 
    names(products) <- c("url", "image", "name", "price") 
 
    return(products) 
} 
 
# automatically detecting the number of cores 
num_cores <- detectCores() 
 
# creting a parallel cluster 
cluster <- makeCluster(num_cores) 
 
# execute scrape_page() on each element of pages_to_scrape 
# in parallel 
scraped_data_list <- parLapply(cluster, pages_to_scrape, scrape_page) 
 
# merging the list of dataframes into 
# a single dataframe 
products <- do.call("rbind", scraped_data_list)

这个 R 爬虫脚本定义了一个scrape_page()函数，负责抓取单个网页。然后，它初始化一个 R 集群，以makeCluster()创建一组并行运行并通过套接字通信的 R 实例副本。

用于parLapply()将scrape_page()函数应用于网页列表。注意parLapply()是并行版本apply()。scrape_page()R 将在新实例中运行每个函数。

所以每个并行实例只能访问内部定义的内容scrape_page()，因此，第一行scrape_page()是这样的：

# loading the rvest library 
library(rvest)

如果您在 R 数据抓取脚本中实施此并行化逻辑，您将获得显着的性能提升。另一方面，R web scraper 将需要更多的内存资源。

完美的！您现在知道如何使用 R 进行并行网络抓取了！但是教程还没有完成！

R语言抓取动态内容网站

在静态内容网站中，服务器返回的 HTML 文档已经包含了其源代码中的所有内容。这意味着您无需在 Web 浏览器中呈现文档即可从中检索数据。

但是网页不仅仅是它们的 HTML 源代码。当在浏览器中呈现时，网页可以通过AJAX执行 HTTP 请求并运行 JavaScript，因此网页动态地从 Web 检索数据并相应地更改 DOM。

大多数网站依赖 API 调用来执行前端操作或异步检索数据。换句话说，HTML 源代码不包含这些有价值的数据。抓取此数据的唯一方法是在无头浏览器中呈现目标网页。

无头浏览器允许您在没有 GUI 的浏览器中加载网页。因此，它使您能够指示浏览器执行操作并复制用户交互。现在让我们看看如何在 R 中使用无头浏览器进行网页抓取。

使用 R 中的无头浏览器进行 Web 抓取

使用无头浏览器，您可以构建一个 R 网络抓取工具，它能够像人类用户一样通过 JavaScript 与网站进行交互。R 中最流行的无头浏览器库是RSelenium，我们将在本教程中使用它。

通过在 R 控制台中启动以下命令来安装 RSelenium：

install.packages("RSelenium");

安装后，运行以下代码从 ScrapeMe 中提取数据：

# load the Chrome driver 
# TODO: change the chromever parameter with your Chrome version 
driver <- rsDriver( 
    browser = c("chrome"), 
    chromever = "108.0.5359.22", 
    verbose = F, 
    # enabling the Chrome --headless mode 
    extraCapabilities = list("chromeOptions" = list(args = list("--headless"))) 
) 
web_driver <- driver[["client"]] 
 
# navigatign to the target web page in the headless Chrome instance 
web_driver$navigate("https://scrapeme.live/shop/") 
 
# initializing the lists that will contain the scraped data 
product_urls <- list() 
product_images <- list() 
product_names <- list() 
product_prices <- list() 
 
# retrieving all produt HTML elements 
html_products <- web_driver$findElements(using = "css selector", value = "li.product") 
 
# iterating over the product list 
for (html_product in html_products) { 
    # scraping the data of interest from each product while populating the 4 scraping data lists 
    product_urls <- append( 
        product_urls, 
        html_product$findChildElement(using = "css selector", value = "a")$getElementAttribute("href")[[1]] 
    ) 
 
    product_images <- append( 
        product_images, 
        html_product$findChildElement(using = "css selector", value = "img")$getElementAttribute("src")[[1]] 
    ) 
 
    product_names <- append( 
        product_names, 
        html_product$findChildElement(using = "css selector", value = "h2")$getElementText()[[1]] 
    ) 
 
    product_prices <- append( 
        product_prices, 
        html_product$findChildElement(using = "css selector", value = "span")$getElementText()[[1]] 
    ) 
} 
 
# converting the lists containg the scraped data into a data.frame 
products <- data.frame( 
    unlist(product_urls), 
    unlist(product_images), 
    unlist(product_names), 
    unlist(product_prices) 
) 
 
# changing the column names of the data frame before exporting it into CSV 
names(products) <- c("url", "image", "name", "price") 
 
# export the data frame containing the scraped data to a CSV file 
write.csv(products, file = "./products.csv", fileEncoding = "UTF-8", row.names = FALSE)

请注意，该findElements()方法允许您使用 RSelenium 选择 HTML 元素。然后，调用findChildElements()以从 RSelenium HTML 元素中选择一个子元素。最后，使用getElementAttribute()和之类的方法getElementText()从选定的 HTML 元素中提取数据。

使用 rvest 和 RSelenium 在 R 中进行网页抓取的方法非常相似。主要区别在于 RSelenium 在浏览器中运行抓取过程，让您可以访问所有浏览器功能。

所以如果你想用 rvest 从一个新页面抓取数据，你需要先执行一个新的 HTTP 请求。相反，RSelenium 允许您在浏览器中直接导航到新网页，如下所示：

pagination_element <- web_driver$findElement(using = "css selector", value = "a.page-numbers") 
 
# navigating to a new web page 
paginationElement.click(); 
 
# waiting for the page to load... 
 
print(web_driver$getTitle()); # prints "Products – Page 2 – ScrapeMe"

RSelenium 可以轻松地抓取网页而不会被阻止，因为它具有像普通人一样抓取网页的能力。这使得它成为 R 中数据抓取的最佳选择之一。

R 的其他 Web 抓取库

R 中用于网络抓取的其他有用的库是：
ZenRows：一个网络抓取 API，可以为您绕过所有反机器人或反抓取系统，提供旋转代理、无头浏览器、验证码绕过等。
RCrawler：用于网络爬虫网站和执行网络抓取的 R 包。它提供了多种功能来从网页中提取结构化数据。
xmlTreeParse：一个用于解析 XML/HTML 文件或字符串的 R 库。它生成一个表示 XML/HTML 树的 R 结构，并允许您从中选择元素。

常见问题

你如何从 R 中的网站抓取数据？

您可以像使用任何其他编程语言一样使用 R 抓取网站。首先，您需要一个网络抓取 R 库。然后您必须使用它连接到您的目标网站，提取 HTML 元素并从中检索数据。

R 中的 Rvest 包是什么？

rvest 是最流行的网络抓取 R 库之一，提供多种功能使 R 网络抓取更容易。rvest 包装了xml2和httr包，允许您下载 HTML 文档并从中提取数据。

Scraping

7个常见的网页抓取用例

By姚伟斌 December 4, 2023August 9, 2023

网络抓取如何帮助您的业务发展？从市场研究到机器学习培训，提取知识可以帮助和指导任何行业领域的任何数据驱动决策。您可以通过采用这些用例之一并手动跟踪它来轻松演示它，看看它是否有效。之后，剩下的问题将是如何自动执行此操作。 1. 房地产您还在每天查看您所在地区新发布的房屋吗？或者正在寻找便宜货？通过跟踪房地产网站，您可以及时获取所有这些精选信息，而无需每天手动搜索。此外，您可以通过存储这些信息来跟踪每个功能或社区的价格历史记录，从而为您提供宝贵的见解。但也没有必要就此止步。将该历史记录与所有新房产进行比较，您可以发现最具成本效益的房产。或者检查某些竞争对手在特定区域的售价是否更便宜。我们使用ZenRows Task创建了一个房地产数据集。填写以下表格，我们将免费发送给您。您可以按照我们的创建任务指南轻松获得自定义数据集。 2. 训练机器学习模型通过抓取与主题相关的网站来收集大量数据（文本或图像）。这些信息可能来自科学论文、报纸或社交媒体，只要能满足您的需求即可。如果您的模型包含动物图像识别，您可能会对获取大量图片感兴趣。你可以简单地通过搜索谷歌图片来做到这一点，但你需要更大的规模，这可以通过网站抓取来实现。最好的是什么：为什么不为监督学习标记图片呢？图像通常带有标签或标题，其中包含提及动物的描述性文字。您可以将这些结果扩展到来自许多不同来源的数千张标记图像。但优势还可以更进一步：通过反复进行数据提取来获得连续的知识流。比如说，每周访问几本自然杂志，提取所有这些图片并将它们添加到您的收藏中。 3、品牌美誉度与上一点相关，您可以监控您的品牌或竞争对手，并使用情绪分析来了解市场对您或他们的评价。在内部，这可能会让您收到无法到达客户支持的投诉。许多人在…

如何用R语言进行数据爬取：网页数据爬取

R 适合网络抓取吗？

设置环境

在 PyCharm 中设置 R 项目

如何使用R语言抓取网站

第 1 步：安装 rvest

第 2 步：检索 HTML 页面

第 3 步：识别并选择最重要的 HTML 元素

第 4 步：从 HTML 元素中提取数据

第 5 步：将抓取的数据导出为 CSV

第 6 步：将所有内容放在一起

R 中网页抓取的高级技术

避免块

R 中的网络爬行

R 中的并行 Web 抓取

R语言抓取动态内容网站

使用 R 中的无头浏览器进行 Web 抓取

R 的其他 Web 抓取库

常见问题

Related

如何避免Puppeteer被检测

静态代理与旋转代理的区别

5种用于网络抓取的常见HTTP标头

如何收集数据以绘制房价图

如何用XPath抓取网页数据

7个常见的网页抓取用例

R 适合网络抓取吗？

设置环境

在 PyCharm 中设置 R 项目

如何使用R语言抓取网站

第 1 步：安装 rvest

第 2 步：检索 HTML 页面

第 3 步：识别并选择最重要的 HTML 元素

第 4 步：从 HTML 元素中提取数据

第 5 步：将抓取的数据导出为 CSV

第 6 步：将所有内容放在一起

R 中网页抓取的高级技术

避免块

R 中的网络爬行

R 中的并行 Web 抓取

R语言抓取动态内容网站

使用 R 中的无头浏览器进行 Web 抓取

R 的其他 Web 抓取库

常见问题

Related

Similar Posts