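How to Scrape Web Pages with Go

Web scraping in Golang is a popular way to retrieve data from the web automatically. Follow this step-by-step tutorial to learn how to scrape data in Go and get to know the popular libraries Colly and chromedp.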

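Prerequisites

Set Up the Environment

Here are the prerequisites for this tutorial:

Go 1.19+: Any Go version greater than or equal to 1.19 will do. You will see version 1.19 here because it was the latest release at the time of writing.
Go IDE: Visual Studio Code with the Go extension is recommended.

Before proceeding with this web scraping guide, make sure you have the necessary tools installed. Download, install, and set up each of them by following its installation wizard.

Set Up a Go Project

After installing Go, it is time to initialize your Golang web scraper project.

Create a web-scraper-go folder and enter it in the terminal: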

mkdir web-scraper-go 
cd web-scraper-go

Then, launch the following command:

go mod init web-scraper

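The init command initializes a Go module named web-scraper inside the web-scraper-go project folder.

web-scraper-go will now contain a go.mod file that looks as follows:

module web-scraper 
go 1.19

Note that the last line changes depending on your Go version.

You are now ready to set up your web scraping Go script. Create a scraper.go file and initialize it as follows: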

package main 
 
import ( 
    "fmt" 
) 
 
func main() { 
    // scraping logic... 
 
    fmt.Println("Hello, World!") 
}

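The first line contains the name of the package. Then come the imports, followed by the main() function. That function is the entry point of any Go program and will contain the Golang web scraping logic.

Run the script to verify that everything works as expected: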

go run scraper.go

That will print:

Hello, World!

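Now that you have set up a basic Go project, let's dig deeper into how to build a data scraper with Golang.

How to Scrape a Website in Go

To learn how to scrape a website in Go, use ScrapeMe as the target site.

ScrapeMe is a Pokémon store. Our mission is to extract all the product data from it.

Step 1: Get Started with Colly

Colly is an open-source library that provides a clean, callback-based interface for writing scrapers, crawlers, and spiders. It comes with a high-level Go web scraping API that lets you download an HTML page, parse its content automatically, select HTML elements from the DOM, and retrieve data from them.

Install Colly and its dependencies: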

go get github.com/gocolly/colly

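This command creates a go.sum file in your project root and updates the go.mod file with all the required dependencies.

Wait for the installation to finish. Then, import Colly in your scraper.go file as follows: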

package main 
 
import ( 
    "fmt" 
 
    // importing Colly 
    "github.com/gocolly/colly" 
) 
 
func main() { 
    // scraping logic... 
 
    fmt.Println("Hello, World!") 
}

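Before you start scraping with this library, you need to understand a few key concepts.

First, Colly's main entity is the Collector. A Collector lets you perform HTTP requests. It also gives you access to the web scraping callbacks offered by the Colly interface.

Initialize a Colly Collector with the NewCollector function: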

c := colly.NewCollector()

Use Colly to visit a web page with Visit():

c.Visit("https://en.wikipedia.org/wiki/Main_Page")

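Attach different types of callback functions to a Collector as follows (the OnError callback uses the standard log package, so import it too):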

c.OnRequest(func(r *colly.Request) { 
    fmt.Println("Visiting: ", r.URL) 
}) 
 
c.OnError(func(_ *colly.Response, err error) { 
    log.Println("Something went wrong: ", err) 
}) 
 
c.OnResponse(func(r *colly.Response) { 
    fmt.Println("Page visited: ", r.Request.URL) 
}) 
 
c.OnHTML("a", func(e *colly.HTMLElement) { 
    // printing the URL of each a link in the page 
    fmt.Println(e.Attr("href")) 
}) 
 
c.OnScraped(func(r *colly.Response) { 
    fmt.Println(r.Request.URL, " scraped!") 
})

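These functions are executed in the following order:

OnRequest(): Called before performing an HTTP request with Visit().
OnError(): Called if an error occurs during the HTTP request.
OnResponse(): Called after receiving a response from the server.
OnHTML(): Called right after OnResponse() if the received content is HTML.
OnScraped(): Called after all OnHTML() callbacks have been executed.

Each of these functions accepts a callback as a parameter. Colly executes that callback when the event associated with the function is raised. Together, these five Colly functions help you build your Golang data scraper.

Step 2: Visit the Target HTML Page

Perform an HTTP GET request to download the target HTML page with Colly: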

// downloading the target HTML page 
c.Visit("https://scrapeme.live/shop/")

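The Visit() function starts Colly's lifecycle by firing the OnRequest event. The other events follow.

Step 3: Find the HTML Elements of Interest

This data scraping Go tutorial is about retrieving all product data, so let's grab the HTML product elements. Right-click on a product element on the page and select the "Inspect" option to open the DevTools section.

Here, note that each target li HTML element has the .product class and contains:

An a element with the product URL.
An img element with the product image.
An h2 element with the product name.
A .price element with the product price.

Select all li.product HTML elements on the page with Colly: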

c.OnHTML("li.product", func(e *colly.HTMLElement) { 
    // ... 
})

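The OnHTML() function takes a CSS selector and a callback function. Colly executes the callback each time it finds an HTML element matching the selector. Note that the e parameter of the callback represents a single li.product HTMLElement.

Let's now see how to extract data from an HTML element with the functions exposed by Colly.

Step 4: Scrape the Product Data from the Selected HTML Elements

Before getting started, you need a data structure to store the scraped data. Define a PokemonProduct struct as follows: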

// defining a data structure to store the scraped data 
type PokemonProduct struct { 
    url, image, name, price string 
}

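If you are not familiar with them, a Go struct is a collection of typed fields that you can instantiate to collect data.

Then, initialize a slice of PokemonProduct that will contain the scraped data: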

// initializing the slice of structs that will contain the scraped data 
var pokemonProducts []PokemonProduct

In Go, slices provide an efficient way to work with sequences of typed data. You can think of them as a kind of list.

Now, implement the scraping logic:

// iterating over the list of HTML product elements 
c.OnHTML("li.product", func(e *colly.HTMLElement) { 
    // initializing a new PokemonProduct instance 
    pokemonProduct := PokemonProduct{} 
 
    // scraping the data of interest 
    pokemonProduct.url = e.ChildAttr("a", "href") 
    pokemonProduct.image = e.ChildAttr("img", "src") 
    pokemonProduct.name = e.ChildText("h2") 
    pokemonProduct.price = e.ChildText(".price") 
 
    // adding the product instance with scraped data to the list of products 
    pokemonProducts = append(pokemonProducts, pokemonProduct) 
})

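The HTMLElement type exposes the ChildAttr() and ChildText() methods. These let you extract an attribute value and the text, respectively, from a child element identified by a CSS selector. With two simple functions, you implemented the entire data extraction logic.

Finally, you add each new element to the slice of scraped elements with append(). Note that append() returns the updated slice, which is why its result is assigned back to pokemonProducts.

Great! You just learned how to scrape a web page in Go using Colly.

The next step is to export the retrieved data to CSV.

Step 5: Convert the Scraped Data to CSV

Export the scraped data to a CSV file in Go with the following logic: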

// opening the CSV file 
file, err := os.Create("products.csv") 
if err != nil { 
    log.Fatalln("Failed to create output CSV file", err) 
} 
defer file.Close() 
 
// initializing a file writer 
writer := csv.NewWriter(file) 
 
// defining the CSV headers 
headers := []string{ 
    "url", 
    "image", 
    "name", 
    "price", 
} 
// writing the column headers 
writer.Write(headers) 
 
// adding each Pokemon product to the CSV output file 
for _, pokemonProduct := range pokemonProducts { 
    // converting a PokemonProduct to an array of strings 
    record := []string{ 
        pokemonProduct.url, 
        pokemonProduct.image, 
        pokemonProduct.name, 
        pokemonProduct.price, 
    } 
 
    // writing a new CSV record 
    writer.Write(record) 
} 
defer writer.Flush()

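This snippet creates a products.csv file and initializes it with the header row. Then, it iterates over the slice of scraped PokemonProduct values, converts each of them to a new CSV record, and appends it to the CSV file.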
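
The snippet above ignores the errors returned by writer.Write() and never checks whether the flush succeeded. As a minimal sketch of a stricter variant, the helper below writes to any io.Writer and returns the first error it runs into. The names writeProductsCSV and csvString are introduced here for illustration only, and the sample product is made up:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"strings"
)

type PokemonProduct struct {
	url, image, name, price string
}

// writeProductsCSV writes a header row plus one row per product to w and
// returns the first error reported by the CSV writer.
func writeProductsCSV(w io.Writer, products []PokemonProduct) error {
	writer := csv.NewWriter(w)

	if err := writer.Write([]string{"url", "image", "name", "price"}); err != nil {
		return err
	}
	for _, p := range products {
		if err := writer.Write([]string{p.url, p.image, p.name, p.price}); err != nil {
			return err
		}
	}

	// Flush pushes the buffered rows to w; Error reports any failure that
	// happened during a previous Write or Flush.
	writer.Flush()
	return writer.Error()
}

// csvString renders the products to an in-memory string (demo helper).
func csvString(products []PokemonProduct) (string, error) {
	var sb strings.Builder
	if err := writeProductsCSV(&sb, products); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Hypothetical sample data, not scraped from the real site.
	products := []PokemonProduct{
		{url: "https://example.com/shop/bulbasaur/", image: "https://example.com/bulbasaur.png", name: "Bulbasaur", price: "63.00"},
	}

	out, err := csvString(products)
	if err != nil {
		log.Fatalln("Failed to write CSV:", err)
	}
	fmt.Print(out)
}
```

Because the helper accepts an io.Writer, you could pass it the file returned by os.Create("products.csv") instead of the in-memory builder used in this demo.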

To make this snippet work, make sure you have the following imports:

import ( 
    "encoding/csv" 
    "log" 
    "os" 
    // ... 
)

So, this is what the scraping script looks like now:

package main 
 
import ( 
    "encoding/csv" 
    "github.com/gocolly/colly" 
    "log" 
    "os" 
) 
 
// initializing a data structure to keep the scraped data 
type PokemonProduct struct { 
    url, image, name, price string 
} 
 
func main() { 
    // initializing the slice of structs to store the data to scrape 
    var pokemonProducts []PokemonProduct 
 
    // creating a new Colly instance 
    c := colly.NewCollector() 
 
    // scraping logic 
    c.OnHTML("li.product", func(e *colly.HTMLElement) { 
        pokemonProduct := PokemonProduct{} 
 
        pokemonProduct.url = e.ChildAttr("a", "href") 
        pokemonProduct.image = e.ChildAttr("img", "src") 
        pokemonProduct.name = e.ChildText("h2") 
        pokemonProduct.price = e.ChildText(".price") 
 
        pokemonProducts = append(pokemonProducts, pokemonProduct) 
    }) 
 
    // visiting the target page (callbacks must be registered before this call) 
    c.Visit("https://scrapeme.live/shop/") 
 
    // opening the CSV file 
    file, err := os.Create("products.csv") 
    if err != nil { 
        log.Fatalln("Failed to create output CSV file", err) 
    } 
    defer file.Close() 
 
    // initializing a file writer 
    writer := csv.NewWriter(file) 
 
    // writing the CSV headers 
    headers := []string{ 
        "url", 
        "image", 
        "name", 
        "price", 
    } 
    writer.Write(headers) 
 
    // writing each Pokemon product as a CSV row 
    for _, pokemonProduct := range pokemonProducts { 
        // converting a PokemonProduct to an array of strings 
        record := []string{ 
            pokemonProduct.url, 
            pokemonProduct.image, 
            pokemonProduct.name, 
            pokemonProduct.price, 
        } 
 
        // adding a CSV record to the output file 
        writer.Write(record) 
    } 
    defer writer.Flush() 
}

Run your Go data scraper with:

go run scraper.go

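Then, you will find a products.csv file in the root directory of your project. Open it, and it should contain the scraped product data.

Advanced Techniques in Web Scraping with Golang

Now that you know the basics of web scraping with Go, it is time to dig into some more advanced approaches.

Web Crawling with Go

Note that the list of Pokémon products to scrape is paginated. As a result, the target website consists of many web pages. If you want to extract all the product data, you need to visit the whole website.

To perform web crawling in Go and scrape the entire website, you first need all the pagination links. So, right-click on any pagination number HTML element and click the "Inspect" option.

Your browser will open the DevTools section with the selected HTML element highlighted.

If you look at all the pagination HTML elements, you will see that they are all identified by the .page-numbers CSS selector. Use this information to implement crawling in Go as follows: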

// initializing the list of pages to scrape with an empty slice 
var pagesToScrape []string 
 
// the first pagination URL to scrape 
pageToScrape := "https://scrapeme.live/shop/page/1/" 
 
// initializing the list of pages discovered with a pageToScrape 
pagesDiscovered := []string{ pageToScrape } 
 
// current iteration 
i := 1 
// max pages to scrape 
limit := 5 
 
// initializing a Colly instance 
c := colly.NewCollector() 
 
// iterating over the list of pagination links to implement the crawling logic 
c.OnHTML("a.page-numbers", func(e *colly.HTMLElement) { 
    // discovering a pagination link 
    newPaginationLink := e.Attr("href") 
 
    // if the link has not been discovered before, mark it as discovered 
    // and add it to the queue of pages to scrape 
    if !contains(pagesDiscovered, newPaginationLink) { 
        pagesDiscovered = append(pagesDiscovered, newPaginationLink) 
        pagesToScrape = append(pagesToScrape, newPaginationLink) 
    } 
}) 
 
c.OnHTML("li.product", func(e *colly.HTMLElement) { 
    // scraping logic... 
}) 
 
c.OnScraped(func(response *colly.Response) { 
    // as long as there is still a page to scrape and the limit has not been reached 
    if len(pagesToScrape) != 0 && i < limit { 
        // getting the current page to scrape and removing it from the list 
        pageToScrape = pagesToScrape[0] 
        pagesToScrape = pagesToScrape[1:] 
 
        // incrementing the iteration counter 
        i++ 
 
        // visiting a new page 
        c.Visit(pageToScrape) 
    } 
}) 
 
// visiting the first page 
c.Visit(pageToScrape) 
 
// convert the data to CSV...

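Since you may want to stop your Go data scraper programmatically, you need a limit variable. It represents the maximum number of pages the Golang web spider can visit.

In the last line, the snippet visits the first pagination page. Then, the OnHTML event is fired. In the OnHTML() callback, the Go web crawler looks for new pagination links. If it finds a new link, it adds it to the crawling queue. Then, it repeats this logic with the next link. Finally, it stops when limit is reached or there are no new pages left to crawl.

The crawling logic above would not be possible without the pagesDiscovered and pagesToScrape slice variables. They help you keep track of which pages the Go crawler has already discovered and which ones it will visit next.

Note that contains() is a custom Go utility function defined as follows: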

func contains(s []string, str string) bool { 
    for _, v := range s { 
        if v == str { 
            return true 
        } 
    } 
 
    return false 
}

Its purpose is simply to check whether a string is present in a slice.
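
As the number of pages grows, scanning a slice with contains() on every link gets slower, because each lookup is linear. A map used as a set is a common alternative. Here is a self-contained sketch of the same queue-plus-discovered-set logic, run against a hypothetical in-memory link graph instead of a live site. The crawl function and the sample links are assumptions made for illustration:

```go
package main

import "fmt"

// crawl walks a link graph starting at start and visits at most limit pages.
// links maps each page URL to the pagination links found on that page; in a
// real scraper those links would come from the OnHTML callback.
func crawl(links map[string][]string, start string, limit int) []string {
	// set of every URL seen so far
	discovered := map[string]struct{}{start: {}}
	// URLs waiting to be visited, in discovery order
	queue := []string{start}
	var visited []string

	for len(queue) > 0 && len(visited) < limit {
		// take the next page off the front of the queue
		page := queue[0]
		queue = queue[1:]
		visited = append(visited, page)

		for _, link := range links[page] {
			if _, seen := discovered[link]; seen {
				continue
			}
			discovered[link] = struct{}{}
			queue = append(queue, link)
		}
	}
	return visited
}

func main() {
	// Hypothetical link graph standing in for a paginated site.
	links := map[string][]string{
		"/page/1/": {"/page/2/", "/page/3/"},
		"/page/2/": {"/page/1/", "/page/3/", "/page/4/"},
		"/page/3/": {"/page/2/", "/page/4/"},
		"/page/4/": {"/page/3/"},
	}

	for _, page := range crawl(links, "/page/1/", 3) {
		fmt.Println(page)
	}
}
```

With a limit of 3, this prints /page/1/, /page/2/, and /page/3/, in that order.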

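Well done! Now you can crawl the paginated ScrapeMe website with Golang!

Avoid Being Blocked

Many websites implement anti-scraping and anti-bot techniques. The most basic approach involves banning HTTP requests based on their headers. Specifically, these systems generally block HTTP requests that come with an invalid User-Agent header.

Set a global User-Agent header for all the requests performed by Colly through the UserAgent field of the Collector, as follows: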

// setting a valid User-Agent header 
c.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"

Keep in mind that this is just one of the many anti-scraping techniques you may have to deal with.
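
To see why the header matters, here is a toy, standard-library-only sketch of the kind of check a server might run. The handler blockMissingUserAgent and the helper statusFor are hypothetical names, and real anti-bot systems are far more elaborate than this missing-header test:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// blockMissingUserAgent is a simplified stand-in for a header-based anti-bot
// check: it rejects any request that arrives without a User-Agent header.
func blockMissingUserAgent(w http.ResponseWriter, r *http.Request) {
	if r.UserAgent() == "" {
		http.Error(w, "forbidden", http.StatusForbidden)
		return
	}
	fmt.Fprintln(w, "ok")
}

// statusFor builds an in-memory request with the given User-Agent, passes it
// to the handler, and returns the status code the handler responded with.
func statusFor(userAgent string) int {
	req := httptest.NewRequest(http.MethodGet, "/shop/", nil)
	if userAgent != "" {
		req.Header.Set("User-Agent", userAgent)
	}
	rec := httptest.NewRecorder()
	blockMissingUserAgent(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(""))
	fmt.Println(statusFor("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
}
```

Running it prints 403 for the request without a User-Agent and 200 for the one with a browser-like value.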

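Use ZenRows to get around these challenges more easily while web scraping.

Parallel Web Scraping in Golang

Data scraping in Go can take a lot of time. The reason could be a slow internet connection, an overloaded web server, or simply a large number of pages to scrape. That is why Colly supports parallel scraping! If you do not know what that means, parallel web scraping in Go involves extracting data from multiple pages at the same time.

In detail, this is the list of all the pagination pages you want your crawler to visit: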

pagesToScrape := []string{ 
    "https://scrapeme.live/shop/page/1/", 
    "https://scrapeme.live/shop/page/2/", 
    // ... 
    "https://scrapeme.live/shop/page/47/", 
    "https://scrapeme.live/shop/page/48/", 
}
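
Rather than typing out all 48 URLs by hand, you could build the list in a loop. This is a small sketch, assuming the pagination URLs follow the /shop/page/N/ pattern shown above. The helper name buildPageURLs is introduced here for illustration:

```go
package main

import "fmt"

// buildPageURLs returns the pagination URLs for pages 1 through lastPage.
// It returns nil when lastPage is smaller than 1.
func buildPageURLs(lastPage int) []string {
	if lastPage < 1 {
		return nil
	}

	urls := make([]string, 0, lastPage)
	for page := 1; page <= lastPage; page++ {
		urls = append(urls, fmt.Sprintf("https://scrapeme.live/shop/page/%d/", page))
	}
	return urls
}

func main() {
	pagesToScrape := buildPageURLs(48)

	fmt.Println(len(pagesToScrape))
	fmt.Println(pagesToScrape[0])
	fmt.Println(pagesToScrape[len(pagesToScrape)-1])
}
```

This prints 48, followed by the URLs of the first and the last pagination page.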

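With parallel scraping, your Go data spider can visit several web pages and extract data from them at the same time. That makes your scraping process faster!

Implement a parallel web spider with Colly: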

c := colly.NewCollector( 
    // turning on the asynchronous request mode in Colly 
    colly.Async(true), 
) 
 
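c.Limit(&colly.LimitRule{ 
    // matching every domain (a rule needs a domain pattern, otherwise Limit() returns an error) 
    DomainGlob: "*", 
    // limiting the parallel requests to 4 at a time 
    Parallelism: 4, 
}) 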
 
c.OnHTML("li.product", func(e *colly.HTMLElement) { 
    // scraping logic... 
}) 
 
// registering all pages to scrape 
for _, pageToScrape := range pagesToScrape { 
    c.Visit(pageToScrape) 
} 
 
// waiting for Colly to visit all pages 
c.Wait() 
 
// export logic...

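Colly comes with an async mode. When enabled, it lets Colly visit several pages at the same time. Specifically, Colly visits as many pages simultaneously as the value of the Parallelism parameter.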
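
To make the idea of a parallelism cap concrete, here is a standard-library sketch that runs tasks on a bounded number of goroutines, using a buffered channel as a semaphore. It only illustrates the concept and is not Colly's actual implementation. The function runLimited and its parameters are names made up for this example, and the sleep stands in for an HTTP request:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runLimited calls work once per task, keeping at most parallelism calls in
// flight at any time, and returns the peak number of concurrent calls seen.
func runLimited(tasks, parallelism int, work func(task int)) int {
	if parallelism < 1 {
		parallelism = 1
	}

	// a buffered channel with one slot per allowed concurrent task
	sem := make(chan struct{}, parallelism)
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		inFlight int
		peak     int
	)

	for task := 0; task < tasks; task++ {
		sem <- struct{}{} // acquire a slot; blocks while all slots are taken
		wg.Add(1)
		go func(task int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			work(task)

			mu.Lock()
			inFlight--
			mu.Unlock()
		}(task)
	}

	wg.Wait()
	return peak
}

func main() {
	peak := runLimited(12, 4, func(task int) {
		time.Sleep(10 * time.Millisecond) // simulated page download
	})
	fmt.Println("peak concurrency:", peak)
}
```

The reported peak never exceeds the parallelism value passed in, which is the kind of bound the Parallelism setting is meant to place on page visits.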

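By enabling parallel mode in your Golang web scraping script, you get better performance. At the same time, you may need to change some of your code logic. That is because most data structures in Go are not thread-safe, so your script may run into race conditions.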
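
For instance, appending to a shared slice from callbacks that run at the same time is a data race. One common fix is to guard the slice with a sync.Mutex. The sketch below simulates concurrent callbacks with plain goroutines. The names productStore and collectConcurrently are hypothetical, and in a Colly scraper you would call the add method from inside the OnHTML callback instead:

```go
package main

import (
	"fmt"
	"sync"
)

type PokemonProduct struct {
	url, image, name, price string
}

// productStore guards a slice of products with a mutex so that several
// goroutines can append to it safely.
type productStore struct {
	mu       sync.Mutex
	products []PokemonProduct
}

// add appends a product while holding the lock.
func (s *productStore) add(p PokemonProduct) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.products = append(s.products, p)
}

// count returns the number of stored products while holding the lock.
func (s *productStore) count() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.products)
}

// collectConcurrently simulates n scraping callbacks firing in parallel and
// returns how many products ended up in the store.
func collectConcurrently(n int) int {
	var store productStore
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			store.add(PokemonProduct{name: fmt.Sprintf("product-%d", i)})
		}(i)
	}

	wg.Wait()
	return store.count()
}

func main() {
	fmt.Println(collectConcurrently(100))
}
```

This prints 100, because every append happens under the lock and none is lost.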

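Great! You just learned the basics of parallel web scraping!

Scraping Dynamic-Content Websites with a Headless Browser in Go

A static-content website has all its content pre-loaded in the HTML pages returned by the server. That means you can scrape data from a static-content website simply by parsing its HTML content.

On the other hand, other websites rely on JavaScript for page rendering or use it to perform API calls and retrieve data asynchronously. These websites are called dynamic-content websites and require a browser to be rendered.

You need a tool that can run JavaScript, such as a headless browser. A headless browser library gives you browser functionality and lets you load a web page in a special browser with no GUI. You can then instruct the headless browser to mimic user interactions.

The most popular headless browser library for Golang is chromedp. Install it:

go get -u github.com/chromedp/chromedp

Then, use chromedp to scrape data from ScrapeMe in a browser as follows: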

package main 
 
import ( 
    "context" 
    "github.com/chromedp/cdproto/cdp" 
    "github.com/chromedp/chromedp" 
    "log" 
) 
 
type PokemonProduct struct { 
    url, image, name, price string 
} 
 
func main() { 
    var pokemonProducts []PokemonProduct 
 
    // initializing a chrome instance 
    ctx, cancel := chromedp.NewContext( 
        context.Background(), 
        chromedp.WithLogf(log.Printf), 
    ) 
    defer cancel() 
 
    // navigate to the target web page and select the HTML elements of interest 
    var nodes []*cdp.Node 
    err := chromedp.Run(ctx, 
        chromedp.Navigate("https://scrapeme.live/shop"), 
        chromedp.Nodes(".product", &nodes, chromedp.ByQueryAll), 
    ) 
    if err != nil { 
        log.Fatalln("Failed to load the target page:", err) 
    } 
 
    // scraping data from each node 
    var url, image, name, price string 
    for _, node := range nodes { 
        // running the queries on the current node and skipping it on failure 
        if err := chromedp.Run(ctx, 
            chromedp.AttributeValue("a", "href", &url, nil, chromedp.ByQuery, chromedp.FromNode(node)), 
            chromedp.AttributeValue("img", "src", &image, nil, chromedp.ByQuery, chromedp.FromNode(node)), 
            chromedp.Text("h2", &name, chromedp.ByQuery, chromedp.FromNode(node)), 
            chromedp.Text(".price", &price, chromedp.ByQuery, chromedp.FromNode(node)), 
        ); err != nil { 
            log.Println("Failed to scrape a product node:", err) 
            continue 
        } 
 
        pokemonProduct := PokemonProduct{} 
 
        pokemonProduct.url = url 
        pokemonProduct.image = image 
        pokemonProduct.name = name 
        pokemonProduct.price = price 
 
        pokemonProducts = append(pokemonProducts, pokemonProduct) 
    } 
 
    // export logic 
}

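The chromedp Nodes() function lets you instruct the headless browser to perform a query. This way, you can select the product HTML elements and store them in the nodes variable. Then, iterate over them and apply the AttributeValue() and Text() functions to get the data of interest.

Performing web scraping in Go with Colly or with chromedp is not that different. The difference between the two approaches is that chromedp runs the scraping instructions in a browser.

With chromedp, you can scrape dynamic-content websites and interact with a web page in a browser as a real user would. That also means your script is less likely to be detected as a bot, so chromedp makes it easier to scrape a web page without getting blocked.

In contrast, Colly is limited to static-content websites and does not offer the capabilities of a browser.

Putting It All Together: Final Code

This is the complete Colly-based Golang scraper, with crawling and basic anti-blocking logic: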

package main 
 
import ( 
    "encoding/csv" 
    "github.com/gocolly/colly" 
    "log" 
    "os" 
) 
 
// defining a data structure to store the scraped data 
type PokemonProduct struct { 
    url, image, name, price string 
} 
 
// it verifies if a string is present in a slice 
func contains(s []string, str string) bool { 
    for _, v := range s { 
        if v == str { 
            return true 
        } 
    } 
 
    return false 
} 
 
func main() { 
    // initializing the slice of structs that will contain the scraped data 
    var pokemonProducts []PokemonProduct 
 
    // initializing the list of pages to scrape with an empty slice 
    var pagesToScrape []string 
 
    // the first pagination URL to scrape 
    pageToScrape := "https://scrapeme.live/shop/page/1/" 
 
    // initializing the list of pages discovered with a pageToScrape 
    pagesDiscovered := []string{ pageToScrape } 
 
    // current iteration 
    i := 1 
    // max pages to scrape 
    limit := 5 
 
    // initializing a Colly instance 
    c := colly.NewCollector() 
    // setting a valid User-Agent header 
    c.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36" 
 
    // iterating over the list of pagination links to implement the crawling logic 
    c.OnHTML("a.page-numbers", func(e *colly.HTMLElement) { 
        // discovering a pagination link 
        newPaginationLink := e.Attr("href") 
 
        // if the link has not been discovered before, mark it as discovered 
        // and add it to the queue of pages to scrape 
        if !contains(pagesDiscovered, newPaginationLink) { 
            pagesDiscovered = append(pagesDiscovered, newPaginationLink) 
            pagesToScrape = append(pagesToScrape, newPaginationLink) 
        } 
    }) 
 
    // scraping the product data 
    c.OnHTML("li.product", func(e *colly.HTMLElement) { 
        pokemonProduct := PokemonProduct{} 
 
        pokemonProduct.url = e.ChildAttr("a", "href") 
        pokemonProduct.image = e.ChildAttr("img", "src") 
        pokemonProduct.name = e.ChildText("h2") 
        pokemonProduct.price = e.ChildText(".price") 
 
        pokemonProducts = append(pokemonProducts, pokemonProduct) 
    }) 
 
    c.OnScraped(func(response *colly.Response) { 
        // as long as there is still a page to scrape and the limit has not been reached 
        if len(pagesToScrape) != 0 && i < limit { 
            // getting the current page to scrape and removing it from the list 
            pageToScrape = pagesToScrape[0] 
            pagesToScrape = pagesToScrape[1:] 
 
            // incrementing the iteration counter 
            i++ 
 
            // visiting a new page 
            c.Visit(pageToScrape) 
        } 
    }) 
 
    // visiting the first page 
    c.Visit(pageToScrape) 
 
    // opening the CSV file 
    file, err := os.Create("products.csv") 
    if err != nil { 
        log.Fatalln("Failed to create output CSV file", err) 
    } 
    defer file.Close() 
 
    // initializing a file writer 
    writer := csv.NewWriter(file) 
 
    // defining the CSV headers 
    headers := []string{ 
        "url", 
        "image", 
        "name", 
        "price", 
    } 
    // writing the column headers 
    writer.Write(headers) 
 
    // adding each Pokemon product to the CSV output file 
    for _, pokemonProduct := range pokemonProducts { 
        // converting a PokemonProduct to an array of strings 
        record := []string{ 
            pokemonProduct.url, 
            pokemonProduct.image, 
            pokemonProduct.name, 
            pokemonProduct.price, 
        } 
 
        // writing a new CSV record 
        writer.Write(record) 
    } 
    defer writer.Flush() 
}

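In around 100 lines of code, you built a web scraper in Golang!

Other Web Scraping Libraries for Go

Other great libraries for web scraping with Golang are:

ZenRows: A complete web scraping API that handles anti-bot bypasses for you. It comes with headless browser capabilities, CAPTCHA bypass, rotating proxies, and more.
GoQuery: A Go library that offers a syntax and a set of features similar to jQuery. You can use it to perform web scraping just as you would with jQuery.
Ferret: A portable, extensible, and fast web scraping system that aims to simplify data extraction from the web. Ferret lets users focus on the data and is based on a unique declarative language.
Selenium: Probably the best-known headless browser tool, well suited to scraping dynamic content. It does not officially support Go, but there is a port that lets you use it in Go.

Conclusion

In this step-by-step Go tutorial, you saw the building blocks you need to get started with Golang web scraping.

As a recap, you learned:

How to perform basic data scraping with Go using Colly.
How to implement crawling logic that visits an entire website.
Why you may need a Go headless browser solution.
How to scrape a dynamic-content website with chromedp.

Scraping can get challenging because many websites implement anti-scraping measures, and many libraries struggle to bypass them. A common way to avoid these problems is to use a web scraping API.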
