htmlquery

htmlquery is golang XPath package for HTML query.

Github星跟踪图

htmlquery

Build Status
Coverage Status
GoDoc
Go Report Card

Overview

htmlquery is an XPath query package for HTML, lets you extract data or evaluate from HTML documents by an XPath expression.

htmlquery built-in the query object caching feature based on LRU, this feature will caching the recently used XPATH query string. Enable query caching can avoid re-compile XPath expression each query.

You can visit this page to learn about the supported XPath(1.0/2.0) syntax. https://github.com/antchfx/xpath

XPath query packages for Go

Name Description
htmlquery XPath query package for the HTML document
xmlquery XPath query package for the XML document
jsonquery XPath query package for the JSON document

Installation

go get github.com/antchfx/htmlquery

Getting Started

Query, returns matched elements or error.

nodes, err := htmlquery.QueryAll(doc, "//a")
if err != nil {
	panic(`not a valid XPath expression.`)
}

Load HTML document from URL.

doc, err := htmlquery.LoadURL("http://example.com/")

Load HTML from document.

filePath := "/home/user/sample.html"
doc, err := htmlquery.LoadDoc(filePath)

Load HTML document from string.

s := `<html>....</html>`
doc, err := htmlquery.Parse(strings.NewReader(s))

Find all A elements.

list := htmlquery.Find(doc, "//a")

Find all A elements that have href attribute.

list := htmlquery.Find(doc, "//a[@href]")	

Find all A elements with href attribute and only return href value.

list := htmlquery.Find(doc, "//a/@href")	
for _ , n := range list{
	fmt.Println(htmlquery.SelectAttr(n, "href")) // output @href value
}

Find the third A element.

a := htmlquery.FindOne(doc, "//a[3]")

Find children element (img) under A href and print the source

a := htmlquery.FindOne(doc, "//a")
img := htmlquery.FindOne(a, "//img")
fmt.Prinln(htmlquery.SelectAttr(img, "src")) // output @src value

Evaluate the number of all IMG element.

expr, _ := xpath.Compile("count(//img)")
v := expr.Evaluate(htmlquery.CreateXPathNavigator(doc)).(float64)
fmt.Printf("total count is %f", v)

Quick Starts

func main() {
	doc, err := htmlquery.LoadURL("https://www.bing.com/search?q=golang")
	if err != nil {
		panic(err)
	}
	// Find all news item.
	list, err := htmlquery.QueryAll(doc, "//ol/li")
	if err != nil {
		panic(err)
	}
	for i, n := range list {
		a := htmlquery.FindOne(n, "//a")
		if a != nil {
		    fmt.Printf("%d %s(%s)\n", i, htmlquery.InnerText(a), htmlquery.SelectAttr(a, "href"))
		}
	}
}

FAQ

Find() vs QueryAll(), which is better?

Find and QueryAll both do the same things, searches all of matched html nodes.
The Find will panics if you give an error XPath query, but QueryAll will return an error for you.

Can I save my query expression object for the next query?

Yes, you can. We offer the QuerySelector and QuerySelectorAll methods, It will accept your query expression object.

Cache a query expression object(or reused) will avoid re-compile XPath query expression, improve your query performance.

XPath query object cache performance

goos: windows
goarch: amd64
pkg: github.com/antchfx/htmlquery
BenchmarkSelectorCache-4                20000000                55.2 ns/op
BenchmarkDisableSelectorCache-4           500000              3162 ns/op

How to disable caching?

htmlquery.DisableSelectorCache = true

Questions

Please let me know if you have any questions.

主要指标

概览
名称与所有者antchfx/htmlquery
主编程语言Go
编程语言Go (语言数: 1)
平台
许可证MIT License
所有者活动
创建于2017-12-05 01:08:41
推送于2025-06-10 05:25:13
最后一次提交2025-06-10 13:23:58
发布数15
最新版本名称v1.3.4 (发布于 )
第一版名称v1.0.0 (发布于 )
用户参与
星数768
关注者数10
派生数77
提交数134
已启用问题?
问题数60
打开的问题数5
拉请求数9
打开的拉请求数3
关闭的拉请求数5
项目设置
已启用Wiki?
已存档?
是复刻?
已锁定?
是镜像?
是私有?