goq

A declarative, struct-tag-based HTML unmarshaling (scraping) package for Go, built on top of the goquery library.

Example

package main

import (
	"log"
	"net/http"

	"astuart.co/goq"
)

// example is a structured representation of the GitHub file name table.
type example struct {
	Title string `goquery:"h1"`
	Files []string `goquery:"table.files tbody tr.js-navigation-item td.content,text"`
}

func main() {
	res, err := http.Get("https://github.com/andrewstuart/goq")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	var ex example
	
	err = goq.NewDecoder(res.Body).Decode(&ex)
	if err != nil {
		log.Fatal(err)
	}

	log.Println(ex.Title, ex.Files)
}

Details

goq

--
import "astuart.co/goq"

Package goq was built to allow users to declaratively unmarshal HTML into Go
structs using struct tags composed of CSS selectors.

I've made a best effort to behave much like JSON and XML decoding, and to
expose as much information as possible when an error occurs, to help you debug
your unmarshaling issues.

When creating struct types to be unmarshaled into, the following general rules
apply:

  • Any type that implements the Unmarshaler interface will be passed a slice of
    *html.Node so that manual unmarshaling may be done. This takes the highest
    precedence.

  • Any struct fields may be annotated with goquery metadata, which takes the form
    of an element selector followed by arbitrary comma-separated "value selectors."

  • A value selector may be one of html, text, or [someAttrName]. html and
    text will result in the methods of the same name being called on the
    *goquery.Selection to obtain the value. [someAttrName] will result in
    *goquery.Selection.Attr("someAttrName") being called for the value.

  • A primitive value type will default to the text value of the resulting nodes
    if no value selector is given.

  • At least one value selector is required for maps, to determine the map key.
    The key type must follow both the rules applicable to Go map indexing and
    these unmarshaling rules. The value of each key will be unmarshaled in the
    same way the element value is unmarshaled.

  • For maps, keys will be retrieved from the same level of the DOM. The key
    selector may be arbitrarily nested, but the first level of children with any
    number of matching elements will be used.

  • For maps, any values must be nested below the level of the key selector.
    Parents or siblings of the element matched by the key selector will not be
    considered.

  • Once used, a "value selector" will be shifted off of the comma-separated list.
    This allows you to nest arbitrary levels of value selectors. For example, the
    type []map[string][]string would require one selector for the map key, and
    take an optional second selector for the values of the string slice.

  • Any struct type encountered in nested types (e.g. map[string]SomeStruct) will
    override any remaining "value selectors" that had not been used. For example,
    given:

    type S struct {
        F string `goquery:",[bang]"`
    }

    struct {
        T map[string]S `goquery:"#someId,[foo],[bar],[baz]"`
    }

    [foo] will be used to determine the string map key, but [bar] and [baz]
    will be ignored, with the [bang] tag present in the S struct type taking
    precedence.
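
The tag format described in the rules above can be illustrated with a short sketch that uses only the standard library. It is not goq's implementation: splitTag, describe, shift, and the doc type are hypothetical names chosen for illustration, and splitting the tag on commas is an assumption of the sketch. It only shows how an element selector and its value selectors might be read from a struct tag and consumed one at a time.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// S and doc mirror the example types from the rules above.
// The name "doc" is hypothetical; the original struct is anonymous.
type S struct {
	F string `goquery:",[bang]"`
}

type doc struct {
	T map[string]S `goquery:"#someId,[foo],[bar],[baz]"`
}

// splitTag separates a goquery tag into its element selector and the
// value selectors that follow it. Splitting on commas is an assumption
// of this sketch, not a statement about goq's parser.
func splitTag(tag string) (element string, values []string) {
	parts := strings.Split(tag, ",")
	return parts[0], parts[1:]
}

// describe classifies one value selector as html, text, or [attr].
func describe(v string) string {
	switch {
	case v == "html":
		return "inner HTML"
	case v == "text":
		return "text content"
	case strings.HasPrefix(v, "[") && strings.HasSuffix(v, "]"):
		return "attribute " + strings.Trim(v, "[]")
	default:
		return "unknown"
	}
}

// shift removes the first value selector from the list and returns it
// together with the remaining selectors.
func shift(values []string) (first string, rest []string) {
	if len(values) == 0 {
		return "", nil
	}
	return values[0], values[1:]
}

func main() {
	field, _ := reflect.TypeOf(doc{}).FieldByName("T")
	element, values := splitTag(field.Tag.Get("goquery"))
	fmt.Println("element selector:", element)

	// The map consumes one value selector for its key.
	key, rest := shift(values)
	fmt.Println("map key taken from:", describe(key))

	// S is a struct type, so its own tags take over and these are unused.
	fmt.Println("ignored value selectors:", rest)
}
```

In this sketch the empty element selector in `,[bang]` comes back as an empty string, which matches the idea that a field may supply only a value selector.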

Usage

func NodeSelector

func NodeSelector(nodes []*html.Node) *goquery.Selection

NodeSelector is a quick utility function to get a goquery.Selection from a slice
of *html.Node. It is useful when implementing custom unmarshaling, since the
Unmarshaler interface receives []*html.Node for maximum flexibility.

func Unmarshal

func Unmarshal(bs []byte, v interface{}) error

Unmarshal takes a byte slice and a destination pointer (passed as interface{}),
and unmarshals the document into the destination based on the rules above. Any
error returned here will likely be of type CannotUnmarshalError, though an
initial goquery error will pass through directly.

func UnmarshalSelection

func UnmarshalSelection(s *goquery.Selection, iface interface{}) error

UnmarshalSelection will unmarshal a goquery.Selection into an interface
appropriately annotated with goquery tags.

type CannotUnmarshalError

type CannotUnmarshalError struct {
	Err      error
	Val      string
	FldOrIdx interface{}
}

CannotUnmarshalError represents an error returned by the goquery Unmarshaler and
helps consumers in programmatically diagnosing the cause of their error.

func (*CannotUnmarshalError) Error

func (e *CannotUnmarshalError) Error() string
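
As a rough illustration of programmatic diagnosis, the sketch below walks a chain of errors and collects each FldOrIdx value. To compile without the package, it redeclares the struct locally with the fields documented above. Real code would use goq.CannotUnmarshalError instead. The Error message format, the path and sample helpers, and the assumption that nested failures are chained through Err are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// CannotUnmarshalError is a local stand-in with the documented fields,
// declared here only so the sketch is self-contained.
type CannotUnmarshalError struct {
	Err      error
	Val      string
	FldOrIdx interface{}
}

// Error uses a placeholder format, not the library's actual message.
func (e *CannotUnmarshalError) Error() string {
	return fmt.Sprintf("cannot unmarshal %q into %v: %v", e.Val, e.FldOrIdx, e.Err)
}

// path collects the field names or indices along the error chain,
// assuming each nested failure is stored in Err.
func path(err error) []interface{} {
	var out []interface{}
	var cue *CannotUnmarshalError
	for errors.As(err, &cue) {
		out = append(out, cue.FldOrIdx)
		err = cue.Err
	}
	return out
}

// sample builds a hypothetical nested error: element 2 of field Files
// could not be converted.
func sample() error {
	return &CannotUnmarshalError{
		FldOrIdx: "Files",
		Err: &CannotUnmarshalError{
			FldOrIdx: 2,
			Val:      "abc",
			Err:      errors.New("invalid syntax"),
		},
	}
}

func main() {
	fmt.Println(path(sample()))
}
```

The loop stops at the first error in the chain that is not a *CannotUnmarshalError, so an error passed through from goquery itself yields an empty path.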

type Decoder

type Decoder struct {
}

Decoder implements the same API you will see in encoding/xml and encoding/json,
except that proper streaming decoding is not currently supported, as goquery
does not support it upstream.

func NewDecoder

func NewDecoder(r io.Reader) *Decoder

NewDecoder returns a new decoder given an io.Reader.

func (*Decoder) Decode

func (d *Decoder) Decode(dest interface{}) error

Decode will unmarshal the contents of the decoder when given an instance of an
annotated type as its argument. It will return any errors encountered during
either parsing the document or unmarshaling into the given object.

type Unmarshaler

type Unmarshaler interface {
	UnmarshalHTML([]*html.Node) error
}

Unmarshaler allows for custom implementations of unmarshaling logic.

TODO

  • Callable goquery methods with args, via reflection

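Key metrics

Overview
Name and owner: andrewstuart/goq
Primary language: Go
Languages: Go (1 language)
License: MIT License

Owner activity
Created: 2017-02-20 02:54:40
Pushed: 2021-09-02 04:20:26
Last commit: 2021-09-01 21:20:17
Releases: 1
Latest release: v1.0.0
First release: v1.0.0

User engagement
Stars: 268
Watchers: 8
Forks: 21
Commits: 73
Issues: 12
Open issues: 3
Pull requests: 2
Open pull requests: 0
Closed pull requests: 1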