pup

pup is a command line tool for processing HTML. It reads from stdin,
prints to stdout, and allows the user to filter parts of the page using
CSS selectors.

Inspired by jq, pup aims to be a
fast and flexible way of exploring HTML from the terminal.

Install

Direct downloads are available through the releases page.

If you have Go installed on your computer just run go get.

go get github.com/ericchiang/pup

If you're on OS X, use Homebrew to install (no Go required).

brew install https://raw.githubusercontent.com/EricChiang/pup/master/pup.rb

Quick start

$ curl -s https://news.ycombinator.com/

Ew, HTML. Let's run that through some pup selectors:

$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a'

Okay, how about only the links?

$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a attr{href}'

Even better, let's grab the titles too:

$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a json{}'

Basic Usage

$ cat index.html | pup [flags] '[selectors] [display function]'

Examples

Download a webpage with wget.

$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html

Clean and indent

By default pup will fill in missing tags and properly indent the page.

$ cat robots.html
# nasty looking HTML
$ cat robots.html | pup --color
# cleaned, indented, and colorful HTML

Filter by tag

$ cat robots.html | pup 'title'
<title>
 Robots exclusion standard - Wikipedia, the free encyclopedia
</title>

Filter by id

$ cat robots.html | pup 'span#See_also'
<span class="mw-headline" id="See_also">
 See also
</span>

Filter by attribute

$ cat robots.html | pup 'th[scope="row"]'
<th scope="row" class="navbox-group">
 Exclusion standards
</th>
<th scope="row" class="navbox-group">
 Related marketing topics
</th>
<th scope="row" class="navbox-group">
 Search marketing related topics
</th>
<th scope="row" class="navbox-group">
 Search engine spam
</th>
<th scope="row" class="navbox-group">
 Linking
</th>
<th scope="row" class="navbox-group">
 People
</th>
<th scope="row" class="navbox-group">
 Other
</th>

Pseudo Classes

CSS selectors have a group of specifiers called "pseudo classes" which are pretty
cool. pup implements a majority of the relevant ones.

Here are some examples.

$ cat robots.html | pup 'a[rel]:empty'
<a rel="license" href="//creativecommons.org/licenses/by-sa/3.0/" style="display:none;">
</a>

$ cat robots.html | pup ':contains("History")'
<span class="toctext">
 History
</span>
<span class="mw-headline" id="History">
 History
</span>

$ cat robots.html | pup ':parent-of([action="edit"])'
<span class="wb-langlinks-edit wb-langlinks-link">
 <a action="edit" href="//www.wikidata.org/wiki/Q80776#sitelinks-wikipedia" text="Edit links" title="Edit interlanguage links" class="wbc-editpage">
  Edit links
 </a>
</span>

For a complete list, view the implemented selectors
section.

`+`, `>`, and `,`

These are intermediate characters that sit between selectors and declare
special instructions. For instance, a comma `,` lets you give pup multiple
groups of selectors.

$ cat robots.html | pup 'title, h1 span[dir="auto"]'
<title>
 Robots exclusion standard - Wikipedia, the free encyclopedia
</title>
<span dir="auto">
 Robots exclusion standard
</span>

Chain selectors together

When combining selectors, the HTML nodes selected by the previous selector will
be passed to the next ones.

$ cat robots.html | pup 'h1#firstHeading'
<h1 id="firstHeading" class="firstHeading" lang="en">
 <span dir="auto">
  Robots exclusion standard
 </span>
</h1>

$ cat robots.html | pup 'h1#firstHeading span'
<span dir="auto">
 Robots exclusion standard
</span>
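
The chaining behaviour above can be pictured with a toy model. The following Python sketch is not pup's implementation; the `Node` class and the `select` and `chain` helpers are hypothetical names used only for illustration, and the only selector they understand is a bare tag name. Each step searches the descendants of whatever the previous step matched.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy HTML element: a tag name plus child elements (hypothetical type)."""
    tag: str
    children: list = field(default_factory=list)


def descendants(node):
    """Yield every element below `node`, depth first."""
    for child in node.children:
        yield child
        yield from descendants(child)


def select(nodes, tag):
    """Return the descendants of `nodes` whose tag matches, without duplicates."""
    matched = []
    for node in nodes:
        for candidate in descendants(node):
            if candidate.tag == tag and not any(candidate is m for m in matched):
                matched.append(candidate)
    return matched


def chain(root, tags):
    """Apply one tag selector per step, feeding each result into the next."""
    current = [root]
    for tag in tags:
        current = select(current, tag)
    return current


# A tiny document: <html><body><h1><span/></h1><p><span/></p></body></html>
span = Node("span")
h1 = Node("h1", [span])
document = Node("html", [Node("body", [h1, Node("p", [Node("span")])])])

print([n.tag for n in chain(document, ["h1"])])          # ['h1']
print([n.tag for n in chain(document, ["h1", "span"])])  # ['span']
```

Only the span inside the h1 survives the second step, because the span under the p was never among the nodes handed to it.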

Implemented Selectors

For further examples of these selectors head over to MDN.

pup '.class'
pup '#id'
pup 'element'
pup 'selector + selector'
pup 'selector > selector'
pup '[attribute]'
pup '[attribute="value"]'
pup '[attribute*="value"]'
pup '[attribute~="value"]'
pup '[attribute^="value"]'
pup '[attribute$="value"]'
pup ':empty'
pup ':first-child'
pup ':first-of-type'
pup ':last-child'
pup ':last-of-type'
pup ':only-child'
pup ':only-of-type'
pup ':contains("text")'
pup ':nth-child(n)'
pup ':nth-of-type(n)'
pup ':nth-last-child(n)'
pup ':nth-last-of-type(n)'
pup ':not(selector)'
pup ':parent-of(selector)'

You can mix and match selectors as you wish.

$ cat index.html | pup 'element#id[attribute="value"]:first-of-type'

Display Functions

Non-HTML selectors which affect the output type are implemented as functions
which can be provided as a final argument.

`text{}`

Print all text from selected nodes and children in depth first order.

$ cat robots.html | pup '.mw-headline text{}'
History
About the standard
Disadvantages
Alternatives
Examples
Nonstandard extensions
Crawl-delay directive
Allow directive
Sitemap
Host
Universal "*" match
Meta tags and headers
See also
References
External links
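
To make "depth first order" concrete, here is a small Python sketch built on the standard library's `html.parser`. It is an illustration of the traversal order only, not pup's code: the `TextCollector` class and `collect_text` helper are hypothetical names, the sample markup is made up, and trimming whitespace and skipping empty strings are assumptions of this sketch.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect non-empty text nodes in document (depth-first) order."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_data(self, data):
        # Assumption of this sketch: trim whitespace and skip empty text.
        text = data.strip()
        if text:
            self.lines.append(text)


def collect_text(html):
    """Return the text nodes of `html` in the order a depth-first walk visits them."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.lines


# Made-up sample: text inside nested children is emitted where it occurs.
sample = "<div><p>one <b>two</b> three</p><p>four</p></div>"
print(collect_text(sample))  # ['one', 'two', 'three', 'four']
```

The text inside the nested b element appears between its parent's own text fragments, which is what a depth-first walk produces.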

`attr{attrkey}`

Print the values of all attributes with a given key from all selected nodes.

$ cat robots.html | pup '.catlinks div attr{id}'
mw-normal-catlinks
mw-hidden-catlinks
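
When the key is href, as in the Quick start, the printed values are presumably whatever the page contains, so they may be relative paths such as the /wiki/... links in this page. A minimal Python sketch for resolving them against the page URL follows; the `absolutize` helper is a hypothetical name, and the two sample hrefs are copied from the outputs shown elsewhere in this README.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each non-empty href against the URL the page was fetched from."""
    return [urljoin(base_url, href.strip()) for href in hrefs if href.strip()]


page = "http://en.wikipedia.org/wiki/Robots_exclusion_standard"
hrefs = [
    "/wiki/Talk:Robots_exclusion_standard",       # path-relative link
    "//creativecommons.org/licenses/by-sa/3.0/",  # scheme-relative link
]

for url in absolutize(page, hrefs):
    print(url)
```

Feeding the function the lines of an `attr{href}` run, for example via `sys.stdin.read().splitlines()`, would let it sit at the end of a pipeline.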

`json{}`

Print HTML as JSON.

$ cat robots.html | pup 'div#p-namespaces a'
<a href="/wiki/Robots_exclusion_standard" title="View the content page [c]" accesskey="c">
 Article
</a>
<a href="/wiki/Talk:Robots_exclusion_standard" title="Discussion about the content page [t]" accesskey="t">
 Talk
</a>

$ cat robots.html | pup 'div#p-namespaces a json{}'
[
 {
  "accesskey": "c",
  "href": "/wiki/Robots_exclusion_standard",
  "tag": "a",
  "text": "Article",
  "title": "View the content page [c]"
 },
 {
  "accesskey": "t",
  "href": "/wiki/Talk:Robots_exclusion_standard",
  "tag": "a",
  "text": "Talk",
  "title": "Discussion about the content page [t]"
 }
]

Use the -i / --indent flag to control the indent level.

$ cat robots.html | pup -i 4 'div#p-namespaces a json{}'
[
    {
        "accesskey": "c",
        "href": "/wiki/Robots_exclusion_standard",
        "tag": "a",
        "text": "Article",
        "title": "View the content page [c]"
    },
    {
        "accesskey": "t",
        "href": "/wiki/Talk:Robots_exclusion_standard",
        "tag": "a",
        "text": "Talk",
        "title": "Discussion about the content page [t]"
    }
]

If the selectors only return one element the results will be printed as a JSON
object, not a list.

$ cat robots.html | pup --indent 4 'title json{}'
{
    "tag": "title",
    "text": "Robots exclusion standard - Wikipedia, the free encyclopedia"
}
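
Because the output can be either an object or a list, a script consuming it may want to normalize the two shapes first. The Python sketch below shows one way to do that; `load_elements` is a hypothetical helper, and the sample strings are abbreviated copies of the outputs shown above rather than live pup output.

```python
import json


def load_elements(raw):
    """Parse json{} output and always return a list of element dicts."""
    data = json.loads(raw)
    # A single match is described above as a bare object; wrap it in a list.
    return data if isinstance(data, list) else [data]


single = '{"tag": "title", "text": "Robots exclusion standard - Wikipedia, the free encyclopedia"}'
many = (
    '[{"tag": "a", "text": "Article", "href": "/wiki/Robots_exclusion_standard"},'
    ' {"tag": "a", "text": "Talk", "href": "/wiki/Talk:Robots_exclusion_standard"}]'
)

for raw in (single, many):
    for element in load_elements(raw):
        # Attributes that are absent (here: href on <title>) fall back to "-".
        print(element["tag"], element.get("href", "-"), element["text"])
```

Replacing the sample strings with `sys.stdin.read()` would let the same code read from a pipe.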

Because there is no universal standard for converting HTML/XML to JSON, a
method has been chosen which hopefully fits. The goal is simply to get the
output of pup into a more consumable format.

Flags

Run pup --help for a list of further options.

Name and owner	ericchiang/pup
Primary language	HTML
Languages	Go (5 languages in total)
License	MIT License

Created	2014-09-01 01:31:29
Last pushed	2024-05-02 13:43:38
Last commit	2022-03-06 12:45:36
Releases	15
Latest release	v0.4.0 (released 2016-07-22 21:53:20)
First release	0.1.0

Stars	8.3k
Watchers	91
Forks	265
Commits	103
Issues	150
Open issues	82
Pull requests	17
Open pull requests	25
Closed pull requests	15
