creeper

:paw_prints: Creeper - The Next Generation Crawler Framework (Go)


About

Creeper is a next-generation crawler that fetches web pages according to a Creeper script. Because it is a cross-platform, embeddable crawler, you can use it in a news app, a subscription program, and similar projects.

Warning: This project is still in early-stage development. Please do not use it in a production environment.

Get Started

Installation

$ go get github.com/wspl/creeper

Hello World!

Create hacker_news.crs

page(@page=1) = "https://news.ycombinator.com/news?p={@page}"

news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href

Then, create main.go

package main

import "github.com/wspl/creeper"

func main() {
	// Load the crawler script created above.
	c := creeper.Open("./hacker_news.crs")
	// Visit each entry of the news[] array node.
	c.Array("news").Each(func(c *creeper.Creeper) {
		// Read the string fields defined under news[].
		println("title: ", c.String("title"))
		println("site: ", c.String("site"))
		println("link: ", c.String("link"))
		println("===")
	})
}

Build and run it. The console prints something like:

title:  Samsung chief Lee arrested as S.Korean corruption probe deepens
site:  reuters.com
link:  http://www.reuters.com/article/us-southkorea-politics-samsung-group-idUSKBN15V2RD
===
title:  ReactOS 0.4.4 Released
site:  reactos.org
link:  https://reactos.org/project-news/reactos-044-released
===
title:  FeFETs: How this new memory stacks up against existing non-volatile memory
site:  semiengineering.com
link:  http://semiengineering.com/what-are-fefets/

Script Spec

Town

A town is a lambda-like expression that holds a mutable or immutable string. Most of the time it is used to store a URL.

page(@page=1, ext) = "https://news.ycombinator.com/news?p={@page}&ext={ext}"

To use a town, call it as if it were a function:

news[]: page(ext="Hello World!") -> $("tr.athing")

You might have noticed that the @page parameter is not passed in this call. That is because it is a special parameter.

In a town definition, an expression like name="something" means that the parameter name has the default value "something".

Incidentally, @page is a parameter that increases automatically when the current page has no more content.
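To make the parameter rules concrete, here is a minimal Go sketch of how a town could be expanded into a URL. It is not Creeper's implementation: the town type, its expand method, and the plain {name} substitution are assumptions made for illustration, and the loop only imitates what the automatic @page increment might look like from the outside.

```go
package main

import (
	"fmt"
	"strings"
)

// town is a hypothetical stand-in for a parsed town definition:
// a URL template plus the default value of each parameter.
type town struct {
	template string
	defaults map[string]string
}

// expand substitutes every {name} placeholder, preferring an explicit
// argument over the parameter's default value.
func (t town) expand(args map[string]string) string {
	out := t.template
	for name, def := range t.defaults {
		val, ok := args[name]
		if !ok {
			val = def
		}
		out = strings.ReplaceAll(out, "{"+name+"}", val)
	}
	return out
}

func main() {
	page := town{
		template: "https://news.ycombinator.com/news?p={@page}&ext={ext}",
		defaults: map[string]string{"@page": "1", "ext": ""},
	}
	// Mirrors a call like page(ext="hello"): @page falls back to its default of 1.
	fmt.Println(page.expand(map[string]string{"ext": "hello"}))
	// Imitates the crawler advancing @page once a page has no more content.
	for p := 2; p <= 3; p++ {
		fmt.Println(page.expand(map[string]string{"@page": fmt.Sprint(p), "ext": "hello"}))
	}
}
```

The sketch does no URL escaping, so a value containing spaces would need to be encoded before substitution.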

Node

Nodes form a tree that represents the structure of the data you are going to crawl.

news[]: page -> $("tr.athing")
	title: $(".title a.storylink").text
	site: $(".title span.sitestr").text
	link: $(".title a.storylink").href

As in YAML, nodes express their hierarchy through indentation.
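The following Go sketch shows one way indentation could be turned into a node tree. It is an illustration only, not Creeper's parser: the node type and the parseNodes and dump functions are hypothetical, the sketch assumes tab indentation, and it treats a name ending in [] as an array, as in the script above.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical in-memory form of one script line.
type node struct {
	name     string
	isArray  bool // true when the name ends in "[]"
	children []*node
}

// parseNodes builds a tree from tab-indented "name: expression" lines.
func parseNodes(src string) []*node {
	root := &node{}
	stack := []*node{root} // stack[d] is the parent for lines at depth d
	for _, line := range strings.Split(src, "\n") {
		trimmed := strings.TrimLeft(line, "\t")
		if strings.TrimSpace(trimmed) == "" {
			continue // skip blank lines
		}
		depth := len(line) - len(trimmed)
		if depth > len(stack)-1 {
			depth = len(stack) - 1 // clamp lines indented too deeply
		}
		name := strings.TrimSpace(strings.SplitN(trimmed, ":", 2)[0])
		n := &node{
			name:    strings.TrimSuffix(name, "[]"),
			isArray: strings.HasSuffix(name, "[]"),
		}
		parent := stack[depth]
		parent.children = append(parent.children, n)
		stack = append(stack[:depth+1], n)
	}
	return root.children
}

// dump prints the tree, two spaces per level.
func dump(nodes []*node, indent string) {
	for _, n := range nodes {
		kind := "field"
		if n.isArray {
			kind = "array"
		}
		fmt.Printf("%s%s (%s)\n", indent, n.name, kind)
		dump(n.children, indent+"  ")
	}
}

func main() {
	src := "news[]: page -> $(\"tr.athing\")\n" +
		"\ttitle: $(\".title a.storylink\").text\n" +
		"\tsite: $(\".title span.sitestr\").text\n" +
		"\tlink: $(\".title a.storylink\").href"
	dump(parseNodes(src), "")
}
```

Keeping a stack of the most recent node at each depth is a common way to parse indentation-based formats in a single pass.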

Node Name

Every node has a name. title is a field name and represents a plain string value. news[] is an array name and represents a parent structure that contains multiple sub-items.

Page

Page indicates where to fetch the field data from. It can be a town expression or a field reference.

Field reference is an advanced use of nodes; you can find the details in ./eh.crs.

If a node has both a page and a fun, the page goes on the left of -> and the fun on the right, that is, page -> fun.
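As a small illustration of the page -> fun layout, the Go sketch below splits a node expression at the arrow. The splitNode helper is hypothetical and not part of Creeper. It assumes that an expression without -> is a fun only (as in the title field of the example script), and it does not handle an arrow that appears inside a quoted string.

```go
package main

import (
	"fmt"
	"strings"
)

// splitNode separates the optional page part from the fun part of a
// node expression. Without "->" the whole expression is taken as the fun.
func splitNode(expr string) (page, fun string) {
	if i := strings.Index(expr, "->"); i >= 0 {
		return strings.TrimSpace(expr[:i]), strings.TrimSpace(expr[i+2:])
	}
	return "", strings.TrimSpace(expr)
}

func main() {
	page, fun := splitNode(`page -> $("tr.athing")`)
	fmt.Printf("page=%s fun=%s\n", page, fun)

	page, fun = splitNode(`$(".title a.storylink").text`)
	fmt.Printf("page=%s fun=%s\n", page, fun)
}
```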

Fun

A fun describes how the fetched data is processed.
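To give a feel for the text-processing funs, here is a rough Go sketch of what match and expand from the list of supported funs might do, built on the standard regexp package. The function names matchFirst and expand, the choice to apply expand to the first match only, and the $1-style target syntax are assumptions; Creeper's actual behavior may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchFirst returns the first substring of s that matches pattern,
// or "" when nothing matches.
func matchFirst(s, pattern string) string {
	return regexp.MustCompile(pattern).FindString(s)
}

// expand rewrites the first match of pattern in s using a template
// such as "$2:$1", where $n refers to the n-th capture group.
func expand(s, pattern, target string) string {
	re := regexp.MustCompile(pattern)
	m := re.FindStringSubmatchIndex(s)
	if m == nil {
		return ""
	}
	return string(re.ExpandString(nil, target, s, m))
}

func main() {
	text := "42 points by alice"
	fmt.Println(matchFirst(text, `\d+`))                          // prints "42"
	fmt.Println(expand(text, `(\d+) points by (\w+)`, "$2:$1")) // prints "alice:42"
}
```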

The supported funs are:

| Name | Parameters | Description |
| --- | --- | --- |
| $ | (selector: string) | Relative CSS selector (select from parent node) |
| $root | (selector: string) | Absolute CSS selector (select from body) |
| html | | inner HTML |
| text | | inner text |
| outerHTML | | outer HTML |
| attr | (attr: string) | attribute value |
| style | | style attribute value |
| href | | href attribute value |
| src | | src attribute value |
| class | | class attribute value |
| id | | id attribute value |
| calc | (prec: int) | calculate arithmetic expression |
| match | (regexp: string) | match first sub-string via regular expression |
| expand | (regexp: string, target: string) | expand matched strings to target string |

Author

Plutonist

impl.moe · Github @wspl

Overview

Name With Owner: wspl/creeper
Primary Language: Go
Program Language: Go (Language Count: 1)
License: Apache License 2.0
Release Count: 0
Created At: 2017-02-17 03:01:50
Pushed At: 2017-05-16 12:14:14
Last Commit At: 2017-05-16 20:14:13
Stargazers Count: 776
Watchers Count: 47
Fork Count: 62
Commits Count: 51
Issues Count: 9
Issue Open Count: 5
Pull Requests Count: 0
Pull Requests Open Count: 0
Pull Requests Close Count: 0