robotstxt
=========

The robots.txt exclusion protocol implementation for the Go language.

What
----

This is a robots.txt exclusion protocol implementation for the Go language (golang).

Build
-----

To build and run tests, run ``go test`` in the source directory.

Contribute
----------

Warm welcome.

  • If desired, add your name in README.rst, section Who.
  • Run ``script/test && script/clean && echo ok``.
  • You can ignore linter warnings, but everything else must pass.
  • Send your change as a pull request or just a regular patch to the current maintainer (see section Who).

Thank you.

Usage
-----

As usual, no special installation is required, just::

    import "github.com/temoto/robotstxt"

run ``go get`` and you're ready.
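
For orientation, here is a minimal end-to-end sketch combining the parse and
query steps described below; the rules and agent name are illustrative::

    package main

    import (
        "fmt"

        "github.com/temoto/robotstxt"
    )

    func main() {
        // Parse a robots.txt body with one disallowed prefix.
        robots, err := robotstxt.FromString("User-agent: *\nDisallow: /private/")
        if err != nil {
            panic(err)
        }
        // Ask whether FooBot may fetch the given paths.
        fmt.Println(robots.TestAgent("/", "FooBot"))          // true
        fmt.Println(robots.TestAgent("/private/x", "FooBot")) // false
    }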

1. Parse
^^^^^^^^

First of all, you need to parse robots.txt data. You can do it with the
function FromBytes(body []byte) (*RobotsData, error) or its string
counterpart FromString(body string) (*RobotsData, error)::

    robots, err := robotstxt.FromBytes([]byte("User-agent: *\nDisallow:"))
    robots, err := robotstxt.FromString("User-agent: *\nDisallow:")

As of 2012-10-03, FromBytes is the most efficient method; everything else
is a wrapper around this core function.

There are a few convenience constructors for various purposes:

  • FromResponse(*http.Response) (*RobotsData, error) to init robots data
    from an HTTP response. It does not call response.Body.Close()::

        robots, err := robotstxt.FromResponse(resp)
        resp.Body.Close()
        if err != nil {
            log.Println("Error parsing robots.txt:", err.Error())
        }

  • FromStatusAndBytes(statusCode int, body []byte) (*RobotsData, error) or
    FromStatusAndString if you prefer to read the bytes (string) yourself.
    Passing the status code applies the following logic, in line with Google's
    interpretation of robots.txt files (see the sketch after this list):

    • status 2xx -> parse the body with FromBytes and apply the rules listed there.
    • status 4xx -> allow all (even 401/403, as recommended by Google).
    • other (5xx) -> disallow all, consider this a temporary unavailability.
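
For illustration, a minimal sketch that fetches robots.txt over HTTP and feeds
the status code and body to FromStatusAndBytes; the URL is an example, and
net/http, io and log are assumed to be imported::

    resp, err := http.Get("https://example.com/robots.txt")
    if err != nil {
        log.Fatal(err)
    }
    body, err := io.ReadAll(resp.Body)
    resp.Body.Close()
    if err != nil {
        log.Fatal(err)
    }
    // 2xx parses the body, 4xx allows all, anything else disallows all.
    robots, err := robotstxt.FromStatusAndBytes(resp.StatusCode, body)
    if err != nil {
        log.Fatal(err)
    }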
2. Query
^^^^^^^^

Parsing robots.txt content builds a kind of logic database, which you can
query with the method (r *RobotsData) TestAgent(url, agent string) bool.

Explicitly passing the agent is useful if you want to query for different agents.
If you only query for a single agent, there is a more efficient option:
RobotsData.FindGroup(userAgent string) returns a structure with a
.Test(path string) method and a .CrawlDelay time.Duration field.

A simple query with an explicit user agent; each call scans all rules.

::

    allow := robots.TestAgent("/", "FooBot")

Or query several paths against the same user agent, for performance.

::

    group := robots.FindGroup("BarBot")
    group.Test("/")
    group.Test("/download.mp3")
    group.Test("/news/article-2012-1")

Who
---

Honorable contributors (in no particular order):

* Ilya Grigorik (igrigorik)
* Martin Angers (PuerkitoBio)
* Micha Gorelick (mynameisfiber)

Initial commit and other work: Sergey Shepelev <temotor@gmail.com>

Flair
-----

.. image:: https://travis-ci.org/temoto/robotstxt.svg?branch=master
   :target: https://travis-ci.org/temoto/robotstxt

.. image:: https://codecov.io/gh/temoto/robotstxt/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/temoto/robotstxt

.. image:: https://goreportcard.com/badge/github.com/temoto/robotstxt
   :target: https://goreportcard.com/report/github.com/temoto/robotstxt
