robotstxt

The robots.txt exclusion protocol implementation for the Go language

What

This is a robots.txt exclusion protocol implementation for the Go language (golang).

Build

To build and run the tests, run go test in the source directory.

Contribute

Warm welcome.

  • If desired, add your name in README.rst, section Who.
  • Run script/test && script/clean && echo ok
  • You can ignore linter warnings, but everything else must pass.
  • Send your change as a pull request or a regular patch to the current maintainer (see the Who section).

Thank you.

Usage

As usual, no special installation is required, just

import "github.com/temoto/robotstxt"

run go get and you're ready.

1. Parse
^^^^^^^^

First of all, you need to parse the robots.txt data. You can do it with the
function FromBytes(body []byte) (*RobotsData, error), or FromString(body string)
if you already have a string::

robots, err := robotstxt.FromBytes([]byte("User-agent: *\nDisallow:"))
robots, err := robotstxt.FromString("User-agent: *\nDisallow:")

As of 2012-10-03, FromBytes is the most efficient method; everything else
is a wrapper around this core function.

There are a few convenience constructors for various purposes:

  • FromResponse(*http.Response) (*RobotsData, error) to initialize robots data
    from an HTTP response. It does not call response.Body.Close()::

    robots, err := robotstxt.FromResponse(resp)
    resp.Body.Close()
    if err != nil {
        log.Println("Error parsing robots.txt:", err.Error())
    }

  • FromStatusAndBytes(statusCode int, body []byte) (*RobotsData, error) or
    FromStatusAndString if you prefer to fetch and read the bytes (string) yourself.
    Passing the status code applies the following logic, in line with Google's
    interpretation of robots.txt files (see the sketch after this list):

    • status 2xx -> parse body with FromBytes and apply the rules listed there.
    • status 4xx -> allow all (even 401/403, as recommended by Google).
    • other (5xx) -> disallow all, consider this temporary unavailability.
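
A minimal sketch of fetching robots.txt yourself and feeding the status code
through that logic; the URL and error handling here are illustrative, not part
of this library::

    package main

    import (
        "io"
        "log"
        "net/http"

        "github.com/temoto/robotstxt"
    )

    func main() {
        // Illustrative URL; substitute the site you are actually crawling.
        resp, err := http.Get("https://example.com/robots.txt")
        if err != nil {
            log.Fatal(err) // how to treat network errors is up to you
        }
        defer resp.Body.Close()

        body, err := io.ReadAll(resp.Body)
        if err != nil {
            log.Fatal(err)
        }

        // 2xx parses the body, 4xx allows all, 5xx disallows all (per the rules above).
        robots, err := robotstxt.FromStatusAndBytes(resp.StatusCode, body)
        if err != nil {
            log.Println("Error parsing robots.txt:", err)
            return
        }

        log.Println("allowed:", robots.TestAgent("/", "FooBot"))
    }
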
2. Query
^^^^^^^^

Parsing robots.txt content builds a kind of logic database, which you can
query with (r *RobotsData) TestAgent(url, agent string) bool.

Passing the agent explicitly is useful if you want to query for different agents. For
single-agent use there is a more efficient option: RobotsData.FindGroup(userAgent string)
returns a structure with a .Test(path string) method and a .CrawlDelay time.Duration field.

Simple query with explicit user agent. Each call will scan all rules.

::

allow := robots.TestAgent("/", "FooBot")

Or query several paths against the same user agent for better performance.

::

group := robots.FindGroup("BarBot")
group.Test("/")
group.Test("/download.mp3")
group.Test("/news/article-2012-1")

Who

Honorable contributors (in no particular order):

* Ilya Grigorik (igrigorik)
* Martin Angers (PuerkitoBio)
* Micha Gorelick (mynameisfiber)

Initial commit and other: Sergey Shepelev temotor@gmail.com

Flair

.. image:: https://travis-ci.org/temoto/robotstxt.svg?branch=master
   :target: https://travis-ci.org/temoto/robotstxt

.. image:: https://codecov.io/gh/temoto/robotstxt/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/temoto/robotstxt

.. image:: https://goreportcard.com/badge/github.com/temoto/robotstxt
   :target: https://goreportcard.com/report/github.com/temoto/robotstxt
