wpull

Wget-compatible web downloader and crawler.

  • 所有者: ArchiveTeam/wpull
  • 平台:
  • 许可证: GNU General Public License v3.0
  • 分类:
  • 主题:
  • 喜欢:
    0
      比较:

Github星跟踪图

=====
Wpull

Wpull is a Wget-compatible (or remake/clone/replacement/alternative) web
downloader and crawler.

.. image:: https://raw.githubusercontent.com/chfoo/wpull/master/icon/wpull_logo_full.png
:target: https://github.com/chfoo/wpull
:alt: A dog pulling a box via a harness.

Notable Features:

  • Written in Python: lightweight, modifiable, robust, & scriptable
  • Graceful stopping; on-disk database resume
  • PhantomJS & youtube-dl integration (experimental)

Install

Wpull uses Python 3 <http://python.org/download/>_.

Once Python is installed, download Wpull from PyPI using pip::

pip3 install wpull

For detailed installation instructions and potential caveats, please see
https://wpull.readthedocs.io/en/master/install.html.

Example Commands

To download the About page of Google.com::

wpull google.com/about

To archive a website::

wpull billy.blogsite.example \
    --warc-file blogsite-billy \
    --no-check-certificate \
    --no-robots --user-agent "InconspiuousWebBrowser/1.0" \
    --wait 0.5 --random-wait --waitretry 600 \
    --page-requisites --recursive --level inf \
    --span-hosts-allow linked-pages,page-requisites \
    --escaped-fragment --strip-session-id \
    --sitemaps \
    --reject-regex "/login\.php" \
    --tries 3 --retry-connrefused --retry-dns-error \
    --timeout 60 --session-timeout 21600 \
    --delete-after --database blogsite-billy.db \
    --quiet --output-file blogsite-billy.log

To see all options::

wpull --help

Documentation

Documentation is located at https://wpull.readthedocs.io/. Please have
a look at it before using Wpull's advanced features.

Help

Need help? Please see our Help <https://wpull.readthedocs.io/en/master/help.html>_ page which contains
frequently asked questions and support information.

The issue tracker is located at https://github.com/chfoo/wpull/issues.

Dev

.. image:: https://travis-ci.org/ArchiveTeam/wpull.png
:target: https://travis-ci.org/ArchiveTeam/wpull
:alt: Travis CI build status

.. image:: https://coveralls.io/repos/chfoo/wpull/badge.png
:target: https://coveralls.io/r/chfoo/wpull
:alt: Coveralls report

Contributions and feedback are greatly appreciated.

Credits

Copyright 2013-2016 by Christopher Foo and others. License GPL v3.

This project contains third-party source code licensed under different terms:

  • wpull.backport.logging
  • wpull.thirdparty.robotexclusionrulesparser
  • wpull.thirdparty.dammit

We would like to acknowledge the authors of GNU Wget as Wpull uses algorithms
from Wget.

主要指标

概览
名称与所有者ArchiveTeam/wpull
主编程语言HTML
编程语言Shell (语言数: 7)
平台
许可证GNU General Public License v3.0
所有者活动
创建于2013-12-07 13:03:15
推送于2024-04-29 12:41:59
最后一次提交2019-10-30 13:25:06
发布数97
最新版本名称v2.0.3 (发布于 2018-10-10 16:44:56)
第一版名称v0.4 (发布于 2014-01-15 15:50:33)
用户参与
星数587
关注者数22
派生数83
提交数2k
已启用问题?
问题数460
打开的问题数182
拉请求数12
打开的拉请求数23
关闭的拉请求数3
项目设置
已启用Wiki?
已存档?
是复刻?
已锁定?
是镜像?
是私有?