php-spider

A configurable and extensible PHP web spider

  • 所有者: mvdbos/php-spider
  • 平台:
  • 许可证: MIT License
  • 分类:
  • 主题:
  • 喜欢:
    0
      比较:

Github星跟踪图

Build Status
Latest Stable Version
Total Downloads
License

PHP-Spider Features

  • supports two traversal algorithms: breadth-first and depth-first
  • supports crawl depth limiting, queue size limiting and max downloads limiting
  • supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
  • comes with a useful set of URI filters, such as Domain limiting
  • supports custom URI filters, both prefetch (URI) and postfetch (Resource content)
  • supports custom request handling logic
  • comes with a useful set of persistence handlers (memory, file. Redis soon to follow)
  • supports custom persistence handlers
  • collects statistics about the crawl for reporting
  • dispatches useful events, allowing developers to add even more custom behavior
  • supports a politeness policy
  • will soon come with many default discoverers: RSS, Atom, RDF, etc.
  • will soon support multiple queueing mechanisms (file, memcache, redis)
  • will eventually support distributed spidering with a central queue

Installation

The easiest way to install PHP-Spider is with composer. Find it on Packagist.

Usage

This is a very simple example. This code can be found in example/example_simple.php. For a more complete example with some logging, caching and filters, see example/example_complex.php. That file contains a more real-world example.

First create the spider

$spider = new Spider('http://www.dmoz.org');

Add a URI discoverer. Without it, the spider does nothing. In this case, we want all <a> nodes from a certain <div>

$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//div[@id='catalogs']//a"));

Set some sane options for this example. In this case, we only get the first 10 items from the start page.

$spider->getDiscovererSet()->maxDepth = 1;
$spider->getQueueManager()->maxQueueSize = 10;

Add a listener to collect stats from the Spider and the QueueManager.
There are more components that dispatch events you can use.

$statsHandler = new StatsHandler();
$spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
$spider->getDispatcher()->addSubscriber($statsHandler);

Execute the crawl

$spider->crawl();

When crawling is done, we could get some info about the crawl

echo "\n  ENQUEUED:  " . count($statsHandler->getQueued());
echo "\n  SKIPPED:   " . count($statsHandler->getFiltered());
echo "\n  FAILED:    " . count($statsHandler->getFailed());
echo "\n  PERSISTED:    " . count($statsHandler->getPersisted());

Finally we could do some processing on the downloaded resources. In this example, we will echo the title of all resources

echo "\n\nDOWNLOADED RESOURCES: ";
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    echo "\n - " . $resource->getCrawler()->filterXpath('//title')->text();
}

Contributing

Contributing to PHP-Spider is as easy as Forking the repository on Github and submitting a Pull Request.
The Symfony documentation contains an excellent guide for how to do that properly here: Submitting a Patch.

There a few requirements for a Pull Request to be accepted:

  • Follow the coding standards: PHP-Spider follows the coding standards defined in the PSR-0, PSR-1 and PSR-2 Coding Style Guides;
  • Prove that the code works with unit tests;

Note: An easy way to check if your code conforms to PHP-Spider is by running PHP CodeSniffer on your local code. Please make sure you use the PSR-2 standard: --standard=PSR2

Support

For things like reporting bugs and requesting features it is best to create an issue here on GitHub. It is even better to accompany it with a Pull Request. ;-)

License

PHP-Spider is licensed under the MIT license.

主要指标

概览
名称与所有者mvdbos/php-spider
主编程语言PHP
编程语言PHP (语言数: 3)
平台
许可证MIT License
所有者活动
创建于2013-02-25 19:27:52
推送于2025-03-02 19:42:24
最后一次提交2025-02-15 23:25:33
发布数19
最新版本名称v0.7.3 (发布于 )
第一版名称v0.1 (发布于 )
用户参与
星数1.3k
关注者数84
派生数231
提交数256
已启用问题?
问题数48
打开的问题数4
拉请求数54
打开的拉请求数0
关闭的拉请求数6
项目设置
已启用Wiki?
已存档?
是复刻?
已锁定?
是镜像?
是私有?