python-simhash

An efficient simhash implementation for Python

  • Owner: scrapinghub/python-simhash
  • License: BSD 3-Clause "New" or "Revised" License


=======
simhash
=======

This is an efficient implementation of several functions that are useful for near-duplicate detection based on `Charikar's simhash <http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf>`_. It is a Python module, written in C with GCC extensions, and includes the following functions:

fingerprint
  Generates a fingerprint from a sequence of hashes.

weighted_fingerprint
  Generates a fingerprint from a sequence of (long long, weight) tuples.

fnvhash
  Generates an FNV-1a hash from a string.

hamming_distance
  Calculates the number of bits that differ between two long long integers.

simpair_indices
  Finds the indices of hashes in a sequence that differ by fewer than a certain number of bits. It includes arguments for rotating and grouping hashes. It can be used to help efficiently implement online or batch near-duplicate detection, for example as described in `Detecting Near-Duplicates for Web Crawling <http://www.wwwconference.org/www2007/papers/paper215.pdf>`_ by Gurmeet Manku, Arvind Jain, and Anish Das Sarma.
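
To make the roles of these functions concrete, here is a minimal pure-Python sketch of the ideas behind them: a 64-bit FNV-1a hash, Charikar-style fingerprinting by per-bit voting, and a Hamming distance. It is not the module's C code. The names ``fnv1a_64``, ``simhash_fingerprint`` and ``bit_distance`` are hypothetical, and the unsigned 64-bit output and the handling of tied votes are assumptions made for illustration.

```python
FNV64_OFFSET = 0xcbf29ce484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001b3        # standard 64-bit FNV prime
MASK64 = (1 << 64) - 1


def fnv1a_64(data):
    """64-bit FNV-1a over a bytes object: XOR each byte in, then multiply."""
    h = FNV64_OFFSET
    for byte in bytearray(data):
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def simhash_fingerprint(weighted_hashes, bits=64):
    """Build a fingerprint from (hash, weight) pairs by per-bit voting.

    Each hash adds its weight to the tally of every bit it has set and
    subtracts it from every bit it has clear. A fingerprint bit is 1 when
    its tally is positive (assumption: a tie yields 0).
    """
    votes = [0] * bits
    for h, weight in weighted_hashes:
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, tally in enumerate(votes):
        if tally > 0:
            fingerprint |= 1 << i
    return fingerprint


def bit_distance(a, b):
    """Number of bit positions in which two 64-bit values differ."""
    return bin((a ^ b) & MASK64).count("1")


# Unweighted use: give every feature hash a weight of 1.
words = "some text we want to hash".split()
fp = simhash_fingerprint((fnv1a_64(w.encode("utf-8")), 1) for w in words)
```

Similar inputs share most of their feature hashes, so most per-bit tallies keep their sign and the resulting fingerprints differ in only a few bits. That is why a small Hamming distance is used as the near-duplicate signal.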

Example usage
=============

Generate hashes::

    >>> from simhash import fingerprint
    >>> hash1 = fingerprint(map(hash, "some text we want to hash"))
    >>> hash2 = fingerprint(map(hash, "some more text we want to hash"))
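
The example above maps the built-in ``hash`` over the individual characters of each string. In Python 3, ``hash`` on strings is salted per process unless ``PYTHONHASHSEED`` is set, so fingerprints built this way are not comparable across runs. The sketch below shows one possible alternative: stable hashes over word shingles. The helper ``shingle_hashes`` and its ``width`` parameter are hypothetical and not part of this module. Returning signed 64-bit values is an assumption based on the "long long" wording in the function list.

```python
import hashlib
import struct


def shingle_hashes(text, width=3):
    """Return one signed 64-bit hash per run of `width` consecutive words.

    Texts with fewer than `width` words produce a single shingle that
    contains all of their words.
    """
    words = text.lower().split()
    count = max(len(words) - width + 1, 1)
    hashes = []
    for i in range(count):
        shingle = " ".join(words[i:i + width])
        # An 8-byte BLAKE2b digest is stable across processes and machines.
        digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
        # "<q" unpacks the 8 bytes as a little-endian signed 64-bit integer.
        hashes.append(struct.unpack("<q", digest)[0])
    return hashes


features = shingle_hashes("some text we want to hash")
```

A list such as ``features`` could then be passed wherever the example passes ``map(hash, ...)``, assuming ``fingerprint`` accepts any sequence of 64-bit integers.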

Measure distance between hashes::

    >>> from simhash import hamming_distance
    >>> hamming_distance(hash1, hash2)
    2L
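
The description of ``simpair_indices`` mentions rotating and grouping hashes to find near-duplicate pairs. The pure-Python sketch below illustrates that general idea, following the block-matching argument from the Manku, Jain and Das Sarma paper: if two 64-bit hashes differ in at most ``max_bits`` bits and the hash is split into ``max_bits + 1`` blocks, at least one block must be identical. It is not the module's implementation. The names ``rotate_left``, ``near_pairs_bruteforce`` and ``near_pairs_grouped`` are hypothetical, and the inclusive ``<=`` threshold and unsigned inputs are assumptions.

```python
from itertools import combinations


def rotate_left(value, shift, bits=64):
    """Rotate an unsigned `bits`-wide integer left by `shift` positions."""
    shift %= bits
    return ((value << shift) | (value >> (bits - shift))) & ((1 << bits) - 1)


def near_pairs_bruteforce(hashes, max_bits):
    """Compare every pair: simple, but quadratic in the number of hashes."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(hashes), 2)
            if bin(a ^ b).count("1") <= max_bits]


def near_pairs_grouped(hashes, max_bits, bits=64):
    """Only compare hashes that agree exactly on at least one block."""
    blocks = max_bits + 1
    block_size = bits // blocks
    pairs = set()
    for b in range(blocks):
        # Rotate so that block `b` becomes the top bits, then group on it.
        buckets = {}
        for index, h in enumerate(hashes):
            key = rotate_left(h, b * block_size, bits) >> (bits - block_size)
            buckets.setdefault(key, []).append(index)
        # Candidates share a block; confirm them with the full distance.
        for members in buckets.values():
            for i, j in combinations(members, 2):
                if bin(hashes[i] ^ hashes[j]).count("1") <= max_bits:
                    pairs.add((i, j))
    return sorted(pairs)


sample = [0x0, 0x3, 0xFF, 0xFFFFFFFFFFFFFFFF,
          0xFFFFFFFFFFFFFFFE, 0x8000000000000001]
pairs = near_pairs_grouped(sample, max_bits=3)
```

Both functions return the same index pairs. The grouped version trades one bucketing pass per block for far fewer full comparisons when most hashes are unrelated, which is what makes the approach practical for large batches.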

This code was used in MapReduce jobs run against a large dataset of web pages as part of a prototype at Scrapinghub.

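Key metrics

Overview
  • Name and owner: scrapinghub/python-simhash
  • Primary language: C
  • Languages: Python (language count: 2)
  • License: BSD 3-Clause "New" or "Revised" License

Owner activity
  • Created: 2014-08-05 12:23:46
  • Last pushed: 2019-10-25 15:27:04
  • Last commit: 2018-04-23 20:28:51
  • Releases: 0

User engagement
  • Stars: 125
  • Watchers: 12
  • Forks: 31
  • Commits: 5
  • Issues: 3
  • Open issues: 1
  • Pull requests: 2
  • Open pull requests: 3
  • Closed pull requests: 1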