python-simhash

An efficient simhash implementation for Python

  • Owner: scrapinghub/python-simhash
  • License: BSD 3-Clause "New" or "Revised" License

=======
simhash
=======

This is an efficient implementation of several functions that are useful for near-duplicate detection based on `Charikar's simhash <http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf>`_. It is a Python module, written in C with GCC extensions, and includes the following functions:

fingerprint
    Generates a fingerprint from a sequence of hashes.

weighted_fingerprint
    Generates a fingerprint from a sequence of (long long, weight) tuples.

fnvhash
    Generates an FNV-1a hash from a string.

hamming_distance
    Calculates the number of bits that differ between two long long integers.

simpair_indices
    Finds the indices of hashes in a sequence that differ by fewer than a given number of bits. It includes arguments for rotating and grouping hashes. It can be used to help implement online or batch near-duplicate detection efficiently, for example as described in `Detecting Near-Duplicates for Web Crawling <http://www.wwwconference.org/www2007/papers/paper215.pdf>`_ by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma.
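To make the descriptions above concrete, here is a minimal pure-Python sketch of what the hashing, fingerprinting, and distance functions compute. It is an illustration only, not the module's C implementation: the names ``fnv1a_64``, ``weighted_fingerprint_sketch``, ``fingerprint_sketch``, and ``hamming_distance_sketch`` are hypothetical, and details such as tie handling and the treatment of signed values are assumptions.

```python
# Illustrative pure-Python sketch; NOT the module's C code.
# All function names below are hypothetical.

MASK64 = (1 << 64) - 1
FNV64_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3        # standard 64-bit FNV prime


def fnv1a_64(data):
    """64-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def weighted_fingerprint_sketch(pairs, bits=64):
    """Simhash over (hash, weight) pairs.

    Each bit position keeps a running total: +weight when the bit is set
    in a hash, -weight when it is clear. The fingerprint has a 1 wherever
    the total is positive. Treating a tie (total == 0) as 0 is an assumption.
    """
    totals = [0] * bits
    for h, weight in pairs:
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    fp = 0
    for i, total in enumerate(totals):
        if total > 0:
            fp |= 1 << i
    return fp


def fingerprint_sketch(hashes, bits=64):
    """Unweighted simhash: every hash counts with weight 1."""
    return weighted_fingerprint_sketch(((h, 1) for h in hashes), bits)


def hamming_distance_sketch(a, b):
    """Number of differing bits, viewing both values as 64-bit words."""
    return bin((a ^ b) & MASK64).count("1")
```

Because each bit is decided by a majority vote over the input hashes, documents that share most of their features tend to produce fingerprints that differ in only a few bits, which is what makes the Hamming distance a useful similarity measure here.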

Example usage
=============

Generate hashes::

    >>> from simhash import fingerprint
    >>> # map(hash, text) hashes each character, so the features here are single characters
    >>> hash1 = fingerprint(map(hash, "some text we want to hash"))
    >>> hash2 = fingerprint(map(hash, "some more text we want to hash"))
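The example above uses the built-in ``hash`` on individual characters. Under Python 3, string hashes are salted per process unless ``PYTHONHASHSEED`` is set, so fingerprints built that way would not be comparable across runs. A minimal sketch of one alternative, assuming word shingles as features and a truncated BLAKE2b digest as the feature hash, follows. The helper names ``stable_hash64`` and ``shingle_hashes`` are hypothetical, and producing signed 64-bit values is an assumption based on the "long long" wording above.

```python
import hashlib


def stable_hash64(token):
    """Signed 64-bit hash of a token, identical across processes and runs."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    # signed=True keeps the value in the range of a C long long (assumption:
    # that is the range the fingerprint functions expect).
    return int.from_bytes(digest, "big", signed=True)


def shingle_hashes(text, width=2):
    """Hash every run of `width` consecutive words in the text."""
    words = text.lower().split()
    if not words:
        return []
    if len(words) < width:
        # Text shorter than one shingle: treat it as a single feature.
        shingles = [" ".join(words)]
    else:
        shingles = [
            " ".join(words[i:i + width])
            for i in range(len(words) - width + 1)
        ]
    return [stable_hash64(s) for s in shingles]


features = shingle_hashes("some text we want to hash")
```

The resulting list could then be passed in place of ``map(hash, ...)``, assuming ``fingerprint`` accepts any sequence of 64-bit integers. Word shingles are only one possible feature choice; the right granularity depends on the documents being compared.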

Measure distance between hashes::

    >>> from simhash import hamming_distance
    >>> hamming_distance(hash1, hash2)
    2L

This module was used in MapReduce jobs against a large dataset of web pages as part of a prototype at Scrapinghub.
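The batch near-duplicate search that ``simpair_indices`` is described as supporting can be approximated in pure Python. The sketch below is not the module's algorithm or signature. It is a simplified variant of the rotate-and-group idea, under these assumptions: the 64 bits are split into ``max_bits + 1`` blocks, each block in turn is rotated into the top bits and used as a grouping key, and only hashes in the same group are compared. If two hashes differ in at most ``max_bits`` bits, at least one block must be identical, so no qualifying pair is missed. The names ``rotate_left64`` and ``near_duplicate_pairs`` are hypothetical, and the sketch uses "at most ``max_bits``" rather than the "fewer than" wording above.

```python
from collections import defaultdict
from itertools import combinations

MASK64 = (1 << 64) - 1


def rotate_left64(x, r):
    """Rotate a 64-bit word left by r bits."""
    x &= MASK64
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64


def near_duplicate_pairs(hashes, max_bits=3):
    """Return sorted index pairs (i, j), i < j, differing in <= max_bits bits."""
    blocks = max_bits + 1
    # Block boundaries, counted from the most significant bit.
    bounds = [64 * b // blocks for b in range(blocks + 1)]
    found = set()
    for b in range(blocks):
        start, width = bounds[b], bounds[b + 1] - bounds[b]
        groups = defaultdict(list)
        for idx, h in enumerate(hashes):
            # Bring this block to the top bits and use it as the group key.
            key = rotate_left64(h, start) >> (64 - width)
            groups[key].append(idx)
        for members in groups.values():
            # Only hashes that agree exactly on this block are compared.
            for i, j in combinations(members, 2):
                diff = (hashes[i] ^ hashes[j]) & MASK64
                if bin(diff).count("1") <= max_bits:
                    found.add((i, j))
    return sorted(found)


pairs = near_duplicate_pairs([0x0, 0x7, 0xFFFF, 1 << 63], max_bits=3)
```

In a MapReduce-style setting, the grouping key would naturally serve as the shuffle key, so each reducer only compares hashes that share a block. With many hashes falling into the same group the inner comparison is still quadratic, which is why more blocks or sorted scans are typically used at scale.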

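Key metrics

Overview
Name and owner: scrapinghub/python-simhash
Primary language: C
Programming languages: Python (language count: 2)
License: BSD 3-Clause "New" or "Revised" License

Owner activity
Created: 2014-08-05 12:23:46
Last pushed: 2019-10-25 15:27:04
Last commit: 2018-04-23 20:28:51
Releases: 0

User engagement
Stars: 125
Watchers: 12
Forks: 31
Commits: 5
Issues: 3
Open issues: 1
Pull requests: 2
Open pull requests: 3
Closed pull requests: 1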