MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble

datasketch: Big Data Looks Small
================================

.. image:: https://travis-ci.org/ekzhu/datasketch.svg?branch=master
   :target: https://travis-ci.org/ekzhu/datasketch

.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.290602.svg
   :target: https://doi.org/10.5281/zenodo.290602

datasketch gives you probabilistic data structures that can process and
search very large amounts of data quickly, with little loss of
accuracy.

This package contains the following data sketches:

+-------------------------+-----------------------------------------------+
| Data Sketch             | Usage                                         |
+=========================+===============================================+
| `MinHash`_              | estimate Jaccard similarity and cardinality   |
+-------------------------+-----------------------------------------------+
| `Weighted MinHash`_     | estimate weighted Jaccard similarity          |
+-------------------------+-----------------------------------------------+
| `HyperLogLog`_          | estimate cardinality                          |
+-------------------------+-----------------------------------------------+
| `HyperLogLog++`_        | estimate cardinality                          |
+-------------------------+-----------------------------------------------+
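
As a minimal sketch of how two of these sketches are typically used, the
example below estimates the Jaccard similarity of two small token sets with
``MinHash`` and the number of distinct values in a stream with
``HyperLogLog``. The token sets, the helper ``minhash_of``, and the parameter
choices (``num_perm=128``, ``p=12``) are illustrative assumptions, not
recommendations from this README.

.. code:: python

    from datasketch import HyperLogLog, MinHash

    # Two hypothetical token sets that share 10 of their 14 distinct tokens.
    set_a = {"minhash", "is", "a", "probabilistic", "data", "structure",
             "for", "estimating", "the", "similarity", "between", "datasets"}
    set_b = {"minhash", "is", "a", "probability", "data", "structure",
             "for", "estimating", "the", "similarity", "between", "documents"}


    def minhash_of(tokens, num_perm=128):
        """Build a MinHash sketch from an iterable of string tokens."""
        sketch = MinHash(num_perm=num_perm)
        for token in tokens:
            # update() expects bytes, so encode each token first.
            sketch.update(token.encode("utf8"))
        return sketch


    # Estimated vs. exact Jaccard similarity of the two sets.
    estimated = minhash_of(set_a).jaccard(minhash_of(set_b))
    exact = len(set_a & set_b) / len(set_a | set_b)
    print("estimated Jaccard:", estimated, "exact:", exact)

    # Estimate the number of distinct items in a stream of 10,000 values.
    hll = HyperLogLog(p=12)  # 2**12 registers; more registers, less error
    for i in range(10000):
        hll.update(str(i).encode("utf8"))
    distinct_estimate = hll.count()
    print("estimated distinct count:", distinct_estimate)

Both results are estimates: a larger ``num_perm`` or ``p`` generally reduces
the error at the cost of more memory and computation.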

The following indexes for data sketches are provided to support
sub-linear query time:

+---------------------------+-----------------------------+------------------------+
| Index                     | For Data Sketch             | Supported Query Type   |
+===========================+=============================+========================+
| `MinHash LSH`_            | MinHash, Weighted MinHash   | Jaccard Threshold      |
+---------------------------+-----------------------------+------------------------+
| `MinHash LSH Forest`_     | MinHash, Weighted MinHash   | Jaccard Top-K          |
+---------------------------+-----------------------------+------------------------+
| `MinHash LSH Ensemble`_   | MinHash                     | Containment Threshold  |
+---------------------------+-----------------------------+------------------------+
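
The two Jaccard query types can be illustrated with a short sketch: a
threshold query with ``MinHashLSH`` and a top-k query with
``MinHashLSHForest``. The documents, keys, helper ``minhash_of``, and the
values ``threshold=0.5`` and ``NUM_PERM = 128`` are made up for illustration;
results are approximate, so near-threshold documents may or may not be
returned.

.. code:: python

    from datasketch import MinHash, MinHashLSH, MinHashLSHForest

    NUM_PERM = 128  # assumed sketch size; the index and sketches must agree


    def minhash_of(tokens):
        """Build a MinHash sketch from an iterable of string tokens."""
        sketch = MinHash(num_perm=NUM_PERM)
        for token in tokens:
            sketch.update(token.encode("utf8"))
        return sketch


    # Hypothetical corpus: doc1 and doc2 overlap heavily, doc3 is unrelated.
    docs = {
        "doc1": {"big", "data", "looks", "small", "with", "sketches"},
        "doc2": {"big", "data", "looks", "tiny", "with", "sketches"},
        "doc3": {"an", "entirely", "different", "set", "of", "words"},
    }
    sketches = {key: minhash_of(tokens) for key, tokens in docs.items()}

    lsh = MinHashLSH(threshold=0.5, num_perm=NUM_PERM)
    forest = MinHashLSHForest(num_perm=NUM_PERM)
    for key, sketch in sketches.items():
        lsh.insert(key, sketch)   # threshold index
        forest.add(key, sketch)   # top-k index
    forest.index()  # the forest must be indexed before it can be queried

    query = minhash_of(docs["doc1"])

    # Keys whose estimated Jaccard similarity is likely above the threshold.
    above_threshold = lsh.query(query)

    # Approximately the 2 most similar keys.
    top_2 = forest.query(query, 2)

    print("threshold query:", sorted(above_threshold))
    print("top-2 query:", sorted(top_2))

Neither index compares the query against every stored sketch, which is what
makes the sub-linear query time possible.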

datasketch requires Python 2.7 or above and NumPy 1.11 or above. SciPy
is optional, but with it LSH initialization can be much faster.

Note that `MinHash LSH`_ and `MinHash LSH Ensemble`_ also support Redis and
Cassandra storage layers (see `MinHash LSH at Scale`_).
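
As a rough sketch of what selecting the Redis storage layer looks like, the
snippet below passes a ``storage_config`` dictionary to ``MinHashLSH``. The
host and port are assumed values for a local Redis server, and
``build_redis_backed_lsh`` is a hypothetical helper name. Creating the index
requires the Redis extra to be installed and a reachable server, so the
helper is defined here but not called.

.. code:: python

    # Assumed connection details for a local Redis server; adjust as needed.
    REDIS_STORAGE_CONFIG = {
        "type": "redis",
        "redis": {"host": "localhost", "port": 6379},
    }


    def build_redis_backed_lsh(threshold=0.5, num_perm=128):
        """Create a MinHashLSH index that keeps its hash tables in Redis."""
        # Imported here so the configuration above can be inspected even
        # when the Redis extra (pip install datasketch[redis]) is missing.
        from datasketch import MinHashLSH

        return MinHashLSH(
            threshold=threshold,
            num_perm=num_perm,
            storage_config=REDIS_STORAGE_CONFIG,
        )

Once created, such an index is used with the same ``insert`` and ``query``
calls as the in-memory one.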

Install
-------

To install datasketch using pip:

::

    pip install datasketch

This will also install NumPy as a dependency.

To install with the Redis dependency:

::

    pip install datasketch[redis]

To install with the Cassandra dependency:

::

    pip install datasketch[cassandra]

To install with SciPy for faster MinHashLSH initialization:

::

    pip install datasketch[scipy]

.. _MinHash: https://ekzhu.github.io/datasketch/minhash.html
.. _Weighted MinHash: https://ekzhu.github.io/datasketch/weightedminhash.html
.. _HyperLogLog: https://ekzhu.github.io/datasketch/hyperloglog.html
.. _HyperLogLog++: https://ekzhu.github.io/datasketch/hyperloglog.html#hyperloglog-plusplus
.. _MinHash LSH: https://ekzhu.github.io/datasketch/lsh.html
.. _MinHash LSH Forest: https://ekzhu.github.io/datasketch/lshforest.html
.. _MinHash LSH Ensemble: https://ekzhu.github.io/datasketch/lshensemble.html
.. _Minhash LSH at Scale: http://ekzhu.github.io/datasketch/lsh.html#minhash-lsh-at-scale

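Key metrics
-----------

Overview

- Name and owner: ekzhu/datasketch
- Primary language: Python
- Languages: Python (number of languages: 2)
- License: MIT License

Owner activity

- Created: 2015-03-20 09:21:46
- Last push: 2025-11-05 13:46:59
- Last commit: 2025-11-05 13:46:07
- Releases: 34
- Latest release: v1.7.0
- First release: v0.2.1

User engagement

- Stars: 2.8k
- Watchers: 45
- Forks: 310
- Commits: 253
- Issues: 170
- Open issues: 48
- Pull requests: 78
- Open pull requests: 6
- Closed pull requests: 13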