datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble


datasketch: Big Data Looks Small
================================

.. image:: https://travis-ci.org/ekzhu/datasketch.svg?branch=master
    :target: https://travis-ci.org/ekzhu/datasketch

.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.290602.svg
    :target: https://doi.org/10.5281/zenodo.290602

datasketch gives you probabilistic data structures that can process and
search very large amounts of data quickly, with little loss of
accuracy.

This package contains the following data sketches:

+-------------------------+-----------------------------------------------+
| Data Sketch             | Usage                                         |
+=========================+===============================================+
| `MinHash`_              | estimate Jaccard similarity and cardinality   |
+-------------------------+-----------------------------------------------+
| `Weighted MinHash`_     | estimate weighted Jaccard similarity          |
+-------------------------+-----------------------------------------------+
| `HyperLogLog`_          | estimate cardinality                          |
+-------------------------+-----------------------------------------------+
| `HyperLogLog++`_        | estimate cardinality                          |
+-------------------------+-----------------------------------------------+
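To make the MinHash row concrete, here is a minimal pure-Python sketch of the idea behind it. It is not datasketch's implementation: the helper names (``minhash_signature``, ``estimate_jaccard``, ``exact_jaccard``) and the use of salted SHA-1 hashes in place of random permutations are assumptions chosen for illustration.

```python
import hashlib


def _hash(token: str, seed: int) -> int:
    """Return a 64-bit hash of `token`, salted with `seed`."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_perm=128):
    """Keep the minimum hash value under each of `num_perm` salted hashes."""
    tokens = set(tokens)
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Two sets of 200 items each that share 100 items (exact Jaccard = 100 / 300).
set_a = {f"item{i}" for i in range(0, 200)}
set_b = {f"item{i}" for i in range(100, 300)}

sig_a = minhash_signature(set_a)
sig_b = minhash_signature(set_b)

print("exact:    ", exact_jaccard(set_a, set_b))
print("estimated:", estimate_jaccard(sig_a, sig_b))
```

Each signature has a fixed size (128 integers here) regardless of how large the input set is. That fixed size is what makes a sketch cheap to store and compare, at the cost of an estimate rather than an exact value.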

The following indexes for data sketches are provided to support
sub-linear query time:

+---------------------------+-----------------------------+------------------------+
| Index                     | For Data Sketch             | Supported Query Type   |
+===========================+=============================+========================+
| `MinHash LSH`_            | MinHash, Weighted MinHash   | Jaccard Threshold      |
+---------------------------+-----------------------------+------------------------+
| `MinHash LSH Forest`_     | MinHash, Weighted MinHash   | Jaccard Top-K          |
+---------------------------+-----------------------------+------------------------+
| `MinHash LSH Ensemble`_   | MinHash                     | Containment Threshold  |
+---------------------------+-----------------------------+------------------------+
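The sub-linear query time of an LSH index comes from looking up hash buckets instead of comparing the query against every stored sketch. The following is a toy sketch of the common banding scheme, written in pure Python under stated assumptions: the class ``BandedLSH``, its ``bands`` and ``rows`` parameters, and the sample documents are hypothetical, and datasketch's own indexes may differ in their details.

```python
import hashlib
from collections import defaultdict


def minhash_signature(tokens, num_perm=32):
    """Signature made of the minimum salted hash per hash function."""
    def h(token, seed):
        digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return [min(h(t, seed) for t in set(tokens)) for seed in range(num_perm)]


class BandedLSH:
    """Toy index: split each signature into bands and bucket keys by band."""

    def __init__(self, bands=8, rows=4):
        # Signatures must have exactly bands * rows entries.
        self.bands, self.rows = bands, rows
        # One hash table per band: band contents -> set of keys.
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _band_keys(self, signature):
        for i in range(self.bands):
            yield i, tuple(signature[i * self.rows:(i + 1) * self.rows])

    def insert(self, key, signature):
        for i, band in self._band_keys(signature):
            self.tables[i][band].add(key)

    def query(self, signature):
        """Return keys that share at least one band with the query signature."""
        candidates = set()
        for i, band in self._band_keys(signature):
            candidates |= self.tables[i].get(band, set())
        return candidates


docs = {
    "a": "minhash estimates jaccard similarity between sets".split(),
    "b": "minhash estimates jaccard similarity between two sets".split(),
    "c": "hyperloglog estimates the cardinality of a stream".split(),
}

index = BandedLSH()
for key, tokens in docs.items():
    index.insert(key, minhash_signature(tokens))

# The query touches only `bands` buckets, however many keys are stored.
result = index.query(minhash_signature(docs["a"]))
print(result)
```

A key that was inserted with the same set as the query is always returned, because all of its bands match. Similar sets are returned with high probability and dissimilar sets with low probability. More rows per band raises the effective similarity threshold, and more bands lowers it.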

datasketch must be used with Python 2.7 or above and NumPy 1.11 or
above. SciPy is optional, but with it the LSH initialization can be much
faster.

Note that `MinHash LSH`_ and `MinHash LSH Ensemble`_ also support Redis and
Cassandra storage layers (see `MinHash LSH at Scale`_).

Install
-------

To install datasketch using pip:

::

    pip install datasketch

This will also install NumPy as a dependency.

To install with the Redis dependency:

::

    pip install datasketch[redis]

To install with the Cassandra dependency:

::

    pip install datasketch[cassandra]

To install with SciPy for faster MinHashLSH initialization:

::

    pip install datasketch[scipy]

.. _MinHash: https://ekzhu.github.io/datasketch/minhash.html
.. _Weighted MinHash: https://ekzhu.github.io/datasketch/weightedminhash.html
.. _HyperLogLog: https://ekzhu.github.io/datasketch/hyperloglog.html
.. _HyperLogLog++: https://ekzhu.github.io/datasketch/hyperloglog.html#hyperloglog-plusplus
.. _MinHash LSH: https://ekzhu.github.io/datasketch/lsh.html
.. _MinHash LSH Forest: https://ekzhu.github.io/datasketch/lshforest.html
.. _MinHash LSH Ensemble: https://ekzhu.github.io/datasketch/lshensemble.html
.. _Minhash LSH at Scale: http://ekzhu.github.io/datasketch/lsh.html#minhash-lsh-at-scale

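Key metrics

Overview
Name and owner: ekzhu/datasketch
Primary language: Python
Languages: Python (2 languages in total)
License: MIT License
Owner activity
Created: 2015-03-20 01:21:46
Pushed: 2024-06-04 00:43:43
Last commit: 2024-03-26 14:24:40
Releases: 33
Latest release: v1.6.5
First release: v0.2.1
User engagement
Stars: 2.7k
Watchers: 48
Forks: 302
Commits: 237
Issues: 167
Open issues: 49
Pull requests: 65
Open pull requests: 6
Closed pull requests: 8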