datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble

datasketch: Big Data Looks Small
================================

.. image:: https://travis-ci.org/ekzhu/datasketch.svg?branch=master
   :target: https://travis-ci.org/ekzhu/datasketch
.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.290602.svg
   :target: https://doi.org/10.5281/zenodo.290602

datasketch gives you probabilistic data structures that can process and
search very large amounts of data quickly, with little loss of
accuracy.

This package contains the following data sketches:

+-------------------------+-----------------------------------------------+
| Data Sketch             | Usage                                         |
+=========================+===============================================+
| `MinHash`_              | estimate Jaccard similarity and cardinality  |
+-------------------------+-----------------------------------------------+
| `Weighted MinHash`_     | estimate weighted Jaccard similarity          |
+-------------------------+-----------------------------------------------+
| `HyperLogLog`_          | estimate cardinality                          |
+-------------------------+-----------------------------------------------+
| `HyperLogLog++`_        | estimate cardinality                          |
+-------------------------+-----------------------------------------------+

The following indexes for data sketches are provided to support
sub-linear query time:

+---------------------------+-----------------------------+------------------------+
| Index                     | For Data Sketch             | Supported Query Type   |
+===========================+=============================+========================+
| `MinHash LSH`_            | MinHash, Weighted MinHash   | Jaccard Threshold      |
+---------------------------+-----------------------------+------------------------+
| `MinHash LSH Forest`_     | MinHash, Weighted MinHash   | Jaccard Top-K          |
+---------------------------+-----------------------------+------------------------+
| `MinHash LSH Ensemble`_   | MinHash                     | Containment Threshold  |
+---------------------------+-----------------------------+------------------------+

datasketch requires Python 2.7 or above and NumPy 1.11 or above. SciPy
is optional, but with it LSH initialization can be much faster.

Note that `MinHash LSH`_ and `MinHash LSH Ensemble`_ also support Redis and
Cassandra storage layers (see `MinHash LSH at Scale`_).

Install
-------

To install datasketch using pip:

::

    pip install datasketch

This will also install NumPy as a dependency.

To install with the Redis dependency:

::

    pip install datasketch[redis]

To install with the Cassandra dependency:

::

    pip install datasketch[cassandra]

To install with SciPy for faster MinHashLSH initialization:

::

    pip install datasketch[scipy]

.. _MinHash: https://ekzhu.github.io/datasketch/minhash.html
.. _Weighted MinHash: https://ekzhu.github.io/datasketch/weightedminhash.html
.. _HyperLogLog: https://ekzhu.github.io/datasketch/hyperloglog.html
.. _HyperLogLog++: https://ekzhu.github.io/datasketch/hyperloglog.html#hyperloglog-plusplus
.. _MinHash LSH: https://ekzhu.github.io/datasketch/lsh.html
.. _MinHash LSH Forest: https://ekzhu.github.io/datasketch/lshforest.html
.. _MinHash LSH Ensemble: https://ekzhu.github.io/datasketch/lshensemble.html
.. _Minhash LSH at Scale: http://ekzhu.github.io/datasketch/lsh.html#minhash-lsh-at-scale

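Key metrics

Overview
Name and owner: ekzhu/datasketch
Primary language: Python
Languages: Python (number of languages: 2)
Platform
License: MIT License
Owner activity
Created: 2015-03-20 01:21:46
Last push: 2024-06-04 00:43:43
Last commit: 2024-03-26 14:24:40
Releases: 33
Latest release name: v1.6.5 (released on )
First release name: v0.2.1 (released on )
User engagement
Stars: 2.7k
Watchers: 48
Forks: 302
Commits: 237
Issues enabled?
Issues: 167
Open issues: 49
Pull requests: 65
Open pull requests: 6
Closed pull requests: 8
Project settings
Wiki enabled?
Archived?
Is a fork?
Locked?
Is a mirror?
Is private?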