datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble

datasketch: Big Data Looks Small
================================

.. image:: https://travis-ci.org/ekzhu/datasketch.svg?branch=master
    :target: https://travis-ci.org/ekzhu/datasketch

.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.290602.svg
    :target: https://doi.org/10.5281/zenodo.290602

datasketch gives you probabilistic data structures that can process and
search very large amounts of data very quickly, with little loss of
accuracy.

This package contains the following data sketches:

+-------------------------+-----------------------------------------------+
| Data Sketch             | Usage                                         |
+=========================+===============================================+
| `MinHash`_              | estimate Jaccard similarity and cardinality   |
+-------------------------+-----------------------------------------------+
| `Weighted MinHash`_     | estimate weighted Jaccard similarity          |
+-------------------------+-----------------------------------------------+
| `HyperLogLog`_          | estimate cardinality                          |
+-------------------------+-----------------------------------------------+
| `HyperLogLog++`_        | estimate cardinality                          |
+-------------------------+-----------------------------------------------+
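
As a rough illustration of the sketches listed above, the following
compares a MinHash estimate with the exact Jaccard similarity of two
sets, and uses HyperLogLog to estimate a cardinality. The helper
``minhash_of`` and the sample data are made up for this example, and the
``num_perm`` and ``p`` values are only starting points: larger values
trade memory for accuracy.

.. code:: python

    from datasketch import MinHash, HyperLogLog


    def minhash_of(tokens, num_perm=128):
        """Build a MinHash sketch from an iterable of string tokens."""
        m = MinHash(num_perm=num_perm)
        for token in tokens:
            m.update(token.encode("utf8"))  # update() expects bytes
        return m


    # Two sets of 1000 items each that share 500 items.
    set_a = set("item-%d" % i for i in range(0, 1000))
    set_b = set("item-%d" % i for i in range(500, 1500))

    # Exact Jaccard similarity, for comparison: 500 / 1500.
    true_jaccard = len(set_a & set_b) / float(len(set_a | set_b))

    # Estimate from the two sketches; both must use the same num_perm.
    estimated_jaccard = minhash_of(set_a).jaccard(minhash_of(set_b))

    # HyperLogLog estimates the number of distinct items seen.
    hll = HyperLogLog(p=8)
    for token in set_a:
        hll.update(token.encode("utf8"))
    estimated_count = hll.count()

    print("Jaccard: exact %.3f, estimated %.3f" % (true_jaccard, estimated_jaccard))
    print("Cardinality: exact %d, estimated %.0f" % (len(set_a), estimated_count))

The estimates will not match the exact values; how close they come
depends on ``num_perm`` and ``p``.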

The following indexes for data sketches are provided to support
sub-linear query time:

+---------------------------+-----------------------------+------------------------+
| Index                     | For Data Sketch             | Supported Query Type   |
+===========================+=============================+========================+
| `MinHash LSH`_            | MinHash, Weighted MinHash   | Jaccard Threshold      |
+---------------------------+-----------------------------+------------------------+
| `MinHash LSH Forest`_     | MinHash, Weighted MinHash   | Jaccard Top-K          |
+---------------------------+-----------------------------+------------------------+
| `MinHash LSH Ensemble`_   | MinHash                     | Containment Threshold  |
+---------------------------+-----------------------------+------------------------+
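
A minimal sketch of the two Jaccard query types above, using a handful
of invented documents: MinHash LSH returns keys whose estimated Jaccard
similarity is likely above a threshold, while MinHash LSH Forest returns
approximate top-k matches. The document texts, keys, and the
``minhash_of`` helper are assumptions made for illustration only.

.. code:: python

    from datasketch import MinHash, MinHashLSH, MinHashLSHForest


    def minhash_of(tokens, num_perm=128):
        """Build a MinHash sketch from an iterable of string tokens."""
        m = MinHash(num_perm=num_perm)
        for token in tokens:
            m.update(token.encode("utf8"))
        return m


    docs = {
        "doc1": "minhash is a probabilistic data structure for estimating set similarity",
        "doc2": "minhash is a probabilistic data structure for estimating jaccard similarity",
        "doc3": "hyperloglog estimates the number of distinct elements in a stream",
    }
    sketches = dict((key, minhash_of(text.split())) for key, text in docs.items())

    # Threshold query: keys whose Jaccard similarity is probably >= 0.5.
    lsh = MinHashLSH(threshold=0.5, num_perm=128)
    for key, sketch in sketches.items():
        lsh.insert(key, sketch)
    candidates = lsh.query(sketches["doc1"])

    # Top-k query: at most 2 approximate nearest keys.
    forest = MinHashLSHForest(num_perm=128)
    for key, sketch in sketches.items():
        forest.add(key, sketch)
    forest.index()  # must be called before querying
    top_2 = forest.query(sketches["doc1"], 2)

    print("LSH candidates:", sorted(candidates))
    print("Forest top-2:", sorted(top_2))

Both indexes are approximate, so results can include false positives or
miss true matches. Here ``doc1`` is always among the LSH candidates
because the query sketch is identical to the one that was inserted.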

datasketch requires Python 2.7 or above and NumPy 1.11 or above. SciPy
is optional, but with it LSH initialization can be much faster.

Note that `MinHash LSH`_ and `MinHash LSH Ensemble`_ also support Redis and
Cassandra storage layers (see `MinHash LSH at Scale`_).
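
For the Redis storage layer, a configuration might look like the sketch
below. It assumes a Redis server reachable at ``localhost:6379`` and the
``storage_config`` layout described in the linked documentation. The
helper names ``redis_storage_config`` and ``make_redis_lsh`` are
hypothetical, so check the documentation for the options your version
supports.

.. code:: python

    def redis_storage_config(host="localhost", port=6379):
        """Return a storage_config dict that selects the Redis backend."""
        return {"type": "redis", "redis": {"host": host, "port": port}}


    def make_redis_lsh(threshold=0.5, num_perm=128, host="localhost", port=6379):
        """Create a MinHashLSH index backed by Redis (needs a running server)."""
        # Imported here so the config helper can be used without datasketch.
        from datasketch import MinHashLSH

        return MinHashLSH(
            threshold=threshold,
            num_perm=num_perm,
            storage_config=redis_storage_config(host, port),
        )


    print(redis_storage_config())

Calling ``make_redis_lsh()`` requires the ``redis`` extra to be installed
and a Redis server to be running at the given host and port.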

Install
-------

To install datasketch using pip:

::

    pip install datasketch

This will also install NumPy as a dependency.

To install with the Redis dependency:

::

    pip install datasketch[redis]

To install with the Cassandra dependency:

::

    pip install datasketch[cassandra]

To install with SciPy for faster MinHashLSH initialization:

::

    pip install datasketch[scipy]

.. _MinHash: https://ekzhu.github.io/datasketch/minhash.html
.. _Weighted MinHash: https://ekzhu.github.io/datasketch/weightedminhash.html
.. _HyperLogLog: https://ekzhu.github.io/datasketch/hyperloglog.html
.. _HyperLogLog++: https://ekzhu.github.io/datasketch/hyperloglog.html#hyperloglog-plusplus
.. _MinHash LSH: https://ekzhu.github.io/datasketch/lsh.html
.. _MinHash LSH Forest: https://ekzhu.github.io/datasketch/lshforest.html
.. _MinHash LSH Ensemble: https://ekzhu.github.io/datasketch/lshensemble.html
.. _MinHash LSH at Scale: http://ekzhu.github.io/datasketch/lsh.html#minhash-lsh-at-scale
