simhash

Go implementation of simhash algoritim

  • 所有者: mfonda/simhash
  • 平台:
  • 許可證: MIT License
  • 分類:
  • 主題:
  • 喜歡:
    0
      比較:

Github星跟蹤圖

simhash

simhash is a Go implementation of Charikar's simhash algorithm.

simhash is a hash with the useful property that similar documents produce similar hashes.
Therefore, if two documents are similar, the Hamming-distance between the simhash of the
documents will be small.

This package currently just implements the simhash algorithm. Future work will make use of this
package to enable quickly identifying near-duplicate documents within a large collection of
documents.

Installation

go get github.com/mfonda/simhash

Usage

Using simhash first requires tokenizing a document into a set of features (done through the
FeatureSet interface). This package provides an implementation, WordFeatureSet, which breaks
tokenizes the document into individual words. Better results are possible here, and future work
will go towards this.

Example usage:

package main

import (
	"fmt"
	"github.com/mfonda/simhash"
)

func main() {
	var docs = [][]byte{
		[]byte("this is a test phrase"),
		[]byte("this is a test phrass"),
		[]byte("foo bar"),
	}

	hashes := make([]uint64, len(docs))
	for i, d := range docs {
		hashes[i] = simhash.Simhash(simhash.NewWordFeatureSet(d))
		fmt.Printf("Simhash of %s: %x\n", d, hashes[i])
	}

	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[1], simhash.Compare(hashes[0], hashes[1]))
	fmt.Printf("Comparison of `%s` and `%s`: %d\n", docs[0], docs[2], simhash.Compare(hashes[0], hashes[2]))
}

Output:

Simhash of this is a test phrase: 8c3a5f7e9ecb3f35
Simhash of this is a test phrass: 8c3a5f7e9ecb3f21
Simhash of foo bar: d8dbe7186bad3db3
Comparison of `this is a test phrase` and `this is a test phrass`: 2
Comparison of `this is a test phrase` and `foo bar`: 29

主要指標

概覽
名稱與所有者mfonda/simhash
主編程語言Go
編程語言Go (語言數: 1)
平台
許可證MIT License
所有者活动
創建於2013-12-24 19:50:23
推送於2015-10-07 19:58:38
最后一次提交2015-10-07 12:58:37
發布數0
用户参与
星數226
關注者數8
派生數32
提交數14
已啟用問題?
問題數6
打開的問題數4
拉請求數1
打開的拉請求數0
關閉的拉請求數1
项目设置
已啟用Wiki?
已存檔?
是復刻?
已鎖定?
是鏡像?
是私有?