triez

fast, efficient, unicode aware HAT trie with prefix / suffix support for Ruby

  • 所有者: luikore/triez
  • 平台:
  • 許可證: MIT License
  • 分類:
  • 主題:
  • 喜歡:
    0
      比較:

Github星跟蹤圖

Triez

Build Status
Code Climate
Gem Version

Pragmatic tries for Ruby, spelled in lolcat.

It is fast, memory efficient, unicode aware, prefix searchable, and enchanced with prefix/suffix/substring keys.

The backend of triez is a cache oblivious data structure: the HAT trie (In fact it is a modified version for improved functionality). HAT trie is generally faster and more memory efficient than double array or burst trie.

Requirement

  • CRuby 1.9 / 2.0
  • g++ or clang

Install

gem ins triez

Synopsis

require 'triez'

# create triez with (default value type) int64, and setting default value 0
t = Triez.new

# The default value is 0
t['foo'] #=> 0

# available options for value_type:
# - :int64  -- signed 64 bit integer
# - :object -- ruby object, but note that:
#              if the object is beyond simple types like NilClass, TrueClass, Integer,
#              it will take additional heap space
t = Triez.new value_type: :int64

# more flexible with object type [*see note below]
t = Triez.new value_type: :object

# get the value type
t.value_type

# set a different default value
t = Triez.new value_type: :object, default: 'hello'

# insert or change value
t['key'] = 100

# insert a key with default value
t << 'key'

# batch change values under all suffices/prefices/substrings of a key
t.change_all(:suffix, 'key') {, old_value, ...calculate new value }
t.change_all(:prefix, 'key') {, old_value, ...calculate new value }
# enumerates all occurences of substrings of the key
t.change_all(:substring, 'key') {, old_value, ...calculate new value }

# size of inserted keys
t.size

# search with exact match
t.has_key? 'key'
t['key']

# prefixed search (iterate over values under a prefix), available options are:
# - limit: max items, `nil` means no limit
# - sort: whether iterate in alphabetic order, default is true
t.search_with_prefix(prefix, limit: 10, sort: true) do, suffix, value, ...
end

# if no block given, an array in the form of  is returned
t.search_with_prefix('prefix')

# enumerate all keys and values in the order of binary collation
t.each do, key, value, ...
end

# iterate stored keys which are prefices of a given string, from shallow to deep
t.walk string do, k, v, ...
end

* Note: By default, triez store signed integers within 64bits, you can use them as weights, counts or database IDs. In case you need to store arbitrary object in a node, use value_type: :object:

t = Triez.new value_type: :object
t['Tom'] = {name: 'Tom', sex: 'Female'}
t['Tree'] = [:leaf, :trunk, :root]

Examples

Prefix based autocompletion:

require 'triez'
words = %w[readme, rot, red, rah, rasterization]
t = Triez.new
words.each do, word, t[word] = 1
end
t.search_with_prefix 're' do, suffix, puts "candidate: re#{suffix}"
end

The output:

candidate: readme
candidate: red

Efficient full text search with a suffix tree:

require 'triez'
sequences = {
  'ACTGAAAAAAACTG' => 1,
  'ATACGGTCCA' => 2,
  'GCTTGTACGT' => 3
}
t = Triez.new

# build suffix tree
sequences.each do, seq, id, t.change_all(:suffix, seq){id}
end

t.search_with_prefix 'CGGT' do, _, id, puts id #=> 2
end

The searching time is linear to the length of the substring. You may also be interested in the example of a simple full text search server with triez.


Solve the longest common substring problem:

# coding: utf-8
require 'triez'
sentences = %w[
  万塘路一锅鸡
  去文二路一锅鸡吃饭
  来一锅鸡顶盒
  一锅鸡胗
]

# value is bitset representing id of the sentence
# in ruby we can use integers of arbitrary length as bitsets
t = Triez.new value_type: :object, default: 0

sentences.each_with_index do, sentence, i, elem = 1 << i
  t.change_all :substring, sentence do, v, # union
    v, elem
  end
end

# longest common substring
lcs = ''

# find the key tagged with universe
universe = (1 << sentences.size) - 1
t.each do, k, v, lcs = k if k.size > lcs.size and v == universe
end

puts lcs #=> 一锅鸡

Benchmark

Here's a benchmark on

ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-darwin12.2.1]
2.3 GHz Intel Core i7

The test data are 3 milion titles of wikipedia articles (from http://dumps.wikimedia.org/enwiki/20121101/)

thing/backend, memory, insertion time, 3 M query
------------------------, ---------, ----------------, ----------
hash/linked hash, 340.2 M, 4.369 s, 0.2800 s
fast_trie/double array*, 155.6 M, 130.7 s, 0.4359 s
triez/HAT trie, 121.7 M, 3.872 s, 0.3472 s

Note: fast_trie/double array -> https://github.com/tyler/trie

Caveats

  • The sort option in prefixed search orders keys with binary collation, but string comparison in Ruby is with unicode codepoint collation.
  • For some rare case of many threads modifying the same trie, you may need a mutex.
  • If you still feel memory not enough, you may consider MARISA-trie (note that MARISA is immutable), or a database.

Development

git clone git://github.com/luikore/triez.git
cd triez
rake glob_src
rake

To update vendor lib and re-compile:

rake glob_src
rake

Note

Although HAT trie uses MurMurHash3 instead of SipHash in Ruby, It is still safe under hashDoS because bucket size is limited.

主要指標

概覽
名稱與所有者luikore/triez
主編程語言C++
編程語言Ruby (語言數: 3)
平台
許可證MIT License
所有者活动
創建於2013-02-04 00:23:42
推送於2018-03-29 02:58:38
最后一次提交2018-03-29 10:58:06
發布數0
用户参与
星數140
關注者數7
派生數9
提交數80
已啟用問題?
問題數8
打開的問題數5
拉請求數1
打開的拉請求數0
關閉的拉請求數0
项目设置
已啟用Wiki?
已存檔?
是復刻?
已鎖定?
是鏡像?
是私有?