recommendify

使用协同过滤生成推荐。「Generate recommendations using collaborative filtering」

Github stars Tracking Chart

recommendify(推荐)

Recommendify 是一个基于 ruby/redis 的推荐引擎--推荐可以在多个主机上被增量更新/处理。工作器是用纯 ruby 和本地 C 语言实现的。

用例

  • "购买此产品的用户也购买了......" 来自于 user_id--bought-->product_id 对
  • "观看此视频的用户也观看了......" 来自于 user_id--viewed-->video_id 对
  • "喜欢这个场地的用户也喜欢..." 来自 user_id--likes-->venue_id 对

简介

你的输入数据(所谓的互动集)应该是这样的:

# 格式A:用户购买了产品(select buyerid, productid from sales group_by buyerid)
[user23] product5 produt42 product17
[user42] product8 produt16 product5

# # 格式B:用户观看了视频(这可以用 map/reduce 转换为上层表示)
user3 -> video3
user6 -> video19
user3 -> video6
user1 -> video42

输出的数据将看起来像这样:

# 基于共同购买的类似产品
product5 => product17 (0.78), product8 (0.43), product42 (0.31)
product17 => product5 (0.36), product8 (0.21), product42 (0.18)

# 基于共同意见的类似视频
video19 => video3 (0.93), video6 (0.56), video42 (0.34)
video42 => video19 (0.32), video3 (0.21), video6 (0.08)

你可以逐步向处理器添加新的交互集,但在添加新的交互后,必须重新处理已改变的项目的相似性。你可以不时地重新处理所有的项目(recommender.process!),或者跟踪更新,只处理变化的项目(recommender.process_item!)

使用方法

# 我们的相似性矩阵,我们通过 "orders" 中产品的共同发生来计算相似性,使用
# jaccard 相似性测量。
class MyRecommender < Recommendify::Base

  # 每个项目只存储前50个邻居的信息
  max_neighbors 50

  # 定义一个输入数据集 "order_items"。我们将把 "order_id->product_id"
  # 对添加到这个输入中,并使用 jaccard 系数来检索订购了 item i1 的客户
  # 也订购了 item i2 的语句,并将结果应用于 item<->item 相似性矩阵,
  # 权重为5.0。
  input_matrix :order_items,  
    # :native => true,
    :similarity_func => :jaccard,    
    :weight => 5.0

end

recommender = MyRecommender.new

# 在 order_item_sim 输入中添加 `order_id->product_id` 交互,
# 你可以逐步添加数据,并随时调用 RecommendedItem.process! 
# 来更新相似性矩阵。
recommender.order_items.add_set("order1", ["product23", "product65", "productm23"])
recommender.order_items.add_set("order2", ["product14", "product23"])

# 计算相似性矩阵的所有元素
recommender.process!

# ...或者计算相似度矩阵的特定行(特定项目),
# 用这个来避免增量更新后重新处理整个矩阵。
recommender.process_item!("product65")

# 检索与 "product23" 相似的产品
recommender.for("item23") 
  => [ <Recommendify::Neighbor item_id:"product65" similarity:0.23>, (...) ]

# 从相似度矩阵和输入矩阵中删除 "product23"。如果你的项目 "过期",
# 你应该这样做,因为这将加快计算速度。
recommender.delete_item!("product23") 

工作原理

Recommendify 保留一个增量更新的 item x item 矩阵,即 "co-concurrency matrix 共存矩阵"。这个矩阵存储了两个项目的组合在互动/偏好集中出现的次数。用 jaccard 相似度测量法处理共存数,以检索另一个 item x item 相似度矩阵,用来寻找每个项目的 N 个最相似项目。这也被称为 "基于项目的二进制评分的协同过滤"(见 Miranda, Alipio 等人[1])。

将输入的 user->->item 对按 user-id 分组,并将其存储到交互集中

对于交互集中的每个 item<->item 组合,增加共现矩阵中的相应元素

对于共同发生矩阵中的每个 item<->item 组合,计算 item<->item 的相似度

对于每个 item,在各自的输出集中存储 N 个最相似的项目。

它是可伸缩的吗?

共现矩阵和相似性矩阵的最大条目数是 k(n)=(n^2)-(n/2),它的增长是 O(n^2)。然而,在真实场景中,所有 item<->item 组合出现在交互集中的可能性非常小,我们使用一个稀疏矩阵,它只对 value>0 的元素使用内存,相似度的大小增长为 O(n)。

本地/快速 worker

在你编译了本地 worker 后,你可以将 :native => true 选项传递给 input_matrix。这将使处理速度至少提高 10 倍。

cd ~/.rvm/gems/ruby-1.9.3-p0/gems/recommendify-0.2.2/
bundle exec rake build_native

例子

这些推荐是从2,3mb的 "profile visit"-data(取自 www.talentsuche.de)计算出来的 -- 请记住,推荐器只使用 visitor->visited 的数据,它不知道用户的性别。

(Example Results 图恕删略,请参见自述文件)。

完整片段:http://falbala.23loc.com/~paul/recommendify_out_1.html

最初处理 120.047 个 visitor_id->profile_id 对,目前在单核上,只用 ruby 实现需要半小时左右,用 native/c 实现需要 130 秒。它在 redis 中创建了一个 24.1mb 的 hashable(有截断的 user_rows a' max 100 items)。在另一个具有非常短的用户行的真实数据集(购买/支付数据)中,它对 9 万个项目只用了 3.4MB,结果非常好。你可以自己试试;完整的数据和代码在 doc/example.rb 和 doc/example_data.csv。

来源/参考文献

[1] Miranda C. and Alipio J. (2008). 二元评级的增量协作排序(LIAAD - INESC Porto, University of Porto)

[2] George Karypis(2000) 基于项目的Top-N推荐算法评估(明尼苏达大学计算机科学系陆军高性能计算研究中心)

[3] Shiwei Z., Junjie W. Hui X. 和 Guoping X. (2011) 基于top-K余弦相似度的缩放(数据与知识工程70)

许可证

Copyright (c) 2011 Paul Asmuth

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to use, copy and modify copies of the Software, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Overview

Name With Ownerasmuth/recommendify
Primary LanguageRuby
Program languageC (Language Count: 2)
PlatformLinux, Mac, Windows
License:
Release Count0
Created At2012-01-29 03:56:10
Pushed At2014-05-08 21:18:47
Last Commit At2012-03-20 13:43:21
Stargazers Count1.7k
Watchers Count72
Fork Count146
Commits Count161
Has Issues Enabled
Issues Count17
Issue Open Count12
Pull Requests Count0
Pull Requests Open Count6
Pull Requests Close Count8
Has Wiki Enabled
Is Archived
Is Fork
Is Locked
Is Mirror
Is Private

recommendify

Recommendify is a ruby/redis based recommendation engine - The recommendations can be updated/processed incrementally and on multiple hosts. The worker is implemented in plain ruby and native C.

Build status - Travis-ci


usecases

  • "Users that bought this product also bought..." from user_id--bought-->product_id pairs
  • "Users that viewed this video also viewed..." from user_id--viewed-->video_id pairs
  • "Users that like this venue also like..." from user_id--likes-->venue_id pairs

synopsis

Your input data (the so called interaction-sets) should look like this:

# FORMAT A: user bought products (select buyerid, productid from sales group_by buyerid)
[user23] product5 produt42 product17
[user42] product8 produt16 product5

# FORMAT B: user watched video (this can be transformed to the upper representation with a map/reduce)
user3 -> video3
user6 -> video19
user3 -> video6
user1 -> video42

The output data will look like this:

# similar products based on co-concurrent buys
product5 => product17 (0.78), product8 (0.43), product42 (0.31)
product17 => product5 (0.36), product8 (0.21), product42 (0.18)

# similar videos based on co-concurrent views
video19 => video3 (0.93), video6 (0.56), video42 (0.34)
video42 => video19 (0.32), video3 (0.21), video6 (0.08)

You can add new interaction-sets to the processor incrementally, but the similarities for changed items have to be re-processed after new interactions were added. You can either re-process all items (recommender.process!) from time to time or keep track of the updates and only process the changed items (recommender.process_item!)

usage


# Our similarity matrix, we calculate the similarity via co-concurrence 
# of products in "orders" using the jaccard similarity measure.
class MyRecommender < Recommendify::Base

  # store only the top fifty neighbors per item
  max_neighbors 50

  # define an input data set "order_items". we'll add "order_id->product_id"
  # pairs to this input and use the jaccard coefficient to retrieve a 
  # "customers that ordered item i1 also ordered item i2" statement and apply
  # the result to the item<->item similarity matrix with a weight of 5.0
  input_matrix :order_items,  
    # :native => true,
    :similarity_func => :jaccard,    
    :weight => 5.0

end

recommender = MyRecommender.new

# add `order_id->product_id` interactions to the order_item_sim input
# you can add data incrementally and call RecommendedItem.process! to update
# the similarity matrix at any time.
recommender.order_items.add_set("order1", ["product23", "product65", "productm23"])
recommender.order_items.add_set("order2", ["product14", "product23"])

# Calculate all elements of the similarity matrix
recommender.process!

# ...or calculate a specific row of the similarity matrix (a specific item)
# use this to avoid re-processing the whole matrix after incremental updates
recommender.process_item!("product65")

# retrieve similar products to "product23"
recommender.for("item23") 
  => [ <Recommendify::Neighbor item_id:"product65" similarity:0.23>, (...) ]

# remove "product23" from the similarity matrix and the input matrices. you should 
# do this if your items 'expire', since it will speed up the calculation
recommender.delete_item!("product23") 

how it works

Recommendify keeps an incrementally updated item x item matrix, the "co-concurrency matrix". This matrix stores the number of times that a combination of two items has appeared in an interaction/preferrence set. The co-concurrence counts are processed with a jaccard similarity measure to retrieve another item x item similarity matrix, which is used to find the N most similar items for each item. This is also called "Item-based Collaborative Filtering with binary ratings" (see Miranda, Alipio et al. [1])

  1. Group the input user->item pairs by user-id and store them into interaction sets
  2. For each item<->item combination in the interaction set increment the respective element in the co-concurrence matrix
  3. For each item<->item combination in the co-concurrence matrix calculate the item<->item similarity
  4. For each item store the N most similar items in the respective output set.

does it scale?

The maximum number of entries in the co-concurrence and similarity matrix is k(n) = (n^2)-(n/2), it grows O(n^2). However, in a real scenario it is very unlikely that all item<->item combinations appear in a interaction set and we use a sparse matrix which will only use memory for elemtens with a value > 0. The size of the similarity grows O(n).

native/fast worker

After you have compiled the native worker, you can pass the :native => true option to the input_matrix. This speeds up processing by at least 10x.

cd ~/.rvm/gems/ruby-1.9.3-p0/gems/recommendify-0.2.2/
bundle exec rake build_native

example

These recommendations were calculated from 2,3mb "profile visit"-data (taken from www.talentsuche.de) - keep in mind that the recommender uses only visitor->visited data, it doesn't know the gender of a user.

Example Results

full snippet: http://falbala.23loc.com/~paul/recommendify_out_1.html

Initially processing the 120.047 visitor_id->profile_id pairs currently takes around half an hour with the ruby-only implementation and ~130 seconds with the native/c implementation on a single core. It creates a 24.1mb hashtable in redis (with truncated user_rows a' max 100 items). In another real data set with very short user rows (purchase/payment data) it used only 3.4mb for 90k items with very good results. You can try this for yourself; the complete data and code is in doc/example.rb and doc/example_data.csv.

Sources / References

[1] Miranda C. and Alipio J. (2008). Incremental collaborative filtering for binary ratings (LIAAD - INESC Porto, University of Porto)

[2] George Karypis (2000) Evaluation of Item-Based Top-N Recommendation Algorithms (University of Minnesota, Department of Computer Science / Army HPC Research Center)

[3] Shiwei Z., Junjie W. Hui X. and Guoping X. (2011) Scaling up top-K cosine similarity search (Data & Knowledge Engineering 70)

License

Copyright (c) 2011 Paul Asmuth

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to use, copy and modify copies of the Software, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

To the top