tianchi_OGeek

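In search, one scenario is real-time search (Instant Search): query results are returned in real time while the user is still typing. This competition task comes from a sub-scenario of OPPO's mobile search ranking optimization, simplified for the contest, and aims to solve the query-title semantic matching problem. After simplification, the task is essentially CTR prediction for query-title pairs in a real-time search scenario.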

0 Scores

(1) Leaderboard A: 0.7347
(2) Leaderboard B: 0.7335
(3) Competition page: https://tianchi.aliyun.com/competition/introduction.htm?spm=5176.11409106.5678.1.2c547b6fmKviKy&raceId=231688
(4) Data download: https://pan.baidu.com/s/1JdjA0cMijrqTCWpaCGMaYA (extraction code: vg19)

1 Shared baselines

(1) Tianchi OGeek Algorithm Challenge baseline (0.7016): https://zhuanlan.zhihu.com/p/46482521
(2) OGeek Algorithm Challenge code sharing: https://zhuanlan.zhihu.com/p/46479794
(3) GrinAndBear/OGeek: https://github.com/GrinAndBear/OGeek
(4) flytoylf/OGeek, LightGBM and RNN code: https://github.com/flytoylf/OGeek
(5) https://github.com/search?q=OGeek
(6) https://github.com/search?q=tianchi_oppo
(7) https://github.com/luoling1993/TianChi_OGeek/stargazers

2 CTR references

(1) Recommender Systems Meet Deep Learning: https://github.com/princewen/tensorflow_practice
(2) Designing f(x) for CTR-based ranking in recommender systems, DNN part: https://github.com/nzc/dnn_ctr
(3) CTR prediction algorithms FM, FFM, and DeepFM, with practice: https://github.com/milkboylyf/CTR_Prediction
(4) MLR algorithm: https://wenku.baidu.com/view/b0e8976f2b160b4e767fcfdc.html

3 NLP references

(1) Solving large-scale text classification with deep learning (CNN, RNN, Attention), survey and practice: https://zhuanlan.zhihu.com/p/25928551
(2) Winning the Zhihu "Kanshan Cup": https://zhuanlan.zhihu.com/p/28923961
(3) 2017 Zhihu Kanshan Cup, from getting started to second place: https://zhuanlan.zhihu.com/p/29020616
(4) liuhuanyong: https://github.com/liuhuanyong
(5) Chinese Word Vectors: https://github.com/Embedding/Chinese-Word-Vectors (note: this link collects corpora)

4 Write-ups from other competitions

(1) ML Theory & Practice: https://zhuanlan.zhihu.com/c_152307828?tdsourcetag=s_pctim_aiomsg

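5 Unorganized ideas

(1) Main line, the CTR approach: build features around user click-through rate (as in the open-source baselines: per-field click rate, combined-field click rate, and so on). FM and FFM models (see the Tencent Social Ads competition?)
(2) Text matching approach (Kaggle Quora). Traditional features: extract text-similarity features and quantify the distance between fields. https://www.kaggle.com/c/quora-question-pairs https://github.com/qqgeogor/kaggle-quora-solution-8th https://github.com/abhishekkrthakur/is_that_a_duplicate_quora_question
(3) Deep learning models (1D CNN, ESIM, Decomposable Attention, ELMo, etc.): https://www.kaggle.com/rethfro/1d-cnn-single-model-score-0-14-0-16-or-0-23/notebook https://www.kaggle.com/lamdang/dl-models/comments For more text matching models, see the Stanford SNLI paper list: https://nlp.stanford.edu/projects/snli/
(4) Text classification approach: the main question is how to organize the input text. Should the query_prediction weights also be taken into account? Traditional features: TF-IDF, BoW, n-gram + TF-IDF, sent2vec, LSI, LDA, etc.
(5) Deep learning models: see the Zhihu Kanshan Cup and the Kaggle Toxic competition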

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/52557
https://www.kaggle.com/larryfreeman/toxic-comments-code-for-alexander-s-9872-model/comments
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/52702

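(6) Stacking does not work (there is a limit on the number of models); use simple blending. Is an NN + LightGBM combination the more reliable option?
(7) PS1: word vectors can be trained with word2vec, or taken from public pretrained vectors: https://github.com/Embedding/Chinese-Word-Vectors PS2: word segmentation needs a custom dictionary; segmentation quality matters a lot for model training!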
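
The per-field and combined-field click rates mentioned in idea (1) can be sketched as below. This is a minimal illustration, not the code of any linked baseline: the field names prefix, title, tag and label, the toy rows, and the smoothing priors are all assumptions made for the example.

```python
from collections import defaultdict


def ctr_table(rows, keys, prior_clicks=1.0, prior_views=2.0):
    """Smoothed click rate for every distinct combination of the `keys` fields.

    The priors are an illustrative choice: they pull rarely seen
    combinations toward 0.5 instead of an extreme 0 or 1.
    """
    clicks = defaultdict(float)
    views = defaultdict(float)
    for row in rows:
        key = tuple(row[name] for name in keys)
        views[key] += 1
        clicks[key] += row["label"]
    return {
        key: (clicks[key] + prior_clicks) / (views[key] + prior_views)
        for key in views
    }


# Toy training rows (hypothetical values).
train = [
    {"prefix": "wei", "title": "weixin", "tag": "app", "label": 1},
    {"prefix": "wei", "title": "weibo", "tag": "app", "label": 0},
    {"prefix": "wei", "title": "weixin", "tag": "app", "label": 1},
    {"prefix": "tao", "title": "taobao", "tag": "app", "label": 0},
]

# Single-field click rate: "wei" has 2 clicks in 3 views -> (2 + 1) / (3 + 2) = 0.6
prefix_ctr = ctr_table(train, ["prefix"])

# Combined-field click rate: ("wei", "weixin") -> (2 + 1) / (2 + 2) = 0.75
prefix_title_ctr = ctr_table(train, ["prefix", "title"])
```

When such statistics are used as model features, they are usually computed out-of-fold so that a row's own label does not leak into its feature value.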

6 Basic considerations

(1) How to choose classifiers that generalize well -> logistic regression; support vector machine; linear regression
(2) How to construct text features -> NLP analysis
(3) How to handle sparse features -> DeepFM
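
To make point (3) concrete: DeepFM pairs a factorization machine (FM) component with a deep network over shared embeddings, and the FM part is what lets sparse features interact through low-dimensional latent vectors. The sketch below shows only that FM score, in pure Python, using the standard linear-time rewrite of the pairwise term. The function names, the sparse dict input, and the toy parameters are assumptions for illustration; it is not the implementation from any of the linked repositories.

```python
import math


def fm_score(x, w0, w, v):
    """Factorization machine score for one sparse sample.

    x  : dict mapping feature index -> value (only non-zero features)
    w0 : global bias
    w  : list of linear weights, one per feature
    v  : list of latent vectors, one k-dimensional list per feature
    """
    linear = w0 + sum(w[i] * xi for i, xi in x.items())

    # Pairwise term sum_{i<j} <v_i, v_j> x_i x_j, computed per latent
    # dimension as 0.5 * ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2).
    k = len(v[0])
    pairwise = 0.0
    for f in range(k):
        total = sum(v[i][f] * xi for i, xi in x.items())
        total_sq = sum((v[i][f] * xi) ** 2 for i, xi in x.items())
        pairwise += 0.5 * (total * total - total_sq)
    return linear + pairwise


def fm_probability(x, w0, w, v):
    """Click probability: sigmoid of the FM score."""
    return 1.0 / (1.0 + math.exp(-fm_score(x, w0, w, v)))


# Toy parameters: 3 one-hot features, latent dimension 2 (hypothetical values).
w0 = 0.1
w = [0.2, 0.0, -0.1]
v = [[0.5, 0.1], [0.3, 0.3], [0.2, -0.4]]

# A sample where features 0 and 2 are active.
sample = {0: 1.0, 2: 1.0}
score = fm_score(sample, w0, w, v)
prob = fm_probability(sample, w0, w, v)
```

Because only the non-zero entries of a sample are visited, the cost grows with the number of active features rather than with the full feature space, which is why this formulation suits one-hot encoded fields.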
