kaggle-quora-question-pairs

Kaggle: Quora Question Pairs, 4th/3396 (https://www.kaggle.com/c/quora-question-pairs)



Kaggle: Quora Question Pairs (Coming Soon)

Authors: Liang Pang, Yixing Fan, Jianpeng Hou, Xinyu Yue, Guocheng Niu



Abstract

In the Quora Question Pairs Challenge, we were asked to build a model that classifies whether the two questions in a pair are duplicates, that is, different wordings of the same question. Our final submission was a stacked ensemble of multiple models. It scored 0.11450 on the Public LB and 0.11768 on the Private LB (with post-processing), ranking 4th out of 3396 teams. This document describes our team's solution, which can be divided into four parts: Pre-processing, Feature Engineering, Modeling and Post-processing.


Summary

Our solution consisted of four main parts: Pre-processing, Feature Engineering, Modeling and Post-processing. In addition, we developed a lightweight machine learning framework, FeatWheel, to handle routine ML jobs such as feature extraction and feature merging.

In pre-processing, we applied text cleaning, word stemming, and removal of stop words and shared words, which yields several different versions of the original data. In feature engineering, we extracted features from these versions of the data. The features fall into three categories: Statistical Features, NLP Features and Graph Features. In modeling, we built deep models, boosting models (using XGBoost and LightGBM) and linear models (Linear Regression), and combined them with a multi-layer stacking system. Because the training data and the test data are distributed quite differently, we post-processed the predictions: we split the data into parts according to clique size and rescaled the predictions within each part.
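The sketch below illustrates, under stated assumptions, one feature from two of the categories above: a shared-word ratio (a statistical feature computed on a stop-word-free version of the text) and the size of the largest clique containing a question pair (a graph feature). The stop-word list, the function names and the exact feature definitions are hypothetical and chosen for illustration only; they are not taken from FeatWheel or from our feature set.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not the list used in the solution).
STOP_WORDS = {"a", "an", "the", "is", "are", "do", "does",
              "how", "what", "i", "to", "of", "in"}


def clean(text):
    """Lowercase the text and keep only alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def content_words(text):
    """One 'version' of the data: cleaned tokens with stop words removed."""
    return [w for w in clean(text) if w not in STOP_WORDS]


def shared_word_ratio(q1, q2):
    """Statistical feature: Jaccard overlap of the two questions' content words."""
    a, b = set(content_words(q1)), set(content_words(q2))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_graph(pairs):
    """Undirected graph: every question is a node, every question pair an edge."""
    adj = defaultdict(set)
    for q1, q2 in pairs:
        if q1 != q2:
            adj[q1].add(q2)
            adj[q2].add(q1)
    return adj


def max_clique_size(adj, u, v):
    """Graph feature: size of the largest clique that contains the edge (u, v).

    Exhaustive search, exponential in the worst case; fine for a small example.
    """
    def expand(size, candidates):
        best = size
        for node in list(candidates):
            # Drop the node so the same clique is not explored in another order.
            candidates = candidates - {node}
            best = max(best, expand(size + 1, candidates & adj[node]))
        return best

    # Candidates are the nodes adjacent to both endpoints of the pair.
    return expand(2, adj[u] & adj[v])
```

For example, with the pairs `("A", "B")`, `("B", "C")`, `("A", "C")` and `("C", "D")`, the pair `("A", "B")` lies in a triangle while `("C", "D")` belongs to no clique larger than the edge itself, so the two pairs would fall into different parts if the data were split by clique size.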

Flowchart

The flowchart of our method is shown as follows:

[Figure: flowchart of our method]

Submission

Submissions were evaluated on the log loss between the predicted values and the ground truth. Specifically, the best single model we obtained during the competition was an XGBoost model with the tree booster, which scored 0.12653 on the Public LB and 0.13067 on the Private LB (without post-processing). Our final submission was a stacked ensemble of multiple models. It scored 0.11450 on the Public LB and 0.11768 on the Private LB (with post-processing), ranking 4th out of 3396 teams.
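For reference, the binary log loss can be written out in a few lines. This is a minimal sketch of the standard formula, not the competition's scoring code; the function name and the clipping constant `eps` are assumptions made for the example.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary log loss; lower is better.

    y_true: iterable of 0/1 labels (1 = duplicate pair).
    y_pred: iterable of predicted probabilities of the pair being a duplicate.
    eps:    clipping constant (an assumption) that keeps log() finite at 0 and 1.
    """
    total = 0.0
    count = 0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
        count += 1
    return -total / count
```

A model that always predicts 0.5 scores ln 2 (about 0.693) regardless of the labels, which gives a sense of scale for the leaderboard scores quoted above.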


Deep Model

Please see TextNet and TextNet-Model. For a TensorFlow version, please check out MatchZoo.
