TalkingData AdTracking Fraud Detection Challenge

## models and scores

Model definitions can be found in scripts/model_lib.py.

- model1: LightGBM with 83 features (76 numerical, 7 categorical).
- model2: Keras with 27 features (18 numerical, 9 categorical). You can see the network structure in model.png.

| model | private score | public score |
| --- | --- | --- |
| model1 | 0.9836325 | 0.9828896 |
| model2 | 0.9830595 | 0.9822785 |

## feature engineering and scripts
Most of these features have already been discussed on the Kaggle forum.
counting features
- mk_feat_count.py
- mk_feat_count_time.py
- mk_feat_countRatio.py
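
A minimal sketch of a count feature and a count-ratio feature, assuming pandas. The helper `add_count_feature`, the key combinations, and the output column names are illustrative assumptions, not the code in the scripts above; `ip` and `app` are columns of train.csv.

```python
import pandas as pd

def add_count_feature(df, keys):
    """Return a copy of df with the number of rows sharing the same key values."""
    name = "count_" + "_".join(keys)
    out = df.copy()
    # transform("size") broadcasts each group's size back onto its rows
    out[name] = out.groupby(keys)[keys[0]].transform("size")
    return out

# Toy data standing in for the click log
clicks = pd.DataFrame({
    "ip":  [1, 1, 1, 2, 2],
    "app": [3, 3, 9, 3, 3],
})
clicks = add_count_feature(clicks, ["ip"])
clicks = add_count_feature(clicks, ["ip", "app"])

# Ratio feature: share of this IP's clicks that went to this app
clicks["ratio_ip_app"] = clicks["count_ip_app"] / clicks["count_ip"]
```
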
cumulative count
- mk_feat_cumcount.py
- mk_feat_recumcount.py
- mk_feat_cumratio.py
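
A rough sketch of the three cumulative features, assuming pandas and a frame already sorted by click time. The definitions are my reading of the script names (forward count, reverse count, position divided by group size) and may differ from what the scripts compute; the column names are made up for illustration.

```python
import pandas as pd

# Toy data, assumed to be sorted by click_time already
clicks = pd.DataFrame({"ip": [1, 1, 2, 1, 2]})

by_ip = clicks.groupby("ip")
cum = by_ip.cumcount()                 # clicks by this IP before the current row
rev = by_ip.cumcount(ascending=False)  # clicks by this IP after the current row
size = by_ip["ip"].transform("size")   # total clicks by this IP

clicks["cumcount"] = cum
clicks["recumcount"] = rev
# Relative position of the click within the IP's history, in (0, 1]
clicks["cumratio"] = (cum + 1) / size
```
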
time to next click
- mk_feat_nextClick_leak_day.py
- mk_feat_nextClick_filter.py
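
A minimal sketch of a time-to-next-click feature, assuming pandas. Grouping by `ip` alone and the column name `next_click_sec` are assumptions for illustration; the scripts above may group on other key combinations or apply extra filtering.

```python
import pandas as pd

# Toy data using the train.csv column names ip and click_time
clicks = pd.DataFrame({
    "ip": [1, 2, 1, 1, 2],
    "click_time": pd.to_datetime([
        "2017-11-07 00:00:00",
        "2017-11-07 00:00:05",
        "2017-11-07 00:00:10",
        "2017-11-07 00:01:10",
        "2017-11-07 00:00:35",
    ]),
})

clicks = clicks.sort_values("click_time").reset_index(drop=True)
# Time of the same IP's following click; NaT for each IP's last click
next_time = clicks.groupby("ip")["click_time"].shift(-1)
clicks["next_click_sec"] = (next_time - clicks["click_time"]).dt.total_seconds()
```

The last click of each group has no successor and stays NaN, which LightGBM can consume directly; a neural network would need it imputed first.
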
time bucket count (split time into multiple intervals and count the number of buckets in which the IP appears)
- mk_feat_rangecount.py
- mk_feat_rangecount_minute.py
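
The bucket count described above can be sketched as follows, assuming pandas. The 60-minute bucket width and the column name `rangecount_hour` are assumptions; the scripts may use different intervals.

```python
import pandas as pd

# Toy data using the train.csv column names ip and click_time
clicks = pd.DataFrame({
    "ip": [1, 1, 1, 2, 2],
    "click_time": pd.to_datetime([
        "2017-11-07 00:05:00",
        "2017-11-07 00:40:00",
        "2017-11-07 03:10:00",
        "2017-11-07 01:00:00",
        "2017-11-07 01:59:00",
    ]),
})

# Assign each click to the start of its 60-minute bucket
bucket = clicks["click_time"].dt.floor("60min")
# Number of distinct buckets in which each IP appears
clicks["rangecount_hour"] = bucket.groupby(clicks["ip"]).transform("nunique")
```
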
variance
- mk_feat_var.py
common IP
- mk_feat_common_ip.py
unique count
- mk_feat_uniq_count2.py
target encoding: WoE (weight of evidence)
- mk_feat_woe_all_prev.py
- mk_feat_woe_bound.py
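
A minimal sketch of weight-of-evidence encoding, assuming pandas and NumPy. It uses the textbook definition ln(P(category | positive) / P(category | negative)) with additive smoothing; the function `woe_table`, the `smooth` parameter, and the split into earlier and later rows are assumptions for illustration, not the logic of the scripts above.

```python
import numpy as np
import pandas as pd

def woe_table(history, key, target, smooth=0.5):
    """WoE per category: ln(P(category | positive) / P(category | negative))."""
    stats = history.groupby(key)[target].agg(pos="sum", n="count")
    pos = stats["pos"]
    neg = stats["n"] - stats["pos"]
    # Additive smoothing keeps the log finite for categories with no positives or no negatives
    pos_rate = (pos + smooth) / (pos.sum() + smooth * len(stats))
    neg_rate = (neg + smooth) / (neg.sum() + smooth * len(stats))
    return np.log(pos_rate / neg_rate).rename("woe_" + key)

# Earlier rows with known labels (column names follow train.csv)
history = pd.DataFrame({
    "app":           [3, 3, 3, 9, 9, 9],
    "is_attributed": [1, 1, 0, 0, 0, 1],
})
table = woe_table(history, "app", "is_attributed")

# Encode later rows using statistics from the earlier rows only, so a row's own label never leaks in
later = pd.DataFrame({"app": [3, 9, 7]})
later["woe_app"] = later["app"].map(table).fillna(0.0)  # unseen category gets the neutral value 0
```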

Features will be calculated once and saved to disk.
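
The compute-once pattern can be sketched like this, assuming pandas. The helper `load_or_build`, the pickle format, and the file name are hypothetical; the storage format the scripts actually use is not stated here.

```python
import os
import tempfile

import pandas as pd

def load_or_build(path, build):
    """Return the feature cached at path, computing and saving it on first use."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    feat = build()
    feat.to_pickle(path)
    return feat

calls = []  # records how often the feature is actually computed

def build_ip_count():
    calls.append(1)
    clicks = pd.DataFrame({"ip": [1, 1, 2]})
    return clicks.groupby("ip")["ip"].transform("size").rename("count_ip")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "count_ip.pkl")
    first = load_or_build(path, build_ip_count)   # computed and written to disk
    second = load_or_build(path, build_ip_count)  # read back, not recomputed
```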

Feature importance from LGBM can be found in importance.txt.

I used the following environment.

Hardware:

Python3 packages:

First, put sample_submission.csv, test.csv, test_supplement.csv, and train.csv in the input directory.

Then run the shell scripts as follows:

$ cd scripts/

$ ./run_mk_feats.sh

$ ./run_mk_model1.sh

$ ./run_mk_model2.sh

Output prediction files will be written to the csv directory.

Feature extraction (run_mk_feats.sh) took about one day.

Building model1 (run_mk_model1.sh) needs a large amount of memory (~256GB), sorry.

A GPU is required to build model2 (run_mk_model2.sh).