TalkingData

TalkingData AdTracking Fraud Detection Challenge

  • Owner: CuteChibiko/TalkingData
  • License: Apache License 2.0

models and scores

Model definitions can be found in scripts/model_lib.py.

  • model1: LGBM with 83 features (76 numerical, 7 categorical).

  • model2: Keras with 27 features (18 numerical, 9 categorical). The network structure is shown in model.png.

| model  | private score | public score |
| ---    | ---           | ---          |
| model1 | 0.9836325     | 0.9828896    |
| model2 | 0.9830595     | 0.9822785    |

feature engineering and scripts

Most of these features have already been discussed on the Kaggle forum.

  • counting features

    • mk_feat_count.py
    • mk_feat_count_time.py
    • mk_feat_countRatio.py
  • cumulative count

    • mk_feat_cumcount.py
    • mk_feat_recumcount.py
    • mk_feat_cumratio.py
  • time to next click

    • mk_feat_nextClick_leak_day.py
    • mk_feat_nextClick_filter.py
  • time bucket count (split time into multiple intervals and count the number of buckets in which the IP appears)

    • mk_feat_rangecount.py
    • mk_feat_rangecount_minute.py
  • variance

    • mk_feat_var.py
  • common IP

    • mk_feat_common_ip.py
  • unique count

    • mk_feat_uniq_count2.py
  • target encoding: WoE (weight of evidence)

    • mk_feat_woe_all_prev.py
    • mk_feat_woe_bound.py
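
As a rough illustration of a few of the feature families listed above, the following is a minimal pandas sketch on a toy click log. It is not the code in the mk_feat_*.py scripts. The column names ip, app, click_time and is_attributed follow the competition data, while the function names, the feature column names and the smoothing constant are assumptions made for this example. The WoE here is computed over all rows for simplicity; a real pipeline would normally use only earlier data to avoid target leakage.

```python
import numpy as np
import pandas as pd


def add_basic_click_features(df, keys):
    """Add count, cumulative count and time-to-next-click for one key group.

    Hypothetical helper for illustration; `keys` is a list of column names.
    """
    name = "_".join(keys)
    out = df.sort_values("click_time", kind="mergesort").copy()
    grouped = out.groupby(keys)
    # counting feature: total number of clicks sharing the same key values
    count = grouped["click_time"].transform("count")
    # cumulative count: number of earlier clicks in the same key group
    cumcount = grouped.cumcount()
    # time to next click: seconds until the same key group clicks again
    # (NaN for the last click of each group)
    next_time = grouped["click_time"].shift(-1)
    out["count_" + name] = count
    out["cumcount_" + name] = cumcount
    out["nextClick_" + name] = (next_time - out["click_time"]).dt.total_seconds()
    return out.sort_index()


def weight_of_evidence(df, key, target="is_attributed", smoothing=0.5):
    """Smoothed weight of evidence per category of `key` (assumed smoothing)."""
    pos = df.groupby(key)[target].sum()
    neg = df.groupby(key)[target].count() - pos
    pos_rate = (pos + smoothing) / (pos.sum() + smoothing)
    neg_rate = (neg + smoothing) / (neg.sum() + smoothing)
    return np.log(pos_rate / neg_rate)


clicks = pd.DataFrame({
    "ip": [1, 1, 2, 1, 2],
    "app": [3, 3, 3, 9, 3],
    "click_time": pd.to_datetime([
        "2017-11-07 00:00:00",
        "2017-11-07 00:00:05",
        "2017-11-07 00:00:07",
        "2017-11-07 00:01:00",
        "2017-11-07 00:00:30",
    ]),
    "is_attributed": [1, 0, 0, 0, 1],
})

features = add_basic_click_features(clicks, ["ip"])
woe = weight_of_evidence(clicks, "app")
print(features[["ip", "count_ip", "cumcount_ip", "nextClick_ip"]])
print(woe)
```

Passing a longer key list such as ["ip", "app"] gives the same three features for that combination.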

Features will be calculated once and saved to disk.
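
The compute-once-and-save pattern can be sketched as below. This is only an assumed illustration, not the repository's implementation: the helper name load_or_build_feature, the pickle format and the file name are all hypothetical.

```python
import os
import tempfile

import pandas as pd


def load_or_build_feature(path, build_fn):
    """Return the cached feature at `path`, building and saving it if absent."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    feature = build_fn()
    feature.to_pickle(path)
    return feature


build_calls = []


def build_count_feature():
    # stands in for an expensive feature computation
    build_calls.append(1)
    return pd.Series([3, 3, 2], name="count_ip")


cache_path = os.path.join(tempfile.mkdtemp(), "count_ip.pkl")
first = load_or_build_feature(cache_path, build_count_feature)
second = load_or_build_feature(cache_path, build_count_feature)  # read from disk
print(len(build_calls))
```

With this layout each feature can be rebuilt independently by deleting its file, which is convenient when extraction takes many hours.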

Feature importance from LGBM can be found in importance.txt.

Requirements

I used the following environment.

Hardware:

  • Memory: 256GB RAM, 256GB SWAP
  • CPU: 20 core, 2.10GHz
  • GPU: 1080Ti

Python3 packages:

  • numpy==1.14.2
  • pandas==0.22.0
  • lightgbm==2.1.0
  • keras==2.1.5

How to run

First, put sample_submission.csv, test.csv, test_supplement.csv, and train.csv in the input directory.

Then run the shell scripts as follows:

$ cd scripts/

$ ./run_mk_feats.sh

$ ./run_mk_model1.sh

$ ./run_mk_model2.sh

Output prediction files will be written to the csv directory.

Feature extraction (run_mk_feats.sh) took about one day.

Building model1 (run_mk_model1.sh) needs a large amount of memory (~256GB), sorry.

A GPU is required to build model2 (run_mk_model2.sh).
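
Since feature extraction runs for about a day, it may be worth checking that the four input files are in place before starting. The sketch below is a hypothetical helper and not part of the repository. The directory is passed in by the caller because the exact location of the input directory is an assumption here.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = [
    "sample_submission.csv",
    "test.csv",
    "test_supplement.csv",
    "train.csv",
]


def missing_input_files(input_dir):
    """Return the required file names that are not present in `input_dir`."""
    input_dir = Path(input_dir)
    return [name for name in REQUIRED_FILES if not (input_dir / name).is_file()]


# demo on a temporary directory that contains only two of the four files
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "train.csv").touch()
(demo_dir / "test.csv").touch()
missing = missing_input_files(demo_dir)
print("missing:", missing)
```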

Main metrics

Overview

  • Primary language: Python (language count: 2)

Owner activity

  • Created at: 2018-05-09 15:17:45
  • Pushed at: 2018-05-11 01:32:27
  • Last commit at: 2018-05-11 10:32:26
  • Release count: 0

User engagement

  • Stargazers: 104
  • Watchers: 1
  • Forks: 41
  • Commits: 6
  • Issues: 1 (open: 1)
  • Pull requests: 0 (open: 0, closed: 0)