kaggle-airbnb

:earth_africa: Where will a new guest book their first travel experience?

Airbnb Kaggle Competition: New User Bookings

This repository contains the code developed for Airbnb's Kaggle competition.
It is written in Python: some of it as Jupyter Notebooks and the rest as plain
Python 3 scripts.

The code produces predictions that score around 0.88090 on the public
leaderboard, enough to place among the top 5% of participants (0.001 behind the
best), and 0.88509 on the private leaderboard (0.0018 behind the winner).

The entire run should not take more than 4 hours (thanks to the parallel
preprocessing) on a recent computer, though you may run into memory issues
with less than 8GB of RAM.

Feel free to contribute to the code or open an issue if you see something wrong.

Description

New users on Airbnb can book a place to stay in 34,000+ cities across 190+
countries. By accurately predicting where a new user will book their first
travel experience, Airbnb can share more personalized content with their
community, decrease the average time to first booking, and better forecast
demand.

In this competition, the goal is to predict in which country a new user
will make his or her first booking. There are 12 possible outcomes of the
destination country and the datasets consist of a list of users with their
demographics, web session records, and some summary statistics.

Data

Due to the Competition Rules, the datasets cannot be shared. If you want to
take a look at the data, head over to the competition page and download it.

You need to download the train_users_2.csv, test_users.csv and sessions.csv
files and unzip them into the 'data' folder.

Note: Since the train users file is the one re-uploaded by the competition
administrators, rename train_users_2.csv as train_users.csv.
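
The rename described in the note can be scripted. This is a minimal sketch, assuming the files were unzipped into a `data` folder relative to the working directory; the function name `rename_train_file` is hypothetical and not part of the repository.

```python
from pathlib import Path


def rename_train_file(data_dir="data"):
    """Rename train_users_2.csv to train_users.csv inside data_dir.

    Returns the path of the file under its final name. If the rename
    already happened, nothing is changed.
    """
    data_dir = Path(data_dir)
    source = data_dir / "train_users_2.csv"
    target = data_dir / "train_users.csv"

    # Only rename when the re-uploaded file is present and the target is free.
    if source.exists() and not target.exists():
        source.rename(target)

    if not target.exists():
        raise FileNotFoundError(
            f"{source} not found; download the competition data first"
        )
    return target
```

Calling `rename_train_file()` from the repository root after unzipping the data should leave `data/train_users.csv` in place; calling it a second time is harmless.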

Main Ideas

  1. The provided datasets have a lot of NaNs and some other random values, so
    good preprocessing is the key to a good solution:

    • Replace -unknown- values with NaNs
    • Clean age values
    • Extract day, weekday, month, year from date_account_created
      and timestamp_first_active
    • Add number of missing values per user
    • General user session information:
      • Number of different values in action, action_type,
        action_detail and device_type
  2. This kind of classification task works nicely with tree-based methods. I
    used the xgboost library and the Gradient Boosting Classifier provided by
    scikit-learn to predict the class probabilities.
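
The preprocessing steps listed in item 1 can be sketched in a few functions. This is an illustration under stated assumptions, not the code from the repository: it uses only the standard library, the function names are hypothetical, the age bounds (15 to 100) are an arbitrary choice, the date formats (`YYYY-MM-DD` for date_account_created, `YYYYMMDDHHMMSS` for timestamp_first_active) are assumed, and the session rows are assumed to carry a `user_id` column.

```python
from collections import defaultdict
from datetime import datetime

SESSION_COLUMNS = ("action", "action_type", "action_detail", "device_type")


def clean_value(value):
    """Map empty strings and the '-unknown-' placeholder to None."""
    if value is None or value == "" or value == "-unknown-":
        return None
    return value


def clean_age(raw_age, low=15, high=100):
    """Return the age as a float, or None when missing or outside [low, high].

    The bounds are illustrative assumptions, not values from the repository.
    """
    if raw_age is None:
        return None
    try:
        age = float(raw_age)
    except ValueError:
        return None
    return age if low <= age <= high else None


def date_parts(prefix, moment):
    """Expand a datetime into day, weekday, month and year features."""
    return {
        f"{prefix}_day": moment.day,
        f"{prefix}_weekday": moment.weekday(),  # Monday is 0
        f"{prefix}_month": moment.month,
        f"{prefix}_year": moment.year,
    }


def preprocess_user(row):
    """Build a cleaned feature dict from one raw user row (dict of strings)."""
    user = {key: clean_value(value) for key, value in row.items()}
    user["age"] = clean_age(user.get("age"))
    # Count missing values after cleaning, before derived columns are added.
    user["n_missing"] = sum(value is None for value in user.values())
    created = datetime.strptime(user["date_account_created"], "%Y-%m-%d")
    active = datetime.strptime(user["timestamp_first_active"], "%Y%m%d%H%M%S")
    user.update(date_parts("created", created))
    user.update(date_parts("first_active", active))
    return user


def session_unique_counts(session_rows):
    """Count distinct non-missing values per user for each session column."""
    seen = defaultdict(lambda: {column: set() for column in SESSION_COLUMNS})
    for row in session_rows:
        columns = seen[row["user_id"]]  # registers the user even if all values are missing
        for column in SESSION_COLUMNS:
            value = clean_value(row.get(column))
            if value is not None:
                columns[column].add(value)
    return {
        user_id: {f"n_{column}": len(values) for column, values in columns.items()}
        for user_id, columns in seen.items()
    }
```

Rows read with `csv.DictReader` can be passed straight to `preprocess_user` and `session_unique_counts`. The same steps map naturally onto DataFrame operations if a dataframe library is preferred.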

Requirements

To replicate the findings and run the code in this repository you will need
Python 3 and, at a minimum, the packages named above: xgboost and scikit-learn.

Resources

  • XGBoost Documentation - A library designed
    and optimized for boosted (tree) algorithms.
  • Pattern Classification -
    Tutorials, examples, collections, and everything else that falls into the
    categories: pattern classification, machine learning, and data mining.

License

Copyright © 2015 David Gasquez
Licensed under the MIT license.
