bigquery

BigQuery import and processing pipelines

HTTP Archive + BigQuery data import

Note: you don't need to import this data yourself; the BigQuery dataset is public! See the getting started guide.
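
For example, the public tables can be queried directly with the bq CLI. A minimal sketch, assuming the public dataset exposes a runs.latest_pages table (the table reference is illustrative; check the public dataset for current table names):

$> bq.py query 'SELECT COUNT(*) FROM [httparchive:runs.latest_pages]'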

However, if you do want your own private copy of the dataset, the following import and sync scripts will help you import the HTTP Archive dataset into BigQuery and keep it up to date:

$> sh sync.sh Jun_15_2013
$> sh sync.sh mobile_Jun_15_2013

That's all there is to it. The sync script handles all the necessary processing:

  • Archives are fetched from archive.org (and cached locally)
  • The archived CSV is transformed to BigQuery-compatible escaping
    • You will need pigz installed for parallel compression
  • Request files are split into <1GB compressed CSVs
  • The resulting pages and requests data is synced to a Google Storage bucket
  • A BigQuery import is kicked off for each compressed archive on Google Storage (see the sketch below)
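
For reference, here is a minimal sketch of what those steps amount to in shell form. The archive.org URL, escaping fix, bucket name, and file names are illustrative placeholders, not the script's actual values; see sync.sh for the real implementation:

RUN=Jun_15_2013

# 1. Fetch the archived CSV from archive.org and cache it locally
#    (the download URL is a placeholder)
curl -sLo "cache/${RUN}_requests.csv.gz" \
  "https://archive.org/download/.../httparchive_${RUN}_requests.csv.gz"

# 2. Fix the CSV escaping for BigQuery (double-quote escaping instead of
#    backslashes -- the sed rule here is illustrative), then split into
#    <1GB chunks on line boundaries, compressing each chunk in parallel
#    with pigz
gunzip -c "cache/${RUN}_requests.csv.gz" \
  | sed -e 's/\\"/""/g' \
  | split -C 900m --filter='pigz > "$FILE.gz"' - "out/${RUN}_requests_"

# 3. Sync the compressed chunks to a Google Storage bucket
gsutil cp out/${RUN}_requests_*.gz "gs://my-httparchive-bucket/${RUN}/"

# 4. Kick off a BigQuery import for each uploaded archive
#    (assumes the destination table and its schema already exist)
bq.py load --source_format=CSV runs.2013_06_15_requests \
  "gs://my-httparchive-bucket/${RUN}/${RUN}_requests_*.gz"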

After the upload is complete, a copy of the latest tables can be made with:

$> bq.py cp runs.2013_06_15_pages runs.latest_pages
$> bq.py cp runs.2013_06_15_pages_mobile runs.latest_pages_mobile
$> bq.py cp runs.2013_06_15_requests runs.latest_requests
$> bq.py cp runs.2013_06_15_requests_mobile runs.latest_requests_mobile
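
These latest_* copies give downstream queries a stable table name that does not change with each crawl date. For example, assuming the same bq.py CLI and default project as above:

$> bq.py query 'SELECT COUNT(*) FROM [runs.latest_requests]'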

(MIT License) - Copyright (c) 2013 Ilya Grigorik

Key Metrics

Overview
  • Name and Owner: HTTPArchive/bigquery
  • Primary Language: Jupyter Notebook
  • Languages: Ruby (6 languages total)
  • Platform:
  • License:

Owner Activity
  • Created: 2013-06-04 00:58:31
  • Last Push: 2025-03-02 12:54:00
  • Last Commit: 2025-03-02 12:53:58
  • Releases: 0

User Engagement
  • Stars: 67
  • Watchers: 13
  • Forks: 23
  • Commits: 502
  • Issues Enabled?
  • Issues: 42
  • Open Issues: 2
  • Pull Requests: 149
  • Open Pull Requests: 0
  • Closed Pull Requests: 14

Project Settings
  • Wiki Enabled?
  • Archived?
  • Is Fork?
  • Locked?
  • Is Mirror?
  • Is Private?