bigquery

HTTP Archive + BigQuery data import

Note: you don't need to import this data yourself; the BigQuery dataset is public. See Getting started.

However, if you do want your own private copy of the dataset, the following import and sync scripts will help you import the HTTP Archive dataset into BigQuery and keep it up to date.

$> sh sync.sh Jun_15_2013
$> sh sync.sh mobile_Jun_15_2013

That's all there is to it. The sync script handles all the necessary processing:

- Archives are fetched from archive.org (and cached locally)
- Archived CSV is transformed to BigQuery-compatible escaping
  - You will need pigz installed for parallel compression
- Request files are split into <1GB compressed CSVs
- Resulting pages and request data are synced to a Google Storage bucket
- A BigQuery import is kicked off for each of the compressed archives on Google Storage
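
The split-and-compress step can be illustrated with a minimal sketch. This is not the code in sync.sh: the function name `split_and_compress` and its arguments are hypothetical, and it assumes a fixed line count per chunk as a rough proxy for the <1GB compressed limit. It uses pigz when it is installed and falls back to gzip otherwise.

```shell
# Hypothetical helper, not part of sync.sh: split a CSV into chunks of a
# fixed number of lines and compress each chunk.
# Assumption: a line count stands in for the <1GB compressed size limit.
split_and_compress() {
  input=$1    # path to the CSV to split
  lines=$2    # maximum number of lines per chunk
  prefix=$3   # output prefix, e.g. /tmp/requests_

  # Prefer pigz (parallel gzip) when installed, otherwise use gzip.
  if command -v pigz >/dev/null 2>&1; then
    compressor=pigz
  else
    compressor=gzip
  fi

  # split names the chunks <prefix>aa, <prefix>ab, ...
  split -l "$lines" "$input" "$prefix" || return 1

  for chunk in "$prefix"*; do
    # Skip files that are already compressed from an earlier run.
    case $chunk in *.gz) continue ;; esac
    "$compressor" -f "$chunk" || return 1   # replaces chunk with chunk.gz
  done
}
```

For example, `split_and_compress requests.csv 1000000 requests_` would leave files such as requests_aa.gz and requests_ab.gz next to the input. The line count that keeps chunks under 1GB compressed depends on the data and has to be chosen by the caller.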

After the upload is complete, a copy of the latest tables can be made with:

$> bq.py cp runs.2013_06_15_pages runs.latest_pages
$> bq.py cp runs.2013_06_15_pages_mobile runs.latest_pages_mobile
$> bq.py cp runs.2013_06_15_requests runs.latest_requests
$> bq.py cp runs.2013_06_15_requests_mobile runs.latest_requests_mobile
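
To avoid retyping the date in each copy command, the four commands can be generated from the crawl label. The sketch below is illustrative only: `latest_copy_commands` is a hypothetical helper, and it assumes that labels follow the Mon_DD_YYYY pattern used with sync.sh above and that tables are named YYYY_MM_DD_<table> in the runs dataset. It prints the commands rather than running them.

```shell
# Hypothetical helper: print (not run) the copy commands for one crawl.
# Assumption: labels look like Jun_15_2013, tables like runs.2013_06_15_pages.
latest_copy_commands() {
  label=$1                       # e.g. Jun_15_2013

  # Split the label on underscores into month, day and year.
  old_ifs=$IFS
  IFS=_
  set -- $label
  IFS=$old_ifs
  mon=$1
  day=$2
  year=$3

  # Map the month abbreviation to its two-digit number.
  case $mon in
    Jan) mm=01 ;; Feb) mm=02 ;; Mar) mm=03 ;; Apr) mm=04 ;;
    May) mm=05 ;; Jun) mm=06 ;; Jul) mm=07 ;; Aug) mm=08 ;;
    Sep) mm=09 ;; Oct) mm=10 ;; Nov) mm=11 ;; Dec) mm=12 ;;
    *) echo "unknown month: $mon" >&2; return 1 ;;
  esac

  # Assumption: single-digit days are zero-padded in table names.
  case $day in [0-9]) day=0$day ;; esac

  date="${year}_${mm}_${day}"
  for table in pages pages_mobile requests requests_mobile; do
    echo "bq.py cp runs.${date}_${table} runs.latest_${table}"
  done
}
```

Running `latest_copy_commands Jun_15_2013` prints the four commands listed above; review the output before piping it to `sh`.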

Name and owner	HTTPArchive/bigquery
Primary language	Jupyter Notebook
Languages	Ruby (6 languages in total)

Created	2013-06-04 08:58:31
Last push	2025-09-09 17:57:46
Last commit	2025-09-09 17:57:43
Releases	0

Stars	68
Watchers	11
Forks	24
Commits	510
Issues	43
Open issues	2
Pull requests	152
Open pull requests	2
Closed pull requests	15
