bigquery

BigQuery import and processing pipelines

  • Owner: HTTPArchive/bigquery

HTTP Archive + BigQuery data import

Note: you don't need to import this data yourself; the BigQuery dataset is public! See Getting started.

However, if you do want your own private copy of the dataset, the following import and sync scripts will help you import the HTTP Archive dataset into BigQuery and keep it up to date.

$> sh sync.sh Jun_15_2013
$> sh sync.sh mobile_Jun_15_2013

That's all there is to it. The sync script handles all the necessary processing:

  • Archives are fetched from archive.org (and cached locally)
  • The archived CSV is transformed to BigQuery-compatible escaping
    • You will need `pigz` installed for parallel compression
  • Request files are split into <1GB compressed CSVs
  • The resulting pages and requests data are synced to a Google Storage bucket
  • A BigQuery import is kicked off for each of the compressed archives on Google Storage
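The split-and-compress step above can be sketched roughly as follows. This is a hypothetical illustration, not the code from sync.sh: the function name, the lines-per-piece parameter, and the file naming are assumptions. `gzip` stands in here; `pigz` is the drop-in parallel replacement the script actually relies on.

```shell
# Hypothetical sketch: cut a large requests CSV into pieces small enough
# that each compressed file stays under BigQuery's 1GB-per-file limit.
split_requests() {
  local src="$1" prefix="$2" lines="$3"
  # GNU split: -l lines per piece, -d numeric suffixes (prefix00, prefix01, ...)
  split -l "$lines" -d "$src" "$prefix"
  local part
  for part in "$prefix"[0-9]*; do
    gzip -f "$part"   # swap in `pigz` for parallel compression
  done
}
```

The right `lines` value depends on row width; in practice you would pick it so the compressed output lands comfortably below 1GB.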

After the upload is complete, a copy of the latest tables can be made with:

$> bq.py cp runs.2013_06_15_pages runs.latest_pages
$> bq.py cp runs.2013_06_15_pages_mobile runs.latest_pages_mobile
$> bq.py cp runs.2013_06_15_requests runs.latest_requests
$> bq.py cp runs.2013_06_15_requests_mobile runs.latest_requests_mobile
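Since the four copies differ only in the table suffix, they can be generated from the date label. A small hypothetical helper (not part of this repo) that prints the commands rather than running them, since `bq.py cp` needs authenticated BigQuery access:

```shell
# Hypothetical helper: emit the four table-copy commands for one crawl date.
# The date label format (e.g. 2013_06_15) matches the table names above.
latest_copies() {
  local date_label="$1" table
  for table in pages pages_mobile requests requests_mobile; do
    echo "bq.py cp runs.${date_label}_${table} runs.latest_${table}"
  done
}

latest_copies 2013_06_15
```

Piping the output to `sh` would execute the copies once you are authenticated.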

(MIT License) - Copyright (c) 2013 Ilya Grigorik

Overview

Name With Owner: HTTPArchive/bigquery
Primary Language: Jupyter Notebook
Program Language: Ruby (Language Count: 6)
Release Count: 0
Created At: 2013-06-04 00:58:31
Pushed At: 2024-04-23 10:49:26
Last Commit At: 2024-04-14 14:36:29
Stargazers Count: 65
Watchers Count: 14
Fork Count: 19
Commits Count: 444
Has Issues: Enabled
Issues Count: 44
Issue Open Count: 11
Pull Requests Count: 136
Pull Requests Open Count: 1
Pull Requests Close Count: 8
Has Wiki: Enabled