bigquery

BigQuery import and processing pipelines

  • Owner: HTTPArchive/bigquery

HTTP Archive + BigQuery data import

Note: you don't need to import this data yourself; the BigQuery dataset is public! See Getting started.

However, if you do want your own private copy of the dataset, the following import and sync scripts will help you import the HTTP Archive dataset into BigQuery and keep it up to date.

$> sh sync.sh Jun_15_2013
$> sh sync.sh mobile_Jun_15_2013

That's all there is to it. The sync script handles all the necessary processing:

  • Archives are fetched from archive.org (and cached locally)
  • The archived CSV is transformed to BigQuery-compatible escaping
    • You will need `pigz` installed for parallel compression
  • Request files are split into <1GB compressed CSVs
  • The resulting pages and requests data are synced to a Google Storage bucket
  • A BigQuery import is kicked off for each of the compressed archives on Google Storage
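The split-and-compress step above can be sketched roughly as follows. This is a hypothetical illustration, not the code from sync.sh: the function name, the lines-per-piece parameter, and the file naming are assumptions. `gzip` stands in here; `pigz` is the drop-in parallel replacement the script actually relies on.

```shell
# Hypothetical sketch: cut a large requests CSV into pieces small enough
# that each compressed file stays under BigQuery's 1GB-per-file limit.
split_requests() {
  local src="$1" prefix="$2" lines="$3"
  # GNU split: -l lines per piece, -d numeric suffixes (prefix00, prefix01, ...)
  split -l "$lines" -d "$src" "$prefix"
  local part
  for part in "$prefix"[0-9]*; do
    gzip -f "$part"   # swap in `pigz` for parallel compression
  done
}
```

The right `lines` value depends on row width; in practice you would pick it so the compressed output lands comfortably below 1GB.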

After the upload is complete, a copy of the latest tables can be made with:

$> bq.py cp runs.2013_06_15_pages runs.latest_pages
$> bq.py cp runs.2013_06_15_pages_mobile runs.latest_pages_mobile
$> bq.py cp runs.2013_06_15_requests runs.latest_requests
$> bq.py cp runs.2013_06_15_requests_mobile runs.latest_requests_mobile
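Since the four copies differ only in the table suffix, they can be generated from the date label. A small hypothetical helper (not part of this repo) that prints the commands rather than running them, since `bq.py cp` needs authenticated BigQuery access:

```shell
# Hypothetical helper: emit the four table-copy commands for one crawl date.
# The date label format (e.g. 2013_06_15) matches the table names above.
latest_copies() {
  local date_label="$1" table
  for table in pages pages_mobile requests requests_mobile; do
    echo "bq.py cp runs.${date_label}_${table} runs.latest_${table}"
  done
}

latest_copies 2013_06_15
```

Piping the output to `sh` would execute the copies once you are authenticated.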

(MIT License) - Copyright (c) 2013 Ilya Grigorik

Overview

Name With Owner: HTTPArchive/bigquery
Primary Language: Jupyter Notebook
Program Language: Ruby (Language Count: 6)
Release Count: 0
Created At: 2013-06-04 00:58:31
Pushed At: 2024-04-23 10:49:26
Last Commit At: 2024-04-14 14:36:29
Stargazers Count: 65
Watchers Count: 14
Fork Count: 19
Commits Count: 444
Has Issues: Enabled
Issues Count: 44
Issue Open Count: 11
Pull Requests Count: 136
Pull Requests Open Count: 1
Pull Requests Close Count: 8
Has Wiki: Enabled