building-machine-learning-pipelines

Code repository for the O'Reilly publication "Building Machine Learning Pipelines" by Hannes Hapke & Catherine Nelson

  • 所有者: Building-ML-Pipelines/building-machine-learning-pipelines
  • 平台:
  • 许可证: MIT License
  • 分类:
  • 主题:
  • 喜欢:
    0
      比较:

Github星跟踪图

Building Machine Learning Pipelines

Code repository for the O'Reilly publication "Building Machine Learning Pipelines" by Hannes Hapke & Catherine Nelson

Set up the demo project

Download the initial dataset. From the root of this repository, execute

python3 utils/download_dataset.py

After this script runs, you should have a data folder containing the file consumer_complaints_with_narrative.csv.

The dataset

The data that we use in this example project can be downloaded using the script above. The dataset is from a public dataset on customer complaints collected from the US Consumer Finance Protection Bureau. If you would like to reproduce our edited dataset, carry out the following steps:

  • Download the dataset from https://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data
  • Rename the columns to [ "product", "sub_product", "issue", "sub_issue", "consumer_complaint_narrative", "company", "state", "zip_code", "company", "company_response", "timely_response", "consumer_disputed"]
  • Filter the dataset to remove rows with missing data in the consumer_complaint_narrative column
  • In the consumer_disputed column, map Yes to 1 and No to 0

Pre-pipeline experiment

Before building our TFX pipeline, we experimented with different feature engineering and model architectures. The notebooks in this folder preserve our experiments, and we then refactored our code into the interactive pipeline below.

Interactive pipeline

The interactive-pipeline folder contains a full interactive TFX pipeline for the consumer complaint data.

Full pipelines with Apache Beam, Apache Airflow, Kubeflow Pipelines, GCP

The pipelines folder contains complete pipelines for the various orchestrators. See Chapters 11 and 12 for full details.

Chapters

The following subfolders contain stand-alone code for individual chapters.

Model analysis

Chapter 7. Stand-alone code for TFMA, Fairness Indicators, What-If Tool. Note that these notebooks will not work in JupyterLab.

Advanced TFX

Chapter 10. Notebook outlining the implementation of custom TFX components from scratch and by inheriting existing functionality. Presented at the Apache Beam Summit 2020.

Data privacy

Chapter 14. Code for training a differentially private version of the demo project. Note that the TF-Privacy module only supports TF 1.x as of June 2020.

Version notes

The code was written and tested for version 0.22.

  • As of 9/22/20, the interactive pipeline runs on TFX version 0.24.0rc1.
    Due to tiny TFX bugs, the pipelines currently don't work on the releases 0.23 and 0.24-rc0. Github issues have been filed with the TFX team specifically for the book pipelines (Issue 2500). We will update the repository once the issue is resolved.

  • As of 9/14/20, TFX only supports Python 3.8 with version >0.24.0rc0.

主要指标

概览
名称与所有者Building-ML-Pipelines/building-machine-learning-pipelines
主编程语言Jupyter Notebook
编程语言Python (语言数: 5)
平台
许可证MIT License
所有者活动
创建于2020-05-12 23:43:23
推送于2023-03-25 01:19:28
最后一次提交
发布数2
最新版本名称examples_based_on_tfx_1.4 (发布于 )
第一版名称examples_based_on_tfx_0.22 (发布于 )
用户参与
星数595
关注者数18
派生数253
提交数139
已启用问题?
问题数43
打开的问题数8
拉请求数13
打开的拉请求数1
关闭的拉请求数15
项目设置
已启用Wiki?
已存档?
是复刻?
已锁定?
是镜像?
是私有?