Pachyderm

Reproducible Data Science at Scale!


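Pachyderm: Data Versioning, Data Pipelines, and Data Lineage

Pachyderm is a tool for production data pipelines. If you need to chain together data scraping, ingestion, cleaning, munging, wrangling, processing, modeling, and analysis in a sane way, then Pachyderm is for you. If you have an existing set of scripts that do this in an ad-hoc fashion and you're looking for a way to "productionize" them, Pachyderm can make this easy for you.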

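Features

  • Containerized: Pachyderm is built on Docker and Kubernetes. Whatever languages or libraries your pipeline needs, they can run on Pachyderm, which can easily be deployed on any cloud provider or on premises.
  • Version Control: Pachyderm version controls your data as it's processed. You can always ask the system how data has changed, see a diff, and, if something doesn't look right, revert.
  • Provenance (aka data lineage): Pachyderm tracks where data comes from. Pachyderm keeps track of all the code and data that created a result.
  • Parallelization: Pachyderm can efficiently schedule massively parallel workloads.
  • Incremental Processing: Pachyderm understands how your data has changed and is smart enough to only process the new data.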

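Getting Started

Install Pachyderm locally or deploy it on AWS/GCE/Azure in about 5 minutes.

You can also refer to our complete developer documentation to see tutorials, check out example projects, and learn about advanced features of Pachyderm.

If you'd like to see some examples and learn about core use cases for Pachyderm: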

Documentation

Official Documentation

Community

Keep up to date and get Pachyderm support via:

  • Follow us on Twitter.
  • Join our community Slack channel to get help from the Pachyderm team and other users.

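Contributing

To get started, sign the Contributor License Agreement.

You should also check out our contributing guide.

Send us PRs; we would love to see what you do! You can also check our GitHub issues for items labeled "help-wanted" as a good place to start. We're sometimes bad about keeping that label up to date, so if you don't see any, just let us know.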

Join Us

We're hiring! Love Docker, Go, and distributed systems? Learn more about our team and email us at jobs@pachyderm.io.

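Usage Metrics

Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people are using Pachyderm and make it better. They can be disabled by setting the env variable METRICS to false in the pachd container.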

Main metrics

Overview
Name With Owner: pachyderm/pachyderm
Primary Language: Go
Program Language: Go (Language Count: 15)
Platform: Linux, Mac, Docker, Kubernetes, Amazon AWS, Google Cloud Platform, Microsoft Azure, Windows
License: Apache License 2.0

Owner Activity
Created At: 2014-09-04 07:50:02
Pushed At: 2025-02-03 22:27:18
Last Commit At: 2025-01-09 14:41:45
Release Count: 1491
Last Release Name: v2.12.2 (Posted on 2025-01-15 15:37:26)
First Release Name: v0.1

User Engagement
Stargazers Count: 6.3k
Watchers Count: 156
Fork Count: 566
Commits Count: 22.8k
Issues Count: 3085
Issue Open Count: 705
Pull Requests Count: 6324
Pull Requests Open Count: 230
Pull Requests Close Count: 837


Pachyderm: Data Versioning, Data Pipelines, and Data Lineage

Pachyderm is a tool for production data pipelines. If you need to chain
together data scraping, ingestion, cleaning, munging, wrangling, processing,
modeling, and analysis in a sane way, then Pachyderm is for you. If you have an
existing set of scripts which do this in an ad-hoc fashion and you're looking
for a way to "productionize" them, Pachyderm can make this easy for you.

Features

  • Containerized: Pachyderm is built on Docker and Kubernetes. Whatever
    languages or libraries your pipeline needs, they can run on Pachyderm, which
    can easily be deployed on any cloud provider or on premises.
  • Version Control: Pachyderm version controls your data as it's processed. You
    can always ask the system how data has changed, see a diff, and, if something
    doesn't look right, revert.
  • Provenance (aka data lineage): Pachyderm tracks where data comes from. Pachyderm keeps track of all the code and data that created a result.
  • Parallelization: Pachyderm can efficiently schedule massively parallel
    workloads.
  • Incremental Processing: Pachyderm understands how your data has changed and
    is smart enough to only process the new data.
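
To make the incremental processing idea above concrete, here is a minimal sketch in Python. It is a toy illustration of the general technique (fingerprint each file, then reprocess only files whose fingerprint is new or changed) and is not Pachyderm's actual implementation or API. The names fingerprint, changed_files, and process_incrementally, and the in-memory "commits", are hypothetical and chosen only for this example.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def fingerprint(data: bytes) -> str:
    """Return a content hash used to detect whether a file changed."""
    return hashlib.sha256(data).hexdigest()


def changed_files(previous: Dict[str, str], current: Dict[str, bytes]) -> List[str]:
    """Return paths that are new or whose content differs from the recorded hash."""
    return sorted(
        path for path, data in current.items()
        if previous.get(path) != fingerprint(data)
    )


def process_incrementally(
    previous: Dict[str, str],
    current: Dict[str, bytes],
    transform: Callable[[bytes], bytes],
) -> Tuple[Dict[str, bytes], Dict[str, str]]:
    """Run transform only on new or changed files; return results and new state."""
    results = {path: transform(current[path]) for path in changed_files(previous, current)}
    new_state = {path: fingerprint(data) for path, data in current.items()}
    return results, new_state


# Hypothetical data: two successive snapshots of an input directory.
state: Dict[str, str] = {}
commit1 = {"a.txt": b"hello", "b.txt": b"world"}
out1, state = process_incrementally(state, commit1, bytes.upper)

commit2 = {"a.txt": b"hello", "b.txt": b"world!", "c.txt": b"new"}
out2, state = process_incrementally(state, commit2, bytes.upper)
# out2 contains only b.txt (changed) and c.txt (new); a.txt is skipped.
```

In this sketch the second run skips a.txt because its fingerprint matches the recorded one, which is the behavior the feature list describes as only processing the new data.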

Getting Started

Install Pachyderm locally or deploy on AWS/GCE/Azure in about 5 minutes.

You can also refer to our complete documentation to see tutorials, check out example projects, and learn about advanced features of Pachyderm.

If you'd like to see some examples and learn about core use cases for Pachyderm:

Documentation

Official Documentation

Community

Keep up to date and get Pachyderm support via:

  • Follow us on Twitter.
  • Join our community Slack channel to get help from the Pachyderm team and other users.

Contributing

To get started, sign the Contributor License Agreement.

You should also check out our contributing guide.

Send us PRs; we would love to see what you do! You can also check our GitHub issues for items labeled "help-wanted" as a good place to start. We're sometimes bad about keeping that label up to date, so if you don't see any, just let us know.

Join Us

WE'RE HIRING! Love Docker, Go and distributed systems? Learn more about our open positions or email us at jobs@pachyderm.io.

Usage Metrics

Pachyderm automatically reports anonymized usage metrics. These metrics help us
understand how people are using Pachyderm and make it better. They can be
disabled by setting the env variable METRICS to false in the pachd
container.
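
As a rough illustration of the setting described above, the following Python sketch adds METRICS=false to the pachd container in a Kubernetes Deployment manifest that has been loaded into a dictionary. It assumes pachd is deployed as a standard Deployment whose containers live under spec.template.spec.containers; the helper name disable_metrics and the example manifest (including its image name) are hypothetical and not taken from the Pachyderm documentation.

```python
import copy


def disable_metrics(deployment: dict, container_name: str = "pachd") -> dict:
    """Return a copy of the manifest with METRICS set to "false" in the named container."""
    patched = copy.deepcopy(deployment)  # leave the caller's manifest untouched
    containers = patched["spec"]["template"]["spec"]["containers"]
    for container in containers:
        if container.get("name") != container_name:
            continue
        env = container.setdefault("env", [])
        for entry in env:
            if entry.get("name") == "METRICS":
                # Overwrite an existing setting; Kubernetes env values are strings.
                entry["value"] = "false"
                entry.pop("valueFrom", None)
                break
        else:
            env.append({"name": "METRICS", "value": "false"})
    return patched


# Hypothetical, heavily trimmed manifest used only to exercise the helper.
example = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {
            "name": "pachd",
            "image": "example/pachd:placeholder",
            "env": [{"name": "OTHER", "value": "1"}],
        },
    ]}}},
}
patched = disable_metrics(example)
```

The helper returns a patched copy rather than editing in place, so the original manifest can still be compared against the result before anything is applied to a cluster.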