Name With Owner	apache/gobblin
Primary Language	Java
Program language	Shell (Language Count: 10)
Platform	Docker, Linux, Mac, Windows
License:	Apache License 2.0

Created At	2014-12-01 18:10:50
Pushed At	2025-07-28 13:25:10
Last Commit At	2025-07-28 18:55:10
Release Count	26
Last Release Name	release-0.17.0 (Posted on 2023-08-30 20:48:15)
First Release Name	gobblin_0.5.0 (Posted on )

Stargazers Count	2.2k
Watchers Count	158
Fork Count	750
Commits Count	6.5k
Has Issues Enabled
Issues Count	0
Issue Open Count	0
Pull Requests Count	2270
Pull Requests Open Count	129
Pull Requests Close Count	1419


Apache Gobblin

Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.

Capabilities

Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns with inline transformations on ingest (small t).
Data Organization within the lake (e.g. compaction, partitioning, deduplication)
Lifecycle Management of data within the lake (e.g. data retention)
Compliance Management of data across the ecosystem (e.g. fine-grained data deletions)

Highlights

Battle tested at scale: Runs in production at petabyte scale at companies such as LinkedIn, PayPal, and Verizon.
Feature rich: Supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance etc.
Supports stream and batch execution modes
Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.

Common Patterns used in production

Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
Support for data sync across Federated Data Lake (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
Integrating external vendor APIs (e.g. Salesforce, Dynamics etc.) with data stores (HDFS, Couchbase etc.)
Enforcing Data retention policies and GDPR deletion on HDFS / ADLS

Apache Gobblin is NOT

A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data-processing tasks to Spark, Hive etc.
A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
A general-purpose workflow execution system like Airflow, Azkaban, Dagster, Luigi.

Requirements

Java >= 1.8

If building the distribution with tests turned on:

Maven version 3.5.3
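
The Java requirement above can be checked before starting a build. The following is a minimal sketch, not part of Gobblin: the function names java_major and java_ok are hypothetical, and the script assumes version strings in either the legacy form (1.8.0_292) or the modern form (11.0.2).

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Gobblin): print the major Java version
# for a version string such as "1.8.0_292" or "11.0.2".
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;                # legacy scheme: 1.8.x -> 8
    *)   echo "$1" | cut -d. -f1 | cut -d- -f1 ;;  # modern scheme: 11.0.2 -> 11, 17-ea -> 17
  esac
}

# Hypothetical helper: succeed when the version string satisfies Java >= 1.8.
java_ok() {
  [ "$(java_major "$1")" -ge 8 ]
}

# Usage (assumes `java` is on PATH; `java -version` prints to stderr):
#   v=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
#   if java_ok "$v"; then echo "Java $v is new enough"; else echo "Java $v is too old"; fi
```

Keeping the parsing in a function that takes a string makes it easy to try against sample version strings without a particular JDK installed.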

Instructions to download gradle wrapper

If you are going to build Gobblin from the source distribution,
run one of the following commands to download gradle-wrapper.jar from the Gobblin git repository into the gradle/wrapper directory
(replace GOBBLIN_VERSION in the URL with the version you downloaded).

wget --no-check-certificate -P gradle/wrapper https://github.com/apache/gobblin/raw/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar

(or)

curl --insecure -L https://github.com/apache/gobblin/raw/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar > gradle/wrapper/gradle-wrapper.jar

Alternatively, you can download it manually from:
https://github.com/apache/gobblin/blob/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar

Make sure that you download it to the gradle/wrapper directory.
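
The download step above can also be wrapped in a small script so the target directory is created and a failed download is reported. This is a sketch under stated assumptions: wrapper_url and fetch_wrapper are hypothetical names, the example tag release-0.17.0 is assumed to be a valid value for GOBBLIN_VERSION, and certificate verification is left enabled here, which may not work in environments where the commands above needed it disabled.

```shell
#!/bin/sh
# Hypothetical helper: build the raw download URL for a given Gobblin version.
wrapper_url() {
  echo "https://github.com/apache/gobblin/raw/$1/gradle/wrapper/gradle-wrapper.jar"
}

# Hypothetical helper: download the wrapper jar into gradle/wrapper under the
# given source root (defaults to the current directory).
fetch_wrapper() {
  version=$1
  root=${2:-.}
  mkdir -p "$root/gradle/wrapper" &&
    curl --fail --location --silent --show-error \
      --output "$root/gradle/wrapper/gradle-wrapper.jar" \
      "$(wrapper_url "$version")"
}

# Usage (run from the source root; the version is only an example):
#   fetch_wrapper release-0.17.0
```

With --fail, curl exits with a non-zero status on an HTTP error instead of saving the error page as gradle-wrapper.jar.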

Instructions to run Apache RAT (Release Audit Tool)

Extract the archive file to your local directory.
Run ./gradlew rat. The report is generated at build/rat/rat-report.html.

Instructions to build the distribution

Extract the archive file to your local directory, then choose one of the following.

Skip tests and build the distribution:

./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain

(or)

Run tests and build the distribution (requires Maven):

./gradlew build

In both cases the distribution is created in the build/gobblin-distribution/distributions directory.
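
The two build variants above can be selected from one script. The sketch below is illustrative only: gobblin_build_cmd and gobblin_build are hypothetical names, and the script assumes it is run from the root of the extracted source tree with the Gradle wrapper already in place.

```shell
#!/bin/sh
# Hypothetical helper: print the Gradle command for the chosen build mode.
gobblin_build_cmd() {
  case "$1" in
    skip-tests) echo "./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain" ;;
    with-tests) echo "./gradlew build" ;;   # this variant requires Maven
    *) echo "usage: gobblin_build_cmd skip-tests|with-tests" >&2; return 2 ;;
  esac
}

# Hypothetical helper: run the selected build, then list the produced archives.
gobblin_build() {
  cmd=$(gobblin_build_cmd "$1") || return
  # Intentionally unquoted so the command string splits into program and arguments.
  $cmd && ls build/gobblin-distribution/distributions
}

# Usage (from the source root):
#   gobblin_build skip-tests
```

Separating the command selection from its execution lets the chosen command be printed and inspected before a long build is started.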
