Apache Gobblin

A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.


Apache Gobblin

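Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.

Capabilities

  • Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns, with inline transformations on ingest (small t).
  • Data organization within the lake (e.g. compaction, partitioning, deduplication).
  • Lifecycle management of data within the lake (e.g. data retention).
  • Compliance management of data across the ecosystem (e.g. fine-grained data deletions).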

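Highlights

  • Battle tested at scale: runs in production at petabyte scale at companies such as LinkedIn, PayPal, and Verizon.
  • Feature rich: supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance, etc.
  • Supports stream and batch execution modes.
  • Control plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.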

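Common Patterns Used in Production

  • Stream / batch ingestion of Kafka to the data lake (HDFS, S3, ADLS)
  • Bulk-loading serving stores from the data lake (e.g. HDFS -> Couchbase)
  • Data sync across federated data lakes (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
  • Integrating external vendor APIs (e.g. Salesforce, Dynamics, etc.) with data stores (HDFS, Couchbase, etc.)
  • Enforcing data retention policies and GDPR deletion on HDFS / ADLS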

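Apache Gobblin is NOT

  • A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data processing tasks to Spark, Hive, etc.
  • A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
  • A general-purpose workflow execution system like Airflow, Azkaban, Dagster, or Luigi.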

Requirements

  • Java >= 1.8

If building the distribution with tests turned on:

  • Maven version 3.5.3

Instructions to run Apache RAT (Release Audit Tool)

  • Extract the archive file to your local directory.
  • Run ./gradlew rat. The report will be generated at build/rat/rat-report.html.

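Instructions to build the distribution

Extract the archive file to your local directory.

  1. Skip tests and build the distribution. Run ./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain. The distribution will be created in the build/gobblin-distribution/distributions directory. (or)
  2. Run tests and build the distribution (requires Maven). Run ./gradlew build. The distribution will be created in the build/gobblin-distribution/distributions directory.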

Quick Links


Main metrics

Overview
Name With Owner: apache/gobblin
Primary Language: Java
Program language: Shell (Language Count: 10)
Platform: Docker, Linux, Mac, Windows
License: Apache License 2.0
Owner Activity
Created At: 2014-12-01 18:10:50
Pushed At: 2025-04-21 04:30:33
Last Commit At: 2025-04-21 10:00:33
Release Count: 26
Last Release Name: release-0.17.0 (Posted on 2023-08-30 20:48:15)
First Release Name: gobblin_0.5.0 (Posted on )
User Engagement
Stargazers Count: 2.2k
Watchers Count: 161
Fork Count: 749
Commits Count: 6.5k
Has Issues Enabled
Issues Count: 0
Issue Open Count: 0
Pull Requests Count: 2260
Pull Requests Open Count: 128
Pull Requests Close Count: 1417
Project Settings
Has Wiki Enabled
Is Archived
Is Fork
Is Locked
Is Mirror
Is Private

Apache Gobblin


Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.

Capabilities

  • Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns with inline transformations on ingest (small t).
  • Data Organization within the lake (e.g. compaction, partitioning, deduplication)
  • Lifecycle Management of data within the lake (e.g. data retention)
  • Compliance Management of data across the ecosystem (e.g. fine-grained data deletions)

Highlights

  • Battle tested at scale: Runs in production at petabyte scale at companies such as LinkedIn, PayPal, and Verizon.
  • Feature rich: Supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance, etc.
  • Supports stream and batch execution modes
  • Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.

Common Patterns used in production

  • Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
  • Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
  • Data sync across federated data lakes (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
  • Integrating external vendor APIs (e.g. Salesforce, Dynamics, etc.) with data stores (HDFS, Couchbase, etc.)
  • Enforcing Data retention policies and GDPR deletion on HDFS / ADLS

Apache Gobblin is NOT

  • A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data processing tasks to Spark, Hive, etc.
  • A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
  • A general-purpose workflow execution system like Airflow, Azkaban, Dagster, Luigi.

Requirements

  • Java >= 1.8

If building the distribution with tests turned on:

  • Maven version 3.5.3
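
The Java requirement above can be checked before starting a build. The following is a minimal sketch, not part of Gobblin itself: java_major and java_ok are hypothetical helper names, and the parsing assumes the usual version string formats ("1.8.0_292" for the legacy 1.x scheme, "17.0.2" for newer releases).

```shell
#!/bin/sh
# Hypothetical helper: print the Java major version for a version string.
# Legacy strings such as "1.8.0_292" yield 8; newer ones such as "17.0.2" yield 17.
java_major() {
  case "$1" in
    1.*) printf '%s\n' "$1" | cut -d. -f2 ;;
    *)   printf '%s\n' "$1" | cut -d. -f1 | sed 's/[^0-9].*//' ;;
  esac
}

# Hypothetical helper: succeed when the version string satisfies Java >= 1.8.
java_ok() {
  [ "$(java_major "$1")" -ge 8 ]
}
```

One way to obtain the version string is java -version 2>&1 | head -n 1 | cut -d'"' -f2, since java -version writes to standard error and quotes the version on its first line.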

Instructions to download gradle wrapper

If you are going to build Gobblin from the source distribution,
run one of the following commands to download gradle-wrapper.jar from the Gobblin git repository into the gradle/wrapper directory
(replace GOBBLIN_VERSION in the URL with the version you downloaded).

wget --no-check-certificate -P gradle/wrapper https://github.com/apache/gobblin/raw/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar

(or)

curl --insecure -L https://github.com/apache/gobblin/raw/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar > gradle/wrapper/gradle-wrapper.jar

Alternatively, you can download it manually from:
https://github.com/apache/gobblin/blob/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar

Make sure that you download it to gradle/wrapper directory.
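
The download steps above can be wrapped in a small script so the target directory always exists and a failed download is not silently saved as the jar. This is a sketch under stated assumptions: wrapper_url and fetch_wrapper are hypothetical helper names, and unlike the commands above it leaves TLS certificate verification on (add --insecure only if your environment requires it).

```shell
#!/bin/sh
# Hypothetical helper: build the wrapper URL for a given GOBBLIN_VERSION,
# using the URL pattern given in the instructions above.
wrapper_url() {
  echo "https://github.com/apache/gobblin/raw/$1/gradle/wrapper/gradle-wrapper.jar"
}

# Hypothetical helper: download the jar into gradle/wrapper.
# --fail makes curl exit non-zero on HTTP errors instead of saving an error page.
fetch_wrapper() {
  mkdir -p gradle/wrapper
  curl --fail -L -o gradle/wrapper/gradle-wrapper.jar "$(wrapper_url "$1")"
}
```

Usage, from the root of the extracted source distribution: fetch_wrapper "$GOBBLIN_VERSION". Which value to use for GOBBLIN_VERSION depends on the version you downloaded; the sketch does not validate it.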

Instructions to run Apache RAT (Release Audit Tool)

  1. Extract the archive file to your local directory.
  2. Run ./gradlew rat. The report will be generated at build/rat/rat-report.html.

Instructions to build the distribution

  1. Extract the archive file to your local directory.
  2. Skip tests and build the distribution:
    Run ./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain
    The distribution will be created in build/gobblin-distribution/distributions directory.
    (or)
  3. Run tests and build the distribution (requires Maven):
    Run ./gradlew build
    The distribution will be created in build/gobblin-distribution/distributions directory.
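
The two build variants above differ only in the Gradle arguments, so a script can select between them by name. A minimal sketch, where build_command is a hypothetical helper and the mode names skip-tests and with-tests are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print the Gradle command for the chosen build mode.
# The commands themselves are the ones listed in the instructions above.
build_command() {
  case "$1" in
    skip-tests) echo "./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain" ;;
    with-tests) echo "./gradlew build" ;;
    *) echo "usage: build_command skip-tests|with-tests" >&2; return 1 ;;
  esac
}
```

For example, running eval "$(build_command skip-tests)" from the source root would start the build without tests; in either mode the distribution should then appear under build/gobblin-distribution/distributions.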

Quick Links