chombo

Big Data ETL and Utilities for Hadoop Map Reduce, Spark and Storm

Introduction

Hadoop-based ETL and various utility classes for Hadoop and Storm.

Philosophy

  • Simple to use
  • Input and output in CSV format
  • Metadata defined in a simple JSON file
  • Extremely configurable with many configuration knobs

Solution

  • Various relational algebra operations, including projection, join, etc.
  • Data extraction ETL to extract structured records from unstructured data
  • Data extraction ETL to extract structured records from JSON data
  • Data validation ETL with configurable rules and statistical parameters
  • Data profiling ETL with various techniques
  • Data transformation ETL with configurable transformation rules
  • Various statistical data exploration solutions
  • Data normalization
  • Seasonal data analysis
  • Various statistical parameter calculation
  • Various long term statistical parameter calculation with incremental data
  • Bulk insert, update and delete of Hadoop data
  • Base classes for Storm Spout and Bolt
  • Utility classes for string, configuration
  • Utility classes for Storm and Redis
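
To make the relational algebra item concrete, here is a minimal sketch of projection over a single CSV record in plain Java. It is not chombo's implementation and does not use any chombo class: the class name CsvProjection, the method project, and the sample record are all hypothetical and chosen only for illustration. In chombo the equivalent logic would run inside a Map Reduce job and be driven by configuration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of chombo.
public class CsvProjection {

    // Keep only the fields at the given zero-based column indexes,
    // in the order the indexes are listed.
    // Note: delim is passed to String.split, so it is treated as a regex.
    public static String project(String line, int[] columns, String delim) {
        // Limit -1 keeps trailing empty fields so column indexes stay stable.
        String[] fields = line.split(delim, -1);
        List<String> kept = new ArrayList<>();
        for (int col : columns) {
            kept.add(fields[col]);
        }
        return String.join(delim, kept);
    }

    public static void main(String[] args) {
        // Made-up sample record: user id, date, category, amount.
        String record = "u1001,2015-03-02,books,42.50";
        // Project the user id and amount columns.
        System.out.println(project(record, new int[] {0, 3}, ","));
    }
}
```

Listing the column indexes in a different order reorders the output fields, which is usually what a projection step in a CSV pipeline is expected to do.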

Blogs

My blog posts are a good source of details, and they are the only detailed
documentation. Map Reduce jobs in this project are used in other projects,
including sifarish and avenir. Blog posts related to those projects are also relevant.

Build

For Hadoop 1

  • mvn clean install

For Hadoop 2 (non-YARN)

  • git checkout nuovo
  • mvn clean install

For Hadoop 2 (YARN)

  • git checkout nuovo
  • mvn clean install -P yarn

For Spark

  • Build chombo first on the master branch with
    • mvn clean install
    • sbt publishLocal
  • Build chombo-spark in chombo/spark directory
    • sbt clean package

Need help?

Please feel free to email me at pkghosh99@gmail.com

Contribution

Contributors are welcome. Please email me at pkghosh99@gmail.com
