Apache Pinot

Apache Pinot (Incubating): a realtime distributed OLAP datastore.

  • Owner: apache/pinot
  • Platform: Amazon AWS, Docker, Google Cloud Platform, Kubernetes, Linux, Microsoft Azure
  • License: Apache License 2.0

Apache Pinot

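What is Apache Pinot?

Apache Pinot (incubating) is a realtime distributed OLAP datastore, built to deliver scalable realtime analytics with low latency. It can ingest from batch data sources (such as Hadoop HDFS, Amazon S3, Azure ADLS, and Google Cloud Storage) as well as stream data sources (such as Apache Kafka).

Pinot was built by engineers at LinkedIn and Uber and is designed to scale up and out with no upper bound. Performance remains constant for a given cluster size and expected queries-per-second (QPS) threshold.

For getting-started guides, deployment recipes, tutorials, and more, please visit our project documentation: https://docs.pinot.apache.org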

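Features

Pinot was originally built at LinkedIn to power rich interactive realtime analytics applications such as Who Viewed Profile, Company Analytics, and Talent Insights. UberEats Restaurant Manager is another example of a customer-facing analytics app. At LinkedIn, Pinot powers more than 50 user-facing products, ingesting millions of events per second and serving more than 100,000 queries per second at millisecond latency.

  • Column-oriented: a column-oriented database with various compression schemes such as Run Length and Fixed Bit Length.
  • Pluggable indexing: pluggable indexing technologies, including Sorted Index, Bitmap Index, and Inverted Index.
  • Query optimization: the ability to optimize the query/execution plan based on query and segment metadata.
  • Stream and batch ingestion: near-realtime ingestion from streams and batch ingestion from Hadoop.
  • Query with SQL: a SQL-like language that supports selection, aggregation, filtering, group by, order by, and distinct queries on data.
  • Upsert during realtime ingestion: update data at scale while maintaining consistency.
  • Multi-valued fields: support for multi-valued fields, allowing you to query fields as comma-separated values.
  • Cloud-native on Kubernetes: the Helm chart provides a horizontally scalable and fault-tolerant clustered deployment that is easy to manage with Kubernetes.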
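
The two compression schemes named in the feature list, Run Length and Fixed Bit Length, can be illustrated with a minimal Python sketch. This is not Pinot's implementation or on-disk format; the function names and the sample column are hypothetical, and the dictionary-encoding step is an assumption added so that string values can be bit-packed as small integers.

```python
from typing import List, Tuple


def run_length_encode(values: List[str]) -> List[Tuple[str, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs: List[Tuple[str, int]] = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


def run_length_decode(runs: List[Tuple[str, int]]) -> List[str]:
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each value by its position in a sorted dictionary (assumed step)."""
    dictionary = sorted(set(values))
    position = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [position[value] for value in values]


def fixed_bit_pack(ids: List[int]) -> Tuple[int, int]:
    """Pack non-negative ints with one shared bit width; returns (width, packed)."""
    width = max(1, max(ids).bit_length()) if ids else 1
    packed = 0
    for i, x in enumerate(ids):
        packed |= x << (i * width)
    return width, packed


def fixed_bit_unpack(width: int, packed: int, count: int) -> List[int]:
    """Recover `count` ints that were packed with the given bit width."""
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


if __name__ == "__main__":
    column = ["US", "US", "US", "DE", "DE", "FR"]  # hypothetical column values
    print(run_length_encode(column))
    dictionary, ids = dictionary_encode(column)
    width, packed = fixed_bit_pack(ids)
    print(dictionary, ids, width, fixed_bit_unpack(width, packed, len(ids)))
```

Run-length encoding pays off when a column has long runs of equal values (for example, after sorting), while fixed-bit packing pays off when a column has few distinct values, since each value then needs only a few bits.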

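When should I use Pinot?

Pinot is designed to execute realtime OLAP queries with low latency on massive amounts of data and events. In addition to realtime stream ingestion, Pinot also supports batch use cases with the same low-latency guarantees. It is suited to contexts where fast analytics, such as aggregations, are needed on immutable data, possibly with realtime data ingestion. Pinot works very well for querying time series data with lots of dimensions and metrics.

Example query: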

SELECT sum(clicks), sum(impressions) FROM AdAnalyticsTable
  WHERE
       ((daysSinceEpoch >= 17849 AND daysSinceEpoch <= 17856)) AND
       accountId IN (123456789)
  GROUP BY
       daysSinceEpoch TOP 100
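
To make the semantics of the example query concrete, here is a minimal in-memory Python sketch of the same filter, group-by, and sum. It does not talk to Pinot; the sample rows and the function name `ad_analytics` are hypothetical. How the engine chooses which groups to keep when there are more than 100 is not stated in this document, so the sketch simply keeps the earliest days, which is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical sample rows standing in for AdAnalyticsTable.
ROWS: List[dict] = [
    {"daysSinceEpoch": 17848, "accountId": 123456789, "clicks": 5, "impressions": 50},
    {"daysSinceEpoch": 17849, "accountId": 123456789, "clicks": 3, "impressions": 40},
    {"daysSinceEpoch": 17849, "accountId": 123456789, "clicks": 2, "impressions": 10},
    {"daysSinceEpoch": 17850, "accountId": 987654321, "clicks": 9, "impressions": 90},
    {"daysSinceEpoch": 17856, "accountId": 123456789, "clicks": 1, "impressions": 20},
]


def ad_analytics(
    rows: Iterable[dict],
    first_day: int,
    last_day: int,
    account_ids: Set[int],
    limit: int = 100,
) -> Dict[int, Tuple[int, int]]:
    """Return {daysSinceEpoch: (sum(clicks), sum(impressions))} for matching rows."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        in_range = first_day <= row["daysSinceEpoch"] <= last_day  # WHERE, part 1
        if in_range and row["accountId"] in account_ids:           # WHERE, part 2
            group = totals[row["daysSinceEpoch"]]                  # GROUP BY
            group[0] += row["clicks"]
            group[1] += row["impressions"]
    # Assumption: keep the `limit` earliest days when there are too many groups.
    kept = sorted(totals.items())[:limit]
    return {day: (clicks, impressions) for day, (clicks, impressions) in kept}


if __name__ == "__main__":
    print(ad_analytics(ROWS, 17849, 17856, {123456789}))
```

With the sample rows, only the three rows for account 123456789 that fall inside the day range contribute, and the two rows for day 17849 are summed into one group.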

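Pinot is not a replacement for a database: it cannot be used as a source-of-truth store and cannot mutate data. While Pinot supports text search, it is not a replacement for a search engine. Also, Pinot queries cannot span multiple tables by default. You can use the Presto-Pinot connector to achieve table joins and other features.

Building Pinot

More detailed instructions can be found in the Quick Demo section of the documentation.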

# Clone a repo
$ git clone https://github.com/apache/incubator-pinot.git
$ cd incubator-pinot
# Build Pinot
$ mvn clean install -DskipTests -Pbin-dist
# Run the Quick Demo
$ cd pinot-distribution/target/apache-pinot-incubating-<version>-SNAPSHOT-bin/apache-pinot-incubating-<version>-SNAPSHOT-bin
$ bin/quick-start-batch.sh

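Deploying Pinot to Kubernetes

Please refer to Running Pinot on Kubernetes in our project documentation. Pinot also provides Kubernetes integrations with the interactive query engine Presto and the data visualization tool Apache Superset.

Join the community

Documentation

Check out the Pinot documentation for a complete description of Pinot's features.

License

Apache Pinot is under Apache License, Version 2.0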


Overview

Name With Owner: apache/pinot
Primary Language: Java
Program language: Java (Language Count: 13)
Platform: Amazon AWS, Docker, Google Cloud Platform, Kubernetes, Linux, Microsoft Azure
License: Apache License 2.0
Release Count: 23
Last Release Name: release-1.1.0 (Posted on 2024-03-22 23:46:42)
First Release Name: release-0.1.0 (Posted on 2019-03-07 15:15:09)
Created At: 2014-05-19 23:27:48
Pushed At: 2024-04-16 12:13:18
Last Commit At: 2024-04-16 02:39:15
Stargazers Count: 5.1k
Watchers Count: 232
Fork Count: 1.2k
Commits Count: 11.4k
Has Issues Enabled
Issues Count: 2530
Issue Open Count: 1288
Pull Requests Count: 8978
Pull Requests Open Count: 227
Pull Requests Close Count: 1176
Has Wiki Enabled
Is Archived
Is Fork
Is Locked
Is Mirror
Is Private

Apache Pinot (incubating)

Apache Pinot is a realtime distributed OLAP datastore, which is used to deliver scalable real time analytics with low latency. It can ingest data from offline data sources (such as Hadoop and flat files) as well as online sources (such as Kafka). Pinot is designed to scale horizontally.

These presentations on Pinot give an overview of Pinot:

Looking for the ThirdEye anomaly detection and root-cause analysis platform? Check out the Pinot/ThirdEye project

Key Features

  • A column-oriented database with various compression schemes such as Run Length, Fixed Bit Length
  • Pluggable indexing technologies - Sorted Index, Bitmap Index, Inverted Index, Star-Tree Index
  • Ability to optimize query/execution plan based on query and segment metadata
  • Near real time ingestion from Kafka and batch ingestion from Hadoop
  • SQL-like language that supports selection, aggregation, filtering, group by, order by, and distinct queries on fact data
  • Support for multivalued fields
  • Horizontally scalable and fault tolerant

Because of the design choices we made to achieve these goals, there are certain limitations present in Pinot:

  • Pinot is not a replacement for a database, i.e., it cannot be used as a source-of-truth store and cannot mutate data
  • Pinot is not a replacement for a search engine, i.e., full-text search and relevance are not supported
  • Queries cannot span multiple tables

Pinot works very well for querying time series data with lots of dimensions and metrics. For example, it can query data such as profile views or ad campaign performance in an analytical fashion (who viewed this profile in the last weeks, how many ads were clicked per campaign).
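
The Bitmap/Inverted Index and Sorted Index entries under Key Features can be illustrated with a minimal Python sketch. This is only an illustration of the general techniques, not Pinot's implementation: the function names and sample columns are hypothetical, and plain Python integers stand in for compressed bitmaps. The Star-Tree Index is not covered here.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Tuple


def build_inverted_index(column: List[str]) -> Dict[str, int]:
    """Map each distinct value to a bitmap whose bit i is set when row i holds it."""
    index: Dict[str, int] = {}
    for doc_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << doc_id)
    return index


def bitmap_to_doc_ids(bitmap: int) -> List[int]:
    """List the row ids whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


def sorted_range(sorted_column: List[int], low: int, high: int) -> Tuple[int, int]:
    """On a sorted column, return the half-open row range with low <= value <= high."""
    return bisect_left(sorted_column, low), bisect_right(sorted_column, high)


if __name__ == "__main__":
    # Hypothetical dimension columns; row i of each list belongs to the same record.
    country = ["US", "DE", "US", "FR", "DE", "US"]
    device = ["ios", "ios", "web", "web", "ios", "ios"]

    by_country = build_inverted_index(country)
    by_device = build_inverted_index(device)

    # WHERE country = 'US' AND device = 'ios' becomes a bitwise AND of two bitmaps.
    print(bitmap_to_doc_ids(by_country["US"] & by_device["ios"]))

    # A range filter on a sorted column needs only two binary searches.
    days = [17848, 17849, 17849, 17850, 17856]
    print(sorted_range(days, 17849, 17850))
```

The design point is that filters on several dimensions reduce to cheap set operations on bitmaps, and a range filter on a sorted column reduces to a contiguous row range, so neither requires scanning every row.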

Instructions to build Pinot

More detailed instructions can be found in the Quick Demo section of the documentation.

# Clone a repo
$ git clone https://github.com/apache/incubator-pinot.git
$ cd incubator-pinot

# Build Pinot
$ mvn clean install -DskipTests -Pbin-dist

# Run the Quick Demo
$ cd pinot-distribution/target/apache-pinot-incubating-<version>-SNAPSHOT-bin
$ bin/quick-start-batch.sh

Deploy Pinot on Kubernetes

Please refer to the Kubernetes Readme to deploy Pinot using Helm and load a demo data set.

Pinot also provides Kubernetes integrations with the interactive query engine Presto and the data visualization tool Apache Superset.

Getting Involved

Documentation

Check out Pinot documentation for a complete description of Pinot's features.

License

Apache Pinot is under Apache License, Version 2.0
