ballista

PoC of distributed compute platform using Rust, Apache Arrow, and Kubernetes!

Github星跟蹤圖

Ballista

License
Version
Gitter Chat

Overview

Ballista is an experimental distributed compute platform based on Kubernetes and Apache Arrow that I am developing in my spare time for fun.

Note that the project has pivoted since the original PoC and is currently being re-implemented. The most significant change is that this is no longer a pure Rust project. I still believe that Rust is a great language for this project, but it can't be the only language. One of the key benefits of Arrow is that it supports multiple languages, including C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust. It should therefore be possible for the Ballista architecture to support more than one language. Users need the ability to execute custom code as part of a distributed compute job and likely have existing code. Users are also likely to want compatibility with more traditional data science languages such as Python or R, as well as Java.

Ballista Goals

  • Define a logical query plan in protobuf format. See ballista.proto
  • Provide DataFrame style interfaces for JVM (Java, Kotlin, Scala), Rust, and Python
  • Provide a JDBC Driver, allowing Ballista to be used from existing BI and SQL tools
  • Use Apache Flight for sending query plans between nodes, and streaming results between nodes
  • Allow clusters to be created, consisting of executors implemented in any language that supports Flight
  • Distributed compute jobs should be capable of invoking code in more than one language (with some performance trade-offs for IPC overhead)
  • Provide integrations with Apache Spark (e.g. Spark V2 Data Source allowing Spark to interact with Ballista)

Ballista Anti Goals

  • Ballista is not intended to replace Apache Spark but to augment it

Status

I learned a lot from the initial PoC (see below for a demo and more info) but have decided to start the project again due to the changes in scope mentioned above so the project is currently in a state of flux and nothing works right now but I am in the process of building a second PoC.

Here is a rough plan for delivering PoC #2:

  • Implement a Rust server implementing Flight protocol that can receive a logical plan and validate it and execute it (in progress)
  • Implement a Kotlin DataFrame client that can build a plan and execute it against the Rust server (in progress)
  • Implement a Rust DataFrame client that can build a plan and execute it against the Rust server (in progress)
  • Implement a JDBC driver that can execute a SQL statement against the Rust server (in progress)
  • Implement a Scala server implementing the Flight protocol that can receive a logical plan and translate it to Spark and execute it
  • Build a benchmark client in Kotlin that can run against the Rust and Scala servers

PoC #1

This demo shows a Ballista cluster being created in Minikube and then shows the nyctaxi example being executed, causing a distributed query to run in the cluster, with each executor pod performing an aggregate query on one partition of the data, and then the driver merges the results and runs a secondary aggregate query to get the final result.

asciicast

Here are the commands being run, with some explanation:

# create a cluster with 12 executors
cargo run --bin ballista -- create-cluster --name nyctaxi --num-executors 12 --template examples/nyctaxi/templates/executor.yaml

# check status
kubectl get pods

# run the nyctaxi example application, that executes queries using the executors
cargo run --bin ballista -- run --name nyctaxi --template examples/nyctaxi/templates/application.yaml

# check status again to find the name of the application pod
kubectl get pods

# watch progress of the application
kubectl logs -f ballista-nyctaxi-app-n5kxl

Note that PoC #1 is now archived here.

Contributing

See CONTRIBUTING.md for information on contributing to this project.

主要指標

概覽
名稱與所有者ballista-compute/ballista
主編程語言Rust
編程語言Rust (語言數: 6)
平台
許可證Apache License 2.0
所有者活动
創建於2019-07-04 17:09:41
推送於2021-04-20 00:28:06
最后一次提交
發布數17
最新版本名稱v0.4.1 (發布於 )
第一版名稱0.1.0 (發布於 )
用户参与
星數2.3k
關注者數71
派生數147
提交數606
已啟用問題?
問題數251
打開的問題數0
拉請求數366
打開的拉請求數0
關閉的拉請求數38
项目设置
已啟用Wiki?
已存檔?
是復刻?
已鎖定?
是鏡像?
是私有?