ballista

PoC of distributed compute platform using Rust, Apache Arrow, and Kubernetes!

Github星跟踪图

Ballista

License
Version
Gitter Chat

Overview

Ballista is an experimental distributed compute platform based on Kubernetes and Apache Arrow that I am developing in my spare time for fun.

Note that the project has pivoted since the original PoC and is currently being re-implemented. The most significant change is that this is no longer a pure Rust project. I still believe that Rust is a great language for this project, but it can't be the only language. One of the key benefits of Arrow is that it supports multiple languages, including C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust. It should therefore be possible for the Ballista architecture to support more than one language. Users need the ability to execute custom code as part of a distributed compute job and likely have existing code. Users are also likely to want compatibility with more traditional data science languages such as Python or R, as well as Java.

Ballista Goals

  • Define a logical query plan in protobuf format. See ballista.proto
  • Provide DataFrame style interfaces for JVM (Java, Kotlin, Scala), Rust, and Python
  • Provide a JDBC Driver, allowing Ballista to be used from existing BI and SQL tools
  • Use Apache Flight for sending query plans between nodes, and streaming results between nodes
  • Allow clusters to be created, consisting of executors implemented in any language that supports Flight
  • Distributed compute jobs should be capable of invoking code in more than one language (with some performance trade-offs for IPC overhead)
  • Provide integrations with Apache Spark (e.g. Spark V2 Data Source allowing Spark to interact with Ballista)

Ballista Anti Goals

  • Ballista is not intended to replace Apache Spark but to augment it

Status

I learned a lot from the initial PoC (see below for a demo and more info) but have decided to start the project again due to the changes in scope mentioned above so the project is currently in a state of flux and nothing works right now but I am in the process of building a second PoC.

Here is a rough plan for delivering PoC #2:

  • Implement a Rust server implementing Flight protocol that can receive a logical plan and validate it and execute it (in progress)
  • Implement a Kotlin DataFrame client that can build a plan and execute it against the Rust server (in progress)
  • Implement a Rust DataFrame client that can build a plan and execute it against the Rust server (in progress)
  • Implement a JDBC driver that can execute a SQL statement against the Rust server (in progress)
  • Implement a Scala server implementing the Flight protocol that can receive a logical plan and translate it to Spark and execute it
  • Build a benchmark client in Kotlin that can run against the Rust and Scala servers

PoC #1

This demo shows a Ballista cluster being created in Minikube and then shows the nyctaxi example being executed, causing a distributed query to run in the cluster, with each executor pod performing an aggregate query on one partition of the data, and then the driver merges the results and runs a secondary aggregate query to get the final result.

asciicast

Here are the commands being run, with some explanation:

# create a cluster with 12 executors
cargo run --bin ballista -- create-cluster --name nyctaxi --num-executors 12 --template examples/nyctaxi/templates/executor.yaml

# check status
kubectl get pods

# run the nyctaxi example application, that executes queries using the executors
cargo run --bin ballista -- run --name nyctaxi --template examples/nyctaxi/templates/application.yaml

# check status again to find the name of the application pod
kubectl get pods

# watch progress of the application
kubectl logs -f ballista-nyctaxi-app-n5kxl

Note that PoC #1 is now archived here.

Contributing

See CONTRIBUTING.md for information on contributing to this project.

主要指标

概览
名称与所有者ballista-compute/ballista
主编程语言Rust
编程语言Rust (语言数: 6)
平台
许可证Apache License 2.0
所有者活动
创建于2019-07-04 17:09:41
推送于2021-04-20 00:28:06
最后一次提交
发布数17
最新版本名称v0.4.1 (发布于 )
第一版名称0.1.0 (发布于 )
用户参与
星数2.3k
关注者数71
派生数147
提交数606
已启用问题?
问题数251
打开的问题数0
拉请求数366
打开的拉请求数0
关闭的拉请求数38
项目设置
已启用Wiki?
已存档?
是复刻?
已锁定?
是镜像?
是私有?