ballista

PoC of distributed compute platform using Rust, Apache Arrow, and Kubernetes!

Github stars Tracking Chart

Ballista

License
Version
Gitter Chat

Overview

Ballista is an experimental distributed compute platform based on Kubernetes and Apache Arrow that I am developing in my spare time for fun.

Note that the project has pivoted since the original PoC and is currently being re-implemented. The most significant change is that this is no longer a pure Rust project. I still believe that Rust is a great language for this project, but it can't be the only language. One of the key benefits of Arrow is that it supports multiple languages, including C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust. It should therefore be possible for the Ballista architecture to support more than one language. Users need the ability to execute custom code as part of a distributed compute job and likely have existing code. Users are also likely to want compatibility with more traditional data science languages such as Python or R, as well as Java.

Ballista Goals

  • Define a logical query plan in protobuf format. See ballista.proto
  • Provide DataFrame style interfaces for JVM (Java, Kotlin, Scala), Rust, and Python
  • Provide a JDBC Driver, allowing Ballista to be used from existing BI and SQL tools
  • Use Apache Flight for sending query plans between nodes, and streaming results between nodes
  • Allow clusters to be created, consisting of executors implemented in any language that supports Flight
  • Distributed compute jobs should be capable of invoking code in more than one language (with some performance trade-offs for IPC overhead)
  • Provide integrations with Apache Spark (e.g. Spark V2 Data Source allowing Spark to interact with Ballista)

Ballista Anti Goals

  • Ballista is not intended to replace Apache Spark but to augment it

Status

I learned a lot from the initial PoC (see below for a demo and more info) but have decided to start the project again due to the changes in scope mentioned above so the project is currently in a state of flux and nothing works right now but I am in the process of building a second PoC.

Here is a rough plan for delivering PoC #2:

  • Implement a Rust server implementing Flight protocol that can receive a logical plan and validate it and execute it (in progress)
  • Implement a Kotlin DataFrame client that can build a plan and execute it against the Rust server (in progress)
  • Implement a Rust DataFrame client that can build a plan and execute it against the Rust server (in progress)
  • Implement a JDBC driver that can execute a SQL statement against the Rust server (in progress)
  • Implement a Scala server implementing the Flight protocol that can receive a logical plan and translate it to Spark and execute it
  • Build a benchmark client in Kotlin that can run against the Rust and Scala servers

PoC #1

This demo shows a Ballista cluster being created in Minikube and then shows the nyctaxi example being executed, causing a distributed query to run in the cluster, with each executor pod performing an aggregate query on one partition of the data, and then the driver merges the results and runs a secondary aggregate query to get the final result.

asciicast

Here are the commands being run, with some explanation:

# create a cluster with 12 executors
cargo run --bin ballista -- create-cluster --name nyctaxi --num-executors 12 --template examples/nyctaxi/templates/executor.yaml

# check status
kubectl get pods

# run the nyctaxi example application, that executes queries using the executors
cargo run --bin ballista -- run --name nyctaxi --template examples/nyctaxi/templates/application.yaml

# check status again to find the name of the application pod
kubectl get pods

# watch progress of the application
kubectl logs -f ballista-nyctaxi-app-n5kxl

Note that PoC #1 is now archived here.

Contributing

See CONTRIBUTING.md for information on contributing to this project.

Main metrics

Overview
Name With Ownerballista-compute/ballista
Primary LanguageRust
Program languageRust (Language Count: 6)
Platform
License:Apache License 2.0
所有者活动
Created At2019-07-04 17:09:41
Pushed At2021-04-20 00:28:06
Last Commit At
Release Count17
Last Release Namev0.4.1 (Posted on )
First Release Name0.1.0 (Posted on )
用户参与
Stargazers Count2.3k
Watchers Count71
Fork Count147
Commits Count606
Has Issues Enabled
Issues Count251
Issue Open Count0
Pull Requests Count366
Pull Requests Open Count0
Pull Requests Close Count38
项目设置
Has Wiki Enabled
Is Archived
Is Fork
Is Locked
Is Mirror
Is Private