roshi

Roshi implements a time-series event storage via a LWW-element-set CRDT with
limited inline garbage collection. Roshi is a stateless, distributed layer on
top of Redis and is implemented in Go. It is partition tolerant, highly
available and eventually consistent.
At a high level, Roshi maintains sets of values, with each set ordered
according to (external) timestamp, newest-first. Roshi provides the following
API:
- Insert(key, timestamp, value)
- Delete(key, timestamp, value)
- Select(key, offset, limit) []TimestampValue
Roshi stores a sharded copy of your dataset in multiple independent Redis
instances, called a cluster. Roshi provides fault tolerance by duplicating
clusters; multiple identical clusters, normally at least 3, form a farm.
Roshi leverages CRDT semantics to ensure consistency without explicit
consensus.
Use cases
Roshi is basically a high-performance index for timestamped data. It's
designed to sit in the critical (request) path of your application or service.
The originating use case is the SoundCloud stream; see this blog post
for details.
Theory and system properties
Roshi is a distributed system, for two reasons: it's made for datasets that
don't fit on one machine, and it's made to be tolerant against node failure.
Next, we will explain the system design.
CRDT
CRDTs (conflict-free replicated data types) are data types on which the same
set of operations yields the same outcome, regardless of order of execution
and duplication of operations. This allows data convergence without the need
for consensus between replicas. In turn, this allows for easier implementation
(no consensus protocol implementation) as well as lower latency (no wait-time
for consensus).
Operations on CRDTs need to adhere [to the following rules][mixu]:
- Associativity (a+(b+c)=(a+b)+c), so that grouping doesn't matter.
- Commutativity (a+b=b+a), so that order of application doesn't matter.
- Idempotence (a+a=a), so that duplication doesn't matter.
Data types as well as operations have to be specifically crafted to meet these
rules. CRDTs have known implementations for counters, registers, sets, graphs,
and others. Roshi implements a set data type, specifically the Last Writer
Wins element set (LWW-element-set).
This is an intuitive description of the LWW-element-set:
- An element is in the set, if its most-recent operation was an add.
- An element is not in the set, if its most-recent operation was a remove.
A more formal description of a LWW-element-set, as informed by
[Shapiro][shapiro], is as follows: a set S is represented by two internal
sets, the add set A and the remove set R. To add an element e to the set S,
add a tuple t with the element and the current timestamp t=(e, now()) to A. To
remove an element from the set S, add a tuple t with the element and the
current timestamp t=(e, now()) to R. To check if an element e is in the set S,
check if it is in the add set A and not in the remove set R with a higher
timestamp.
Roshi implements the above definition, but extends it by applying a sort of
instant garbage collection. When inserting an element E to the logical set S,
check if E is already in the add set A or the remove set R. If so, check the
existing timestamp. If the existing timestamp is lower than the incoming
timestamp, the write succeeds: remove the existing (element, timestamp) tuple
from whichever set it was found in, and add the incoming (element, timestamp)
tuple to the add set A. If the existing timestamp is higher than the incoming
timestamp, the write is a no-op.
Below are all possible combinations of add and remove operations.
A(elements...) is the state of the add set. R(elements...) is the state of
the remove set. An element is a tuple with (value, timestamp). add(element)
and remove(element) are the operations.
Original state