DASlab  ·  Harvard University

Limousine

Blending learned and classical indexes to self-design
larger-than-memory cloud storage engines.

ACM SIGMOD 2024

Overview

Key-value storage engines are the backbone of modern big data systems, yet today's engines rely on fixed, static designs that do not scale gracefully with growing data volumes and shifting workloads. As datasets grow, classical in-memory navigational structures (filters, indexes) consume linearly more memory, inflating cloud costs — sometimes by 5× — with about 80% of the budget going to purchasing extra RAM.

Limousine is a self-designing key-value storage engine that automatically morphs to the near-optimal architecture given a workload, a cloud budget, and a target performance. It extends and generalizes our earlier system Cosine by introducing a fundamentally new capability: blending learned and classical data structures within a single engine so that the system can scale to larger-than-memory datasets without exploding cloud costs.

By identifying the first principles of storage engine design as combinations of learned, classical, and hybrid ("clearned") components, Limousine constructs a massive design space of 10⁴⁸ possible storage engine designs across diverse hardware and three major cloud providers (AWS, GCP, and Azure). It then searches this space in seconds to find the best design for a given context, and automatically materializes the result as ready-to-use Rust code.

Why "Limousine"?

Like a limousine that blends luxury with utility, Limousine blends the best of learned indexes (memory-efficient, distribution-aware) with classical structures (robust writes, proven guarantees) — delivering both performance and cost scalability in one engine.

Highlights

Limousine-designed engines outperform state-of-the-art systems
across diverse workloads, datasets, and cloud budgets.

1000×
Performance Gain
vs. RocksDB, WiredTiger, FASTER
10⁴⁸
Candidate Designs
Learned × Classical × Hybrid
Rust
Auto-Generated Code
From design spec to runnable engine
3
Cloud Providers
AWS · GCP · Azure
Cost Reduction
Learned indexes shrink memory bill
Seconds
Design Time
Exhaustive search of the continuum

The Design Space: Classical, Learned & Clearned

Limousine identifies three families of navigational structures — classical, learned, and clearned (hybrids) — and combines them with diverse data layout and merge strategies to span an enormous space of possible engines.

Classical Structures

Bloom filters, fence pointers, B-Trees — proven navigational structures with linear memory growth.

Learned Structures

Models that learn data distributions to create compact, tailored indexes — trading write speed for memory savings.

Clearned Hybrids

Novel combinations that pair learned indexes with classical filters and buffers for balanced read-write performance.
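As a toy sketch of the clearned idea (illustrative only: the class, the linear model, and the plain set standing in for a Bloom filter are our simplifications, not Limousine's actual components), a learned position predictor can be paired with a classical membership filter and a bounded correction search:

```python
from bisect import bisect_left

class ClearnedIndex:
    """Toy 'clearned' hybrid: a least-squares linear model predicts a key's
    position, a classical bounded binary search corrects the prediction, and
    a set stands in for a Bloom filter to reject absent keys cheaply."""

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        self.member = set(sorted_keys)          # stand-in for a Bloom filter
        n = len(sorted_keys)
        # Fit position ~ slope * key + intercept by least squares.
        mean_k = sum(sorted_keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in sorted_keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(sorted_keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Max training error bounds the local correction search.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(sorted_keys))

    def _predict(self, key):
        return round(self.slope * key + self.intercept)

    def lookup(self, key):
        if key not in self.member:              # filter rejects absent keys
            return None
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + 1)
        return bisect_left(self.keys, key, lo, hi)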

Code Materialization

Once the optimal design is chosen, Limousine automatically generates production-ready Rust code.

Existing storage engines each support a single, fixed design — an LSM-tree in RocksDB, a B-tree in WiredTiger, a log-structured hash table in FASTER. This locks in their performance characteristics and makes them inherently sub-optimal for workloads they were not designed for.

Limousine breaks this constraint by formalizing the first principles of storage engine layout: how data is organized across levels, how buffers absorb writes, how navigational structures (indexes, filters) accelerate reads, and how merge policies maintain the structure over time. Each of these axes admits multiple choices — classical or learned — and their combinations yield trillions of valid engine designs that have never appeared in literature or industry.

A key innovation is the lazy write algorithm for engines with learned components. Because learned indexes must be retrained after data changes, naively rebuilding them on every write is prohibitively expensive. Limousine introduces efficient lazy write strategies that amortize retraining cost while maintaining read accuracy, enabling holistic read-write performance optimization for the first time in engines that contain learned structures.

Distribution-Aware IO Model

Limousine uses a unified, distribution-aware IO model to accurately predict the cost and performance of any candidate design on any workload and hardware.

The Problem

With 1048 possible designs, there is no way to benchmark them all. We need analytical models that can evaluate any design in microseconds, not minutes.

Our Solution

Limousine's IO model captures the interplay between data distribution, storage layout, and cloud hardware to calculate latency and dollar cost for every query type — reads, writes, and range scans.

Pareto Continuum

The model enables Limousine to construct a navigable Pareto frontier of cost vs. performance, letting engineers pick the sweet spot for their budget and SLA requirements.

Publications

Subarna Chatterjee, Mark F. Pekala, Lev Kruglyak, Stratos Idreos
Proceedings of the ACM on Management of Data (SIGMOD), Vol. 2, No. 1, Article 47, February 2024
Subarna Chatterjee, Meena Jagadeesan, Wilson Qin, Stratos Idreos
Proceedings of the VLDB Endowment (PVLDB), 2022

People