DASlab  ·  Harvard University

LSM-based Key-Value Stores

Three research breakthroughs taught LSM-based key-value stores to tune themselves —
from optimal Bloom filters to self-navigating design continua.

SIGMOD 2017 Best Paper  ·  SIGMOD 2018  ·  SIGMOD 2019

Overview

Key-value stores power every major internet service — and they all face the same impossible tuning problem.

Every time you send a message, make a purchase, or scroll through a social feed, a key-value store is working behind the scenes. Systems like RocksDB, LevelDB, Cassandra, and HBase power Facebook's social graph, Apple's iCloud, Google's Bigtable, and countless financial trading platforms. At their core they do something deceptively simple — store and retrieve data by key. But at the scale of billions of operations per second across petabytes of data, "simple" becomes extraordinarily hard.

The dominant storage engine behind these systems is the Log-Structured Merge-tree (LSM-tree). LSM-trees balance lookup and update performance by buffering writes in memory, flushing them as sorted runs to disk, and periodically merging runs into larger ones. This design makes writes blazing fast but creates a fundamental three-way trade-off between read cost, write cost, and memory usage.
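The buffer-flush-lookup cycle just described can be sketched in a few lines. This is a toy model for intuition only (class name and capacity are illustrative) — real engines like RocksDB add levels, background merging, Bloom filters, and fence pointers:

```python
# Toy LSM sketch: writes land in an in-memory buffer; when the buffer fills,
# it is flushed to storage as a sorted run; lookups probe runs newest-first,
# so newer versions of a key shadow older ones.
class ToyLSM:
    def __init__(self, buffer_capacity=4):
        self.buffer = {}
        self.capacity = buffer_capacity
        self.runs = []  # newest first; each run is a sorted list of (key, value)

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.capacity:
            # Flush: write the buffer out as one sorted run (fast, sequential).
            self.runs.insert(0, sorted(self.buffer.items()))
            self.buffer = {}

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for run in self.runs:  # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None
```

The sketch also shows where the trade-off comes from: each flush adds another run a lookup may have to probe, which is exactly what merging (and Bloom filters) exist to mitigate.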

Tuning an LSM-tree is notoriously difficult — engineers resort to trial-and-error across dozens of interdependent knobs. This research program asks: what if a key-value store could design itself?

Who Uses LSM-Trees?

RocksDB (Meta), LevelDB (Google), Cassandra (Apache/Netflix), HBase (Yahoo), WiredTiger (MongoDB), AsterixDB, and many more. Facebook alone runs trillions of queries per day through RocksDB.

The Core Trade-Off

Read cost vs. write cost vs. memory. Want faster reads? Merge more aggressively or allocate more memory to Bloom filters. Want faster writes? Merge less — but reads get slower. No single setting works for every workload.

LSM-tree design space: Bloom filters, fence pointers, and merge operations

Highlights

Three papers, three years, one unified vision —
each expanding the frontier of what key-value stores can do.

50–90% lookup improvement · Monkey, optimal Bloom filters
Factor-of-L asymptotic gain (L = number of levels) · for free
0 extra memory · same budget, just redistributed
Pareto optimal · Dostoevsky dominates all prior designs
5 tuning knobs · LSM-Bush, unified design continuum
O(log log N) write cost · LSM-Bush breaks the log N barrier

Monkey: Optimal Navigable Key-Value Store

The first breakthrough: not all Bloom filters are created equal. Monkey proves that memory should be distributed non-uniformly across levels — and it reduces lookup cost by a factor of L for free.

Every LSM-tree uses Bloom filters to avoid unnecessary disk reads during lookups. Before Monkey, every system allocated memory to these filters uniformly — the same bits-per-element at every level. This seems fair, but it's deeply wasteful.

The key insight: the worst-case lookup cost is proportional to the sum of false positive rates across all levels. Since smaller levels contain exponentially fewer entries, a small memory redistribution can exponentially reduce their false positive rates. The mathematical result: Monkey's optimal allocation turns the sum from O(L) — linear in the number of levels — to O(1), a constant.

This is a free lunch: 50–90% lookup improvement with zero additional memory. Monkey also builds closed-form cost models for every operation, enabling automatic navigation of the entire design space — co-tuning the merge policy, size ratio, buffer size, and Bloom filter allocation to maximize throughput for any workload.
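The effect of the non-uniform allocation can be checked with a back-of-the-envelope sketch using the standard Bloom filter approximation fpr ≈ exp(−(bits/entry) · ln²2). The tree shape, memory budget, and level sizes below are illustrative, not from the paper:

```python
import math

LN2_SQ = math.log(2) ** 2  # Bloom filter approximation: fpr = exp(-bits_per_entry * ln(2)^2)

def fpr(bits, entries):
    return math.exp(-(bits / entries) * LN2_SQ)

# Illustrative tree: L levels, size ratio T, smallest level holds n1 entries.
T, L, n1 = 4, 7, 10_000
sizes = [n1 * T**i for i in range(L)]
N = sum(sizes)
budget = 10 * N  # 10 bits per entry in total

# Pre-Monkey: uniform bits-per-entry, so every level has the same fpr.
bits_per_entry = budget / N
uniform_sum = sum(fpr(bits_per_entry * n, n) for n in sizes)  # = L * (same fpr)

# Monkey: minimizing the sum of fprs under the memory budget (Lagrange
# multipliers) yields p_i proportional to n_i -- exponentially smaller
# fprs at the smaller levels. Solve for the proportionality constant c:
ln_c = -(budget * LN2_SQ + sum(n * math.log(n) for n in sizes)) / N
monkey_fprs = [min(1.0, math.exp(ln_c) * n) for n in sizes]  # clamp guards degenerate budgets
monkey_sum = sum(monkey_fprs)

assert monkey_sum < uniform_sum  # same memory, strictly lower expected lookup cost
```

With these parameters the sum of false positive rates drops by roughly 3×, and the gap widens as levels are added — the uniform sum grows linearly in L while Monkey's stays bounded by a geometric series.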


Non-Uniform Allocation

Exponentially decreasing false positive rates from the largest to smallest level.

Closed-Form Models

Navigate the full design space with analytic cost models — no benchmarking needed.

L× Improvement

Lookup cost drops by a factor equal to the number of levels. Same total memory.

Holistic Tuning

Co-tunes merge policy, size ratio, buffer, and filters for any workload.

The Analogy

Imagine a building with 10 floors. You're looking for a person. The ground floor has 10 people; the top floor has 10 billion. A "Bloom filter" at each floor is like a receptionist who can quickly say "they're definitely not here" with some probability of error. Before Monkey, every receptionist had the same error rate. Monkey asks: why not make the ground-floor receptionist nearly perfect? It takes almost no memory to make that happen — and every correct "not here" saves a trip upstairs.

Dostoevsky: Optimal Merge Policies

The second breakthrough: most of the merging work in an LSM-tree is unnecessary. Dostoevsky introduces Lazy Leveling and the Fluid LSM-tree, dominating all prior designs for common workloads.

Monkey optimized what we keep in memory. Dostoevsky optimizes the merge process itself. Classical LSM-trees offer only two merge policies: leveling (one run per level, great for reads, costly for writes) and tiering (multiple runs per level, great for writes, costly for reads). Dostoevsky proves this binary choice leaves enormous performance on the table.

Leveling: 1 run per level · great reads, costly writes · LevelDB, RocksDB
Tiering: T−1 runs per level · great writes, costly reads · Cassandra, HBase
Lazy Leveling: tiering above, leveling at the bottom · best of both worlds · Dostoevsky (2018)

The core observation: in leveling, the vast majority of merge work happens at the largest level because it holds the most data. The smaller levels are merged frequently too, but this contributes little to read performance. It's like obsessively reorganizing the top shelves of your bookcase when most of the books are in the bottom drawer.

Dostoevsky's Lazy Leveling applies tiering at smaller levels (cheap, deferred merges) and leveling at the largest level only (where it matters most for reads). The result: the same point lookup cost as leveling, the same update cost as tiering, and better short range queries than both.

Going further, the Fluid LSM-tree generalizes this with two tuning knobs: K (runs at smaller levels) and Z (runs at the largest level). With K and Z, the Fluid LSM-tree spans the entire spectrum between pure leveling (K=1, Z=1) and pure tiering (K=T−1, Z=T−1), with Lazy Leveling and every point in between as navigable options.

Lazy Leveling

Tiering at small levels, leveling at the largest only. Best of both worlds.

Fluid LSM-Tree

Tunable K and Z parameters span the full merge-policy spectrum.

Design | Smaller Levels | Largest Level | Best For
Leveling | 1 run (leveling) | 1 run (leveling) | Read-heavy workloads
Tiering | T−1 runs (tiering) | T−1 runs (tiering) | Write-heavy workloads
Lazy Leveling | T−1 runs (tiering) | 1 run (leveling) | Balanced (point lookups + updates)
Fluid LSM-tree | K runs (tunable) | Z runs (tunable) | Any workload (self-tuning)
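The trade-offs in the table above can be captured with simplified worst-case cost formulas in the spirit of Dostoevsky's analysis. This is a big-O-style sketch with constants dropped and no Bloom filters, not the paper's exact models (T is the size ratio, L the number of levels, B the entries per storage page):

```python
# Simplified worst-case costs (constants dropped; illustrative only).

def update_cost(policy, T, L, B):
    # Amortized I/Os per update: leveling re-merges each entry up to T times
    # per level; tiering merges it once per level; lazy leveling pays the
    # extra factor of T only at the largest level.
    return {"leveling": T * L / B,
            "tiering": L / B,
            "lazy-leveling": (L + T) / B}[policy]

def runs_per_lookup(policy, T, L):
    # Worst-case number of sorted runs a point lookup may probe (no filters):
    # 1 per level for leveling, up to T-1 per level for tiering, and for
    # lazy leveling T-1 at each smaller level but only 1 at the largest.
    return {"leveling": L,
            "tiering": (T - 1) * L,
            "lazy-leveling": (T - 1) * (L - 1) + 1}[policy]
```

For example, with T=10, L=5, B=128, lazy leveling's update cost sits within a single T/B term of tiering's while probing far fewer runs than tiering on lookups — the "best of both worlds" the table describes.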

LSM-Bush: Breaking the O(log N) Barrier

The third breakthrough: what if each level could have its own capacity ratio? By letting the ratios grow doubly exponentially across levels, LSM-Bush reduces write cost from O(log N) to O(log log N).

Monkey and Dostoevsky pushed the design space further than ever, but they shared a hidden assumption: the size ratio T between levels is fixed across the entire tree. The LSM-Bush paper drops this constraint, allowing increasing capacity ratios from smaller to larger levels.

The intuition: in a classical LSM-tree, write cost is proportional to the number of levels — O(log N / log T). To reduce it, you increase T, but there's a hard ceiling. LSM-Bush breaks through by setting each level's capacity ratio to the square of the previous one (T, T², T⁴, T⁸, ...). The capacity grows as a tower of exponentials, and the number of levels shrinks from log(N) to log(log(N)). For a trillion entries, that's ~5 levels instead of ~40.
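The level-count collapse is easy to verify by simulation. The sketch below counts levels for roughly a trillion entries (2^40); T=2 and the single-entry buffer are illustrative choices, not defaults of any real system:

```python
def classic_levels(n_entries, buffer_entries, T):
    # Uniform ratio: level i holds buffer * T^i entries.
    cap, levels = buffer_entries, 0
    while cap < n_entries:
        cap *= T
        levels += 1
    return levels

def bush_levels(n_entries, buffer_entries, T):
    # Each level's capacity ratio is the square of the previous one: T, T^2, T^4, ...
    cap, ratio, levels = buffer_entries, T, 0
    while cap < n_entries:
        cap *= ratio
        ratio *= ratio  # square the ratio for the next level
        levels += 1
    return levels

print(classic_levels(2**40, 1, 2))  # 40
print(bush_levels(2**40, 1, 2))     # 6
```

Forty levels collapse to about half a dozen, matching the ~5 vs ~40 figure above up to the choice of base ratio and buffer size.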

Classic LSM-tree: uniform size ratio across all levels (e.g., T=4 everywhere) · 5 levels, pyramid shape · write cost ~ O(log N)
LSM-Bush: increasing ratios across levels (e.g., T₁=2, T₂=4, T₃=16) · 3 levels for the same data, bush shape · write cost ~ O(log log N)

LSM-Bush unifies the entire design space opened by Monkey and Dostoevsky into a single framework with five tunable parameters: T (base size ratio), K (runs at smaller levels), Z (runs at the largest level), X (growth rate of capacity ratios), and C (capacity increment at the largest level). Every existing design — classical leveling, tiering, Lazy Leveling — is a single point in this continuum.

The Big Picture

LSM-Bush doesn't just add another option — it reveals that every previous design explored only a tiny slice of a much larger space. The five-parameter continuum enables a store to continuously morph its structure to match changing workloads — truly self-designing storage.

Parameter | Controls | Range
T | Base size ratio between levels | ≥ 2
K | Number of runs at non-largest levels | 1 to T−1
Z | Number of runs at the largest level | 1 to T−1
X | Growth rate of capacity ratios across levels | 0 (uniform) to large
C | Capacity ratio increment for largest level | ≥ 0

Existing Design | As a Point in LSM-Bush Space
Classical Leveling (LevelDB) | K=1, Z=1, X=0, C=0
Classical Tiering (Cassandra) | K=T−1, Z=T−1, X=0, C=0
Lazy Leveling (Dostoevsky) | K=T−1, Z=1, X=0, C=0
LSM-Bush (full) | K, Z, X, C all tunable · new Pareto-optimal designs
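One way to make the continuum concrete is to encode each classical design as a point in the five-knob space. This is a hypothetical encoding following the table above, not any real system's API:

```python
def classify(T, K, Z, X, C):
    # Validate the knob ranges from the parameter table, then map the point
    # to its classical name if it has one.
    assert T >= 2, "size ratio must be at least 2"
    assert 1 <= K <= T - 1 and 1 <= Z <= T - 1, "runs per level bounded by T-1"
    assert X >= 0 and C >= 0
    if (K, Z, X, C) == (1, 1, 0, 0):
        return "classical leveling"
    if (K, Z, X, C) == (T - 1, T - 1, 0, 0):
        return "classical tiering"
    if (K, Z, X, C) == (T - 1, 1, 0, 0):
        return "lazy leveling"
    return "LSM-Bush continuum point"
```

Every named design occupies a single point; the rest of the space is new territory that only the full five-parameter continuum can reach.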

Monkey in Action

Try it yourself. Choose a configuration below or enter custom parameters to visualize the performance trade-off curves and the detailed Bloom filter tuning of the resulting LSM-tree.

All three configurations correspond to a 1 TB LSM-tree with 16-byte entries. The first two reflect the default tunings of two state-of-the-art systems, LevelDB and Cassandra; for each, the two curves show the benefit of switching to Monkey's Bloom filter memory allocation. The third configuration is an edge case in which no memory is allocated to the Bloom filters, so the trade-off curves converge.

The demo's parameters: (1) number of entries, (2) entry size in bytes, (3) page size in bytes · main memory allocation: (4) buffer size in MB, (5) Bloom filters' overall size in MB · merge operation frequency: (6) merge policy (leveling or tiering), (7) size ratio.
Figure 1. Trade-off curves between the lookup and update cost for Monkey. As a baseline, we also plot the analogous trade-off curve for state-of-the-art designs that assign the same false positive rate to filters across all levels.




Figure 2. A detailed comparison of Bloom filter tuning across the levels of the LSM-tree for the state of the art and for Monkey. Monkey allocates relatively more main memory to filters at shallower levels of the LSM-tree. The worst-case lookup cost equals the sum of false positive rates across all levels, and since shallower levels contain exponentially fewer entries, redistributing memory from the larger levels to the shallower ones exponentially reduces their false positive rates.

Putting It All Together

Three papers, one unified vision: key-value stores that design themselves. Each paper expanded the dimensionality of the design space.

Before: fixed design · manual tuning
SIGMOD 2017: Monkey · optimal memory
SIGMOD 2018: Dostoevsky · optimal merging
SIGMOD 2019: LSM-Bush · optimal shape
Result: LSM-based key-value stores with 5 tunable knobs

Monkey (2017)

Optimized memory allocation. Proved Bloom filter memory should be distributed non-uniformly. Reduced lookup cost by a factor of L for free. Introduced closed-form cost models for navigating the design space.

Dostoevsky (2018)

Optimized merge policy. Discovered most merging at smaller levels is superfluous. Introduced Lazy Leveling and the Fluid LSM-tree with tunable K and Z, dominating all prior merge policies for common workloads.

LSM-Bush (2019)

Optimized tree shape. Broke the O(log N) write barrier with variable capacity ratios. Unified all designs into a five-parameter continuum, achieving O(log log N) writes — a doubly-exponential improvement.

Impact

For industry: Bloom filter optimization from Monkey has been adopted in production systems. The cost models provide a principled way to auto-tune stores instead of relying on expert intuition. For systems serving trillions of queries daily, even a small improvement translates to millions in savings.

For science: The LSM-Bush result shows the long-assumed O(log N) barrier is not fundamental. The design continuum connects to the broader vision of self-designing data systems — systems that automatically select, combine, and morph their internal structures based on the workload.

For developers: Instead of reading a 50-page tuning guide, describe your workload and get a mathematically optimal configuration. The models answer "what-if" questions in microseconds: What if we add memory? What if the workload shifts? What if we move from HDD to SSD?


Publications

Niv Dayan, Manos Athanassoulis, Stratos Idreos
Monkey: Optimal Navigable Key-Value Store
SIGMOD 2017 Best Paper Award · Proceedings of the ACM SIGMOD International Conference on Management of Data, 2017

Niv Dayan, Stratos Idreos
Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging
Proceedings of the ACM SIGMOD International Conference on Management of Data, 2018

Niv Dayan, Stratos Idreos
The Log-Structured Merge-Bush & the Wacky Continuum
Proceedings of the ACM SIGMOD International Conference on Management of Data, 2019

People