Three research breakthroughs taught LSM-based key-value stores to tune themselves —
from optimal Bloom filters to self-navigating design continua.
Key-value stores power every major internet service — and they all face the same impossible tuning problem.
Every time you send a message, make a purchase, or scroll through a social feed, a key-value store is working behind the scenes. Systems like RocksDB, LevelDB, Cassandra, and HBase power Facebook's social graph, Apple's iCloud, Google's Bigtable, and countless financial trading platforms. At their core they do something deceptively simple — store and retrieve data by key. But at the scale of billions of operations per second across petabytes of data, "simple" becomes extraordinarily hard.
The dominant storage engine behind these systems is the Log-Structured Merge-tree (LSM-tree). LSM-trees balance lookup and update performance by buffering writes in memory, flushing them as sorted runs to disk, and periodically merging runs into larger ones. This design makes writes blazing fast but creates a fundamental three-way trade-off between read cost, write cost, and memory usage.
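The buffer-flush-lookup cycle described above can be made concrete with a deliberately tiny Python sketch (my own toy illustration, nothing like a production engine): writes land in an in-memory dict, the dict is flushed as a sorted run when full, and lookups check the buffer first and then the runs from newest to oldest. Compaction, the merging of runs into larger ones, is where the trade-offs discussed below live; it is only noted in a comment here.

```python
class ToyLSMTree:
    """Minimal LSM-tree sketch: an in-memory buffer plus sorted runs on 'disk'."""

    def __init__(self, buffer_size=4):
        self.buffer_size = buffer_size
        self.buffer = {}           # memtable: absorbs writes in memory
        self.runs = []             # newest-first list of sorted runs

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_size:
            self._flush()

    def _flush(self):
        # Flush the buffer as a sorted run. Real systems would also merge
        # runs into larger ones in the background (compaction).
        self.runs.insert(0, sorted(self.buffer.items()))
        self.buffer = {}

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        # Search newest run first so later writes shadow earlier ones.
        for run in self.runs:
            for k, v in run:       # real systems binary-search and use Bloom filters
                if k == key:
                    return v
        return None
```

Even in this toy, the tension is visible: a write touches only the in-memory buffer, but a lookup may have to inspect every run, which is exactly why merging and Bloom filters matter.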
Tuning an LSM-tree is notoriously difficult — engineers resort to trial-and-error across dozens of interdependent knobs. This research program asks: what if a key-value store could design itself?
RocksDB (Meta), LevelDB (Google), Cassandra (Apache/Netflix), HBase (Yahoo), WiredTiger (MongoDB), AsterixDB, and many more. Facebook alone runs trillions of queries per day through RocksDB.
Read cost vs. write cost vs. memory. Want faster reads? Merge more aggressively or allocate more memory to Bloom filters. Want faster writes? Merge less — but reads get slower. No single setting works for every workload.
Three papers, three years, one unified vision —
each expanding the frontier of what key-value stores can do.
The first breakthrough: not all Bloom filters are created equal. Monkey proves that memory should be distributed non-uniformly across levels — and it reduces lookup cost by a factor of L for free.
Every LSM-tree uses Bloom filters to avoid unnecessary disk reads during lookups. Before Monkey, every system allocated memory to these filters uniformly — the same bits-per-element at every level. This seems fair, but it's deeply wasteful.
The key insight: the worst-case lookup cost is proportional to the sum of false positive rates across all levels. Since smaller levels contain exponentially fewer entries, a small memory redistribution can exponentially reduce their false positive rates. The mathematical result: Monkey's optimal allocation turns the sum from O(L) — linear in the number of levels — to O(1), a constant.
This is a free lunch: 50–90% lookup improvement with zero additional memory. Monkey also builds closed-form cost models for every operation, enabling automatic navigation of the entire design space — co-tuning the merge policy, size ratio, buffer size, and Bloom filter allocation to maximize throughput for any workload.
Exponentially decreasing false positive rates from the largest to smallest level.
Navigate the full design space with analytic cost models — no benchmarking needed.
Lookup cost drops by a factor equal to the number of levels. Same total memory.
Co-tunes merge policy, size ratio, buffer, and filters for any workload.
Imagine a building with 10 floors. You're looking for a person. The ground floor has 10 people; the top floor has 10 billion. A "Bloom filter" at each floor is like a receptionist who can quickly say "they're definitely not here" with some probability of error. Before Monkey, every receptionist had the same error rate. Monkey asks: why not make the ground-floor receptionist nearly perfect? It takes almost no memory to make that happen — and every correct "not here" saves a trip upstairs.
The second breakthrough: most of the merging work in an LSM-tree is unnecessary. Dostoevsky introduces Lazy Leveling and the Fluid LSM-tree, dominating all prior designs for common workloads.
Monkey optimized what we keep in memory. Dostoevsky optimizes the merge process itself. Classical LSM-trees offer only two merge policies: leveling (one run per level, great for reads, costly for writes) and tiering (multiple runs per level, great for writes, costly for reads). Dostoevsky proves this binary choice leaves enormous performance on the table.
| Merge Policy | Structure | Trade-off | Example Systems |
|---|---|---|---|
| Leveling | 1 run per level | Great reads, costly writes | LevelDB, RocksDB |
| Tiering | T−1 runs per level | Great writes, costly reads | Cassandra, HBase |
| Lazy Leveling | Tiering above, leveling at bottom | Best of both worlds | Dostoevsky (2018) |
The core observation: in leveling, the vast majority of merge work happens at the largest level because it holds the most data. The smaller levels are merged frequently too, but this contributes little to read performance. It's like obsessively reorganizing the top shelves of your bookcase when most of the books are in the bottom drawer.
Dostoevsky's Lazy Leveling applies tiering at smaller levels (cheap, deferred merges) and leveling at the largest level only (where it matters most for reads). The result: the same point lookup cost as leveling, the same update cost as tiering, and short-range query cost in between the two.
Going further, the Fluid LSM-tree generalizes this with two tuning knobs: K (runs at smaller levels) and Z (runs at the largest level). With K and Z, the Fluid LSM-tree spans the entire spectrum between pure leveling (K=1, Z=1) and pure tiering (K=T−1, Z=T−1), with Lazy Leveling and every point in between as navigable options.
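The mapping from (K, Z) to tree shape is simple enough to sketch directly. This toy Python function (my own notation, following the parameter names above) returns how many runs each level may hold, and shows the classical designs as corner settings of the same two knobs:

```python
def fluid_runs(T, L, K, Z):
    """Max runs per level in a Fluid LSM-tree with L levels:
    up to K runs at levels 1..L-1, up to Z runs at the largest level."""
    return [K] * (L - 1) + [Z]

T, L = 10, 5
leveling      = fluid_runs(T, L, K=1,     Z=1)        # [1, 1, 1, 1, 1]
tiering       = fluid_runs(T, L, K=T - 1, Z=T - 1)    # [9, 9, 9, 9, 9]
lazy_leveling = fluid_runs(T, L, K=T - 1, Z=1)        # [9, 9, 9, 9, 1]

# Without Bloom filters, a zero-result point lookup may probe every run,
# so the total run count is a rough proxy for worst-case lookup work.
print(sum(leveling), sum(tiering), sum(lazy_leveling))   # 5 45 37
```

The run counts hint at why Lazy Leveling works: the single run at the largest level, which holds most of the data, is what keeps point lookups cheap, while the relaxed smaller levels keep merging cheap.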
Tiering at small levels, leveling at the largest only. Best of both worlds.
Tunable K and Z parameters span the full merge-policy spectrum.
| Design | Smaller Levels | Largest Level | Best For |
|---|---|---|---|
| Leveling | 1 run (leveling) | 1 run (leveling) | Read-heavy workloads |
| Tiering | T−1 runs (tiering) | T−1 runs (tiering) | Write-heavy workloads |
| Lazy Leveling | T−1 runs (tiering) | 1 run (leveling) | Balanced (point lookups + updates) |
| Fluid LSM-tree | K runs (tunable) | Z runs (tunable) | Any workload (self-tuning) |
The third breakthrough: what if each level could have its own capacity ratio? LSM-Bush reduces write cost from O(log N) to O(log log N) by growing level capacities doubly exponentially.
Monkey and Dostoevsky pushed the design space further than ever, but they shared a hidden assumption: the size ratio T between levels is fixed across the entire tree. The LSM-Bush paper drops this constraint, allowing increasing capacity ratios from smaller to larger levels.
The intuition: in a classical LSM-tree, write cost is proportional to the number of levels, O(log N / log T). To reduce it, you increase T, but there's a hard ceiling. LSM-Bush breaks through by setting each level's capacity ratio to the square of the previous one (T, T², T⁴, T⁸, ...). Level capacities then grow doubly exponentially, and the number of levels shrinks from O(log N) to O(log log N). For a trillion entries with T=2, that's roughly 5 levels instead of roughly 40.
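The shrinking level count is easy to check numerically. A back-of-the-envelope sketch (my own simplification with T=2, ignoring buffer size and constant factors) counts how many levels are needed to hold N entries under each growth rule:

```python
def levels_fixed(N, T=2):
    """Levels needed when every level is T times larger than the previous one."""
    L, cap = 0, 1
    while cap < N:
        cap *= T
        L += 1
    return L

def levels_bush(N, T=2):
    """Levels needed when each ratio is the square of the previous (T, T^2, T^4, ...)."""
    L, cap, ratio = 0, 1, T
    while cap < N:
        cap *= ratio
        ratio *= ratio
        L += 1
    return L

N = 10 ** 12                 # a trillion entries
print(levels_fixed(N))       # 40
print(levels_bush(N))        # 6
```

The exact counts depend on T and the constants ignored here, but the gap between the two growth rules is the point: squaring the ratios collapses dozens of levels into a handful.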
Classical LSM-tree: Uniform T across all levels · 5 levels · Pyramid shape · Write cost ~ O(log N)
LSM-Bush: Increasing T across levels · 3 levels (same data!) · Bush shape · Write cost ~ O(log log N)
LSM-Bush unifies the entire design space opened by Monkey and Dostoevsky into a single framework with five tunable parameters: T (base size ratio), K (runs at smaller levels), Z (runs at the largest level), X (growth rate of capacity ratios), and C (capacity increment at the largest level). Every existing design — classical leveling, tiering, Lazy Leveling — is a single point in this continuum.
LSM-Bush doesn't just add another option — it reveals that every previous design explored only a tiny slice of a much larger space. The five-parameter continuum enables a store to continuously morph its structure to match changing workloads — truly self-designing storage.
| Parameter | Controls | Range |
|---|---|---|
| T | Base size ratio between levels | ≥ 2 |
| K | Number of runs at non-largest levels | 1 to T−1 |
| Z | Number of runs at the largest level | 1 to T−1 |
| X | Growth rate of capacity ratios across levels | 0 (uniform) to large |
| C | Capacity ratio increment for largest level | ≥ 0 |
| Existing Design | As a Point in LSM-Bush Space |
|---|---|
| Classical Leveling (LevelDB) | K=1, Z=1, X=0, C=0 |
| Classical Tiering (Cassandra) | K=T−1, Z=T−1, X=0, C=0 |
| Lazy Leveling (Dostoevsky) | K=T−1, Z=1, X=0, C=0 |
| LSM-Bush (full) | K, Z, X, C all tunable — new Pareto-optimal designs |
Try it yourself. Choose a configuration below or enter custom parameters to visualize the performance trade-off curves and the detailed Bloom filter tuning of the resulting LSM-tree.
All three configurations correspond to a 1 TB LSM-tree with 16-byte entries. The first two reflect the default tunings of two state-of-the-art systems, LevelDB and Cassandra. For each system, the two lines show the benefit of integrating Monkey's memory allocation for Bloom filters. The third configuration is an edge case in which no memory is allocated to the Bloom filters, and the trade-off curves converge.
Figure 1. Trade-off curves between the lookup and update cost for Monkey. As a baseline, we also plot the analogous trade-off curve for state-of-the-art designs that assign the same false positive rate to filters across all levels.
Click on one of the scenarios above and get ready for Monkey business.
Figure 2. A detailed comparison of Bloom filter tuning across different levels of the LSM-tree for the state of the art and Monkey. Monkey allocates relatively more main memory to filters at shallower levels of the LSM-tree. The worst-case lookup cost is equal to the sum of false positive rates across all levels, and since shallower levels contain exponentially fewer entries, we can exponentially reduce their false positive rates by redistributing memory from the deeper, wider levels to the shallower levels.
Three papers, one unified vision: key-value stores that design themselves. Each paper expanded the dimensionality of the design space.
Optimized memory allocation. Proved Bloom filter memory should be distributed non-uniformly. Reduced lookup cost by a factor of L for free. Introduced closed-form cost models for navigating the design space.
Optimized merge policy. Discovered most merging at smaller levels is superfluous. Introduced Lazy Leveling and the Fluid LSM-tree with tunable K and Z, dominating all prior merge policies for common workloads.
Optimized tree shape. Broke the O(log N) write barrier with variable capacity ratios. Unified all designs into a five-parameter continuum, achieving O(log log N) write cost through doubly-exponentially growing level capacities.
For industry: Bloom filter optimization from Monkey has been adopted
in production systems. The cost models provide a principled way to auto-tune
stores instead of relying on expert intuition. For systems serving trillions of
queries daily, even a small improvement translates to millions in savings.
For science: The LSM-Bush result shows the long-assumed O(log N)
barrier is not fundamental. The design continuum connects to the broader vision of
self-designing data systems — systems that automatically select, combine,
and morph their internal structures based on the workload.
For developers: Instead of reading a 50-page tuning guide, describe
your workload and get a mathematically optimal configuration. The models answer
"what-if" questions in microseconds: What if we add memory? What if the workload
shifts? What if we move from HDD to SSD?