Monkey: LSM Tuning - DASlab @ Harvard University

Key-Value Stores: A Complex Design Space

Modern, persistent key-value stores are used throughout numerous applications, from social networks to document stores. We show that the design space of modern key-value stores is complex and that existing designs are sub-optimal.

LSM-tree Background

Typically, key-value stores use an LSM-tree as a storage backend. LSM-trees balance lookup and update performance, by (i) buffering updates, (ii) creating sorted runs, and (iii) periodically merge them forming a series of sorted runs with exponentially increasing size. Searching for a value requires to probe in the worst case all sorted runs, a cost that can modern key-value stores mitigate by employing (a) fence pointers and (b) Bloom filters, both maintained in main-memory. The above design elements of key-value stores exhibit an intrinsic trade-off between lookup cost, update cost, and main memory utilization. Existing designs enable a suboptimal and difficult to tune trade-off among these metrics.

Key-Value Store Design with Monkey

Monkey is an LSM-based key-value store that is able to strike the best possible balance between the costs of updates and lookups for any given application. By mapping the possible design space, we show how to optimally navigate the key-value store design space holistically; Monkey co-tunes the merge policy, the size ratio between levels, the buffer size, and the Bloom filters’ false positive rates.

Monkey improves read performance by optimal allocation of memory to the bloom filters. The core insight is that the worst-case lookup cost is proportional to the sum of the false positive rates of the Bloom filters across all levels of the LSM-tree. Contrary to state of the art that assigns a fixed number of bits-per-element to all Bloom filters, Monkey allocates memory to filters across different levels so as to minimize the sum of their false positive rates.

Monkey improves the worst-case lookup cost by 50-90% and can also answer what-if questions about how changes in environmental parameters would impact performance and how to adapt the various design elements accordingly.

Monkey in Action

Click on one of the three configuration buttons below or enter a custom configuration to visualize the corresponding performance trade-off curves in Figure 1 and the detailed tuning of the resulting LSM-tree in Figure 2.

All three configurations correspond to 1 TB LSM-tree with 16 byte entries. The first two configurations reflect the default tuning of two state-of-the-art systems, LevelDB and Cassandra respectively. For each system the two lines shown in Figure 1 show the benefit it can achieve by integrating Monkey's memory allocation for Bloom filters. The third configuration is an edge-case where no memory is allocated to the Bloom filters, and so the trade-off curves for Monkey and state-of-the-art designs converge.

LevelDB

Cassandra

No Bloom filters

Dataset

Environment

1. # Entries

2. Entry size (bytes)

3. Page size (bytes)

Main Memory Allocation

Merge Operation Frequency

4. Buffer size (MB)

6. Merge policy leveling tiering

5. Bloom filters' overall size (MB)

7. Size ratio

Figure 1. Trade-off curves between the lookup and update cost for Monkey. As a baseline, we also plot the analogous trade-off curve for state-of-the-art designs that assign the same false positive rate to filters across all levels. Show more...

Click on one of the scenarios above and get ready for Monkey business.

Figure 2. A detailed comparison of Bloom filters tuning across different levels of the LSM-tree for the state of the art and Monkey. Monkey allocates relatively more main memory to filters at shallower levels of the LSM-tree. The worst-case lookup cost is equal to the sum of false positive rates across all levels, and since shallower levels contain exponentially fewer entries, we can exponentially reduce their false positive by redistribute memory from the wider levels to the shallower levels. The LSM-tree structure and visualization in this figure is dynamically generated based on the configuration selected above.