The systems project for CS265 is designed to provide background on state-of-the-art systems, data structures, and algorithms. It includes a design component and an implementation component in C or C++, dealing with low level systems issues such as memory management, hardware conscious processing, parallel processing, managing read/write tradeoffs, and scalability. Systems projects will be done individually, each student is required to work on their own. This is a focused project that should not necessarily result in many lines of code (like the cs165 project), but will exercise your understanding of modern system design issues and tradeoffs.
The goal of the project is to build a modern key-value store. The system will be based on an LSM tree design. There are two parts in the project. In the first part, you will build a basic single thread key-value store that will include all major design components as in state-of-the-art key-value stores. In the second part, you will extend the design to support multiple threads and be able to process multiple queries in parallel, while utilizing all cores of a modern server machine.
Students that finish the systems project quickly and want to work on research will be able to do so by exploring open research topics directly on top of their LSM-tree. We have plenty of open topics in this area.
As storage becomes cheaper, and modern applications require both efficient lookups and updates, LSM-trees provide a good balance between the two by avoiding expensive in-place updates. They work by buffering updates (and deletes) in main memory and deferring their persistence to a later time. While such designs slightly penalize reads, which then need to search various different levels of data structures, they allow writes to be performed efficiently, and dramatically increase data ingestion rates.
Today, LSM-trees are used in a variety of systems, typically key-value stores, as a means of fast indexing of heavily updated data. Google’s BigTable, Facebook's Cassandra, and a number of research prototypes have implemented their own version of LSM-trees.
Data Layout. An LSM-tree consists of a hierarchy of storage levels. The first level, called C0 or buffer, is generally stored in main memory, and its purpose is to buffer updates for efficiency. The rest of the levels are stored on disk and are usually append only. The figure above on the left gives a high level visual of an LSM-tree.
Updates. Updates in LSM-trees are fast since they are always first performed in main memory (level C0) and they are gradually moved to the disk at later phases. Updates and inserts are treated in the same way, since K-V stores do not allow for duplicates. Deletes are also performed in the same way as updates, but with a special marker, which denotes this record as “deleted”. Since each level of an LSM-tree is restricted in size using thresholds, each of these operations are merged to the highest levels of the LSM-tree in a deferred and batched method. This process is called merging and is triggered when C0 reaches its threshold size.
Merging. The merge process creates one new sorted array, called a run and pushes it to higher levels, possibly causing a cascade through other levels. Initially a run is constructed on the data of C0 and flushed to higher levels. If the flushed run now also causes C1 to reach is capacity threshold, then this run is sort merged with the contents of C1, forming one larger run and pushed one level higher in a cascading way. When data are sort-merged, deletes and updates are performed, keeping only the newest values. This process can be seen on the figure above on the right hand size.
Lookups. A single key lookup is performed first in C0, and if no match is found, this lookup is propagated to higher levels of the tree from the most recent (in-memory) to the oldest (largest). Range queries are also evaluated against all levels of the tree. To avoid doing a binary search on every single run during query answering, various optimizations can be performed. Common ones are: fence pointers and Bloom filters. Fence pointers are ranges formed by the first records of each page (every X pages), allowing a lookup to access only the part of a run with the range that a target key falls into. Bloom filters on the other hand are used to speed up single key probes, if the bloom filter on a run returns false, we don’t need to search it.
The project is divided into two parts. The first part is about designing the basic structure of an LSM-tree for reads and writes, while the second part is about designing and implementing the same functionality in a parallel way so we can support multiple concurrent reads and writes. Each part will have a first milestone: a design document. Here you will discuss your design and development strategy and get feedback from the teaching staff.
The project is purposely left open ended. That is you have complete freedom in how you design each part of your key-value store to achieve as much functionality as possible. For example, each level of the LSM-tree may be designed in its own way; each level may be a complex data structure itself, like a tree, or a plain one, like a simple array. The choice usually depends on the size of the available memory and the intended purpose of the design. We do expect a “minimum design” where every level is a sorted array and includes a bloom filter to easily determine if a key is not contained in the level. Students may alter this design in an effort to achieve faster reads or faster writes but they should not revert to simpler designs (e.g., making a level just a plain array).
In addition to your chosen run strategy, your design must include bloom filters and fence pointers (or experimental/theoretical justification for omitting them). Bloom filters are held in main memory to determine, during a get, whether a key exists in the tree (or more commonly exists at a given level of the tree). Fence pointers are a collection of keys that are also kept in memory. Since each run on disk is sorted (or at least behaves as though it were sorted) but may span many pages, the first value of each page (a fence pointer) is maintained in memory which allows a search for a specific key to cost no more than one I/O.
Other design decisions include the size ratio between the levels, whether each level will consist of one or more structures (e.g., sorted arrays), and how merging happens and when it is triggered open to the designer (you). You should consult the papers Monkey and Dostoevsky (link to be provided soon) for the modern approach to leveling, tiering, merge policies, and memory allocation across the fence pointers and bloom filters.
We expect students to justify, in detail, decisions about these core design options.
Each student should describe, implement, and evaluate at least three different optimizations that support the performance requirements.
Every implementation should strictly support the CS265 API.
You can find us every day during Labs and OH to discuss your design and ideas for optimizations. Once you are done with the basic design, research opportunities begin as there are many open opportunities in the design space of LSM-trees.
Midway check-in (March 9)
Final Project Evaluations (May 3 - May 12)
The final deliverable consists of:
Each student with a systems project will have two 30 minute evaluation meetings at the end of the semester. A code review and demo meeting with the TFs and an experimental analysis meeting with Stratos. For remote students the evaluation meeting will take place via Zoom and screen sharing.
The code deliverable of this project is an LSM Tree implementation (both single threaded and parallel) with a number of tunable parameters: size ratio between levels, storage layer used for each level (RAM, SSD, HDD), different merging strategies, and any additional tuning parameters the students design. The final LSM Tree is expected to support high update rate (in the order of 100K-1M updates per second for flash storage and 1K-10K of updates per second for HDD storage), while at the same time provide efficient reads (in the order of 1K-5K reads per second for flash storage and 20-100 reads per second for HDD storage). Key design decisions include, among others: how to protect the levels during merging, how to do the merging, and how to avoid reading unnecessary levels (if any). The parallel LSM Tree is expected to scale with number of cores, that is, we expect to see that as the number of cores used increases the performance of the tree is precipitously increasing. The software will be submitted as a code.seas repository with at least read access for the teaching staff. If you have not previously used code.seas please see this guide.
The students are expected to generate a report presenting their final design, as well as the behavior and performance of their system. This report will be structured as a short systems research paper, having the following sections:
Students are expected to create a demonstration scenario where they will show the behavior of their LSM Trees. In essence we expect to see in the demo different datasets and workloads that will lead to different behavior of the LSM Tree in order to get a clear idea about its behavior. For example, a workload consisting of skewed updates (inserting values from a small set of values again and again) and a workload consisting of evenly distributed updates, will lead to different behavior of the system. Combining this, with either skewed or evenly distributed reads we get four workloads that will behave very differently. We expect to see such cases in the demo in order to understand the trade-offs of the LSM Tree designed.
The demo is expected to have a 10/15-minute duration, focusing on the design decisions and the knobs used in the presented LSM design. A code review will also take place during this time. Students are expected to be able to quickly make changes in their code to adjust behavior and rerun experiments.
We will provide a domain specific language (though much more simple than that of CS165) along with tests and expected results. As well as a data/workload generator, which can be found here. The purpose of this is to allow you as much flexibility in your design and API as possible while still allowing for testing on our part. Every project will support six commands: put, get, range, delete, load, and print stats. Each command is explained in greater detail below.
The put command will insert a key-value pair into the LSM tree. Duplicates are not supported in the expected implementation so the repeated put of a key updates the value stored in the tree.
p [INT1] [INT2]
The 'p' indicates that this is a put command with key
INT1 and value
p 10 7 p 63 222 p 10 5
First the key 10 is added with value 7. Next, key 63 is added with value 222. The tree now holds two pairs. Finally, the key 10 is updated to have value 5. Note that the tree logically still holds only two pairs. These instructions include only puts therefore no output is expected.
The get command takes a single integer, the key, and returns the current value associated with that key.
The 'g' indicates that this is a get for key
INT1. The current value should be printed on a single line if the key exists and a blank line printed if the key is not in the tree.
p 10 7 p 63 222 g 10 g 15 p 15 5 g 15
Here we first put (key:value) 10:7, then 63:222. The next instruction is a get for key 10, so the system outputs 7, the current value of key 10. Next we try to get key 15 but it is not included in the tree yet so a blank line is generated in the output. Next we put 15:5. Finally, we get key 15 again. This time it outputs 5 as the key exists at this point in the instruction list.
The range command works in a similar way to get but looks for a sequence of keys rather than a single point.
r [INT1] [INT2]
The 'r' indicates that this is a range request for all the keys from
INT1 inclusive to
INT2 exclusive. If the range is completely empty than the output should be a blank line, otherwise the output should be a space delineated list of all the found pairs (in the format key:value); order is irrelevant.
p 10 7 p 13 2 p 17 99 p 12 22 r 10 12 r 10 15 r 14 17 r 0 100
10:7 10:7 12:22 13:2 10:7 12:22 13:2 17:99
In this example, we first put a number of keys and then collect the values with the range command. The first range prints only the pair 10:7. The pair associated with key 12 is not printed because ranges are half open:
[lower_key, higher_key). The next range doesn't include key 17 but captures the others. The range
[14,17) is empty so a blank line is output. Finally an oversized range captures all of the pair.
The delete command will remove a single key-value pair from the LSM tree.
The 'd' indicates that this is a delete command for key
INT1 and its associated value (whatever that may be). A delete of a non-existing key has no effect.
p 10 7 p 12 5 g 10 d 10 g 10 g 12
First the two pairs 10:7 and 12:5 are put into the tree. Next, we get key 10 which outputs the value 7. Key 10 is then deleted which generates no output. When we then try to get key 10 a second time we get a blank as the key has been deleted. Finally we get key 12 which is unaffected by the delete and therefore outputs the value 5.
The load command is included as a way to insert many values into the tree without the need to read and parse individual ascii lines.
The 'l' command code is used for load. This command takes a single string argument. The string is a relative or absolute path to a binary file of integers that should be loaded into the tree. Note that there is no guarantee that the new values being inserted are absent from the current tree or unrepeated in the file, only that all of the pairs are puts in the order of the file.
This command will load the key value pairs contained in
load_file.bin in the home directory of the user running the workload. The layout of the binary file is
There is nothing but key-value pairs in the file. The first 4 bytes are the first key, the next 4 bytes are the first value and so on. Thus, a file of 32768 bytes has 4096 key value pairs to be loaded (though there could be fewer than 4096 pairs once loaded due to existing keys in the tree or duplicated keys in the file).
The print stats command is a command that allows the caller to view some information about the current state of the tree. This is helpful for debugging as well as the final evaluation. Each implementation may vary somewhat but at a minimum the function must include: (1) the number of logical key value pairs in the tree
Total Pairs: [some count]; (2) the number of keys in each level of the tree
LVL1: [Some count], ... , LVLN [some count]; (3) a dump of the tree that includes the key, value, and which level it is on
The 's' command code is used for printing stats. This command takes no arguments and prints out the stats for the tree.
This command will print out at a minimum the data listed above. For a very small tree one output might look like:
Total Pairs: 12 LVL1: 3, LVL3: 11 45:56:L1 56:84:L1 91:45:L1 7:32:L3 19:73:L3 32:91:L3 45:64:L3 58:3:L3 61:10:L3 66:4:L3 85:15:L3 91:71:L3 95:87:L3 97:76:L3