Welcome to CS165: Data Systems for Fall 2024! We hope you all had a great summer. This class is
about state of the art data systems research and practice. It blends numerous topics on data-driven
algorithms, data structures, end-to-end system design along with an exciting semester-long system
project inspired both by open research projects and actual practices in industry right now.
Please take a careful look at our Syllabus and our Self-evaluation guide as well as Project 0.
If you have any questions
please reach out to Stratos at stratos@seas.harvard.edu
/*State-of-the-art data structures, algorithms and data system design*/
/*Research based, Problem based, Semi-flipped classroom,
OH & Labs daily*/
/*Additional sections provided online pre-recorded, Guest lectures from
industry*/
/*Two take-home midterms with
open books, laptops, and tablets (late Oct and Nov)*/
/*Large, open-ended, individual software project in C due at the end
of the semester*/
/*Project Midway Check-in: small deliverable to warm up - hard deadline in mid
October*/
/*Ideal background:
basic data structures, algorithms, and systems programming (C/C++)*/
/*Expected workload: a lot - do not combine with other
heavy courses*/
/*Grading: Project 60%, Class Participation 10%, Midterms 30% + bonus
opportunities*/
/*We count best of Midterm2,
and Midterm1+2*/
/*All info is on the class website (not Canvas), Class forum contains all
announcements/updates*/
/*CS165 helps
with software engineering job placement/interviews*/
/*CS165 helps with research initiation in in algorithms/storage/systems/computational
data science*/
/*For our extension school students: unfortunately due to COVID restrictions, and
unlike previous years, we cannot welcome on campus at all this semester extension school
students even if they are local students; all meetings, office hours, sections and exams
will take place online.*/
/*CS165 students have won the ACM SIGMOD undergraduate research competition four
years in a row in 2016, 2017, 2018, and 2019 */
Do: Read Syllabus; Check self-evaluation guide, semester project; Project 0
If you take the course: Register on Class forum,
& notes; Get starter code / tests
Success strategy: Participate actively in all classes; Go to Labs/OH weekly
What is this class about?
We are in the big data era. Every two days we create as much data as
much we created from the dawn of humanity up to 2003 [Eric Schmidt,
Google]. We are going through major transformations in businesses,
sciences, as well as everyday life - collecting and analyzing data, applying
machine learning across diverse scenarios, and constructing large and
complex data science pipelines change everything; the key common
ingredient in all those cases is storing and accessing large amounts of data.
In fact, this is typically the bottleneck in any complex data-driven process/
program because the way hardware has evolved, moving data is way more expensive than
doing
any kind of computation. Data systems provide the means to store and analyze a massive
amount of
data and thus they sit in the critical path of everything we do. A data system is
effectively a large
collection of data structures and algorithms that work in synchrony to solve complex
data access
requests. It is a $100B industry, growing 10% every year [Economist, "Data, data
everywhere"].
This course is a comprehensive introduction to modern data systems.
The primary focus is on trends that are shaping the
data management industry right now: column-store and hybrid storage systems,
NoSQL key-value stores, shared nothing architectures,
cache conscious algorithms, hardware/software co-design, main-memory systems,
adaptive indexing, stream processing, and scientific
data management.
We also study the history of data systems, traditional and seminal concepts and ideas
such as the relational
model, row-store database systems, optimization, indexing, concurrency control, recovery
and SQL.
In this way, we discuss
both how and why data systems evolved over the years, as well as how these concepts
apply today and
how data systems might
evolve in the future.
We focus on understanding concepts and trends rather than specific techniques that will
soon be outdated
- as such
the class relies on recent research material and on a semi-flipped class model
with a lot of hands-on interaction
in each class.
What is this class not about?
This class is not a traditional introduction on how we use a database system and how to
write SQL. Instead, this is a data
structures and algorithms class about data system design. You will learn how data
systems work at their core and how to design
new systems for emerging applications and hardware. By the way, if you know how systems
work, you also become better at using
them!
Why take this class?
Data systems research and the whole industry are going through a major and continuous
transition;
given that new data-driven
scenarios and applications continuously pop up, there is a continuous need to
redefine what is a good data system design. Every
single big data-driven company, startup, government or scientific organization is
limited by their data system choices and
continuously seeking new solutions tailored to their problems.
Given that there is no perfect data system, there is a constant
need for deep expertise on designing performant workload and hardware conscious data
structures, algorithms and complete
systems.
CS165 exposes students to the core internals of data systems making it possible to understand core trends in data structure, algorithm, and system design and to be one of the very few who know how to design and evaluate systems. In addition, due to the way the course is taught with a focus on interactive problem solving, open topics and the latest research results, this is also a great class for those who want to understand what CS research is all about and how to engage in doing research. So far seven undergraduate teams have made it to the finals of the ACM SIGMOD Undergraduate Research Competition with research projects that came out of the course. In 2016, we won first place with the work on Adaptive Denormalization; in 2017 we won first place with the work on Evolving Trees; in 2018 we won first place with the work on Splaying LSM-Trees; and in 2019 we won first place with a project that shows it is possible to efficiently transition between LSMtree and B-Trees.
The class meets twice a week: Tuesdays and Thursdays 09:45-11:00 am. Classes are designed to be discussion-based and slides are used mainly to drive discussions as opposed to delivering the material.
In this section we will cover the basic knowledge of C required for this project. We will discuss the main features of the language and will focus on pointer arithmetics a key challenge when programming in C.
This section introduces Project 0, a standalone programming exercise that has two goals. First, it will help the students to understand the coding effort and skills needed to carry on the full project, by implementing a hash table. Second, it will be an integral part that can be used as-is in the full project.
In this section we will introduce the semester project. We will discuss the scope of the whole project and introduce all milestones. We will then focus on the first software and design deliverable. We will also cover frequent pains faced throughout the project.
This section covers the introductory code and test generation for the starter code through docker environment.
Section Recording (starter code) Section Recording (docker walkthrough)
In this section we will discuss important development tools. We will talk about debugging tools including [c]gdb and valgrind, and the build tool Gnu make. We will do so by example. The example code is available in the git repository listed below.
Handout (system dev tools) Section Recording
Section Git repository
Automatic
variables in Make
Secondary
expansion in Makefiles
Last year's
notes on git, valgrind, gcc, and gdb
This section covers the how to memory map a file on disk and system calls related to it.
The section of this week is dedicated to memory hierarchy. Understanding memory hierarchy and how cache memory works is crucial for understanding how to build an efficient cache-aware data system. Hence, here, we will start from the basics of memory hierarchy, covering how caching works, what is an L3 and L2 shared cache, and what is an L1 private cache. We will discuss the differences between instruction and data caches and we will discuss how programs incur cache misses and how this affects performance.
After having a clear understanding of memory hierarchy, in this section, we will discuss techniques that allow us to build cache-conscious algorithms. We will discuss how to minimize cache misses and how to avoid branch mispredictions by removing branches altogether from our code.
In this Section we start by covering processes, virtual memory, and virtual address to physical address translation. After a brief overview of each of those topics, we discuss the TLB, what it's used for, and how to use the TLB efficiently in your project.
Section RecordingThis section describes how your data is actually stored onto disk.
This section describes how your manage your database catalog, i.e., the metadata.
In this section we will address performance optimizations and techniques to build high performance code in the context of the project. We will discuss performance monitoring tools (perf) which allow us to know exactly where does the execution time goes and help us understand whether our implementation is efficient, and where are any possible performance bottleneck.
In this section we cover techniques for parallel programming. We will talk about synchronization primitives, such as mutex locks and atomic operations, as well as design patterns used to parallelize database workloads. We discuss this in the context of both the class and the project.
In this section we will introduce the concept of ACID transactions and the manifestation of this concept throughout the evolution of database systems.
In this section we will cover locking and how it ensures consistency and isolation.
This section covers the principles of maintaining correctness and consistency in a database. We will discuss how atomicity and durability are ensured though logging.
This section gives a brief example of a midterm question and answer. It is meant to give you an idea of how to answer questions for midterms.
Question (2019) Solution 2019 (part 1) Solution 2019 (part 2)
This section extends students’ knowledge with a specialized type of data system — spatial database.
This section describes how you can estimate the performance of your database operators.
On select evenings the instructor, and DASlab PhDs and postdocs will hold meetings to discuss research! They will present their recent work on and connect it with the class material. Then, you will get the chance to talk with them about their research, open problems and be exposed to open research opportunities. Snacks and drinks will be provided. In addition, we hold meetings to discuss careers in industry and academia, grad school and anything else you may have in mind.
Regular office hours and lab schedules are posted to the homepage of this website.
Attendance at Labs and OH is optional but all students will benefit from regular discussions with the teaching staff.
OH: You are welcome to come to OH every day. The goal of OH is to provide any kind of feedback on the class material. You should come to OH to ask questions about past classes and quizzes. You should also come to OH to discuss the design of your project and to get feedback on your design documents. You are also welcome to come to OH for any other general question regarding classes, carriers in industry/academia, PhDs, etc.
Labs: The TFs will hold Labs every day. Labs can help with similar discussions as with OH but the main goal of Labs is to provide hands-on help for the project. So bring your laptop and your questions about specific project parts you need help with. Labs are the place to go when you have a persistent bug, when you need help with a specific tool for the project (e.g., for debugging or performance testing) or to get feedback about the quality of your coding.
Finding and fixing bugs can be very difficult and time consuming. As such, we want to make the time you spend in Labs is as useful as possible. We want to teach you the process of finding and fixing bugs, not just solve a bug for you. We expect that before coming to labs you have spend several hours "fighting" a bug. Then if you cannot make any more progress on your own, you should come by and by then you will have enough experience to really understand the solution and the process. Do not feel like something wrong is happening if you find yourself stack with a bug for a day or two. This is normal and part of the learning process. It will and should happen several times through the semester. Before coming to discuss a bug you should perform/answer the following:
The class is geared towards engaging creative thinking and problem solving to give students a feeling of how computer science research takes place. Many of our students in the past have successfully engaged in research projects with DASlab and published research papers. So far seven students/teams have made it to the finals of the ACM SIGMOD Undergrad Research Competition. In 2016 we won first place with the work on Adaptive Denormalization , in 2017 we won first place with the work on Evolving Trees, ,in 2018 we won first place with the work on Splaying LSM-Trees, in 2019 we won first place with the work on LSM-Trees and B-Trees, in 2020 we won first place with the work on Accurate Latency Prediction for Key-Value Storage Engines, in 2021 we won third place with the work on Learning Algorithms for Automatic Data Structure Design, and in 2022 we won first place with the work on Workload-Adaptive Filtering in Storage Engines. The project and the classes will give you plenty of triggers on new problems to work on.
Talk to the instructor at any point if you are excited about pursuing independent research during or after the course.