CS265: Big Data Systems

Welcome to CS265 Big Data Systems. This is a research-oriented class about the fundamental principles behind big data systems for diverse data science applications. This year the focus will be specifically on SQL, NoSQL, Neural Networks, LLMs, and Image AI. The paper/class schedule shown below is from last year - the new schedule will be available in mid-January. The Syllabus is available here. The paper schedule is tentative and there will likely be some small updates.

Note: Action steps: 1) Read the syllabus carefully, 2) Check the self-evaluation guide and do Test 0 if you have not taken CS165, 3) Register for Paper Presentations (week 4), 4) Start submitting your paper reviews (week 4).

Overview

Big data is everywhere. A fundamental goal across numerous modern businesses and sciences is to be able to exploit as many machines as possible, to consume as much information as possible and as fast as possible. The big challenge is "how to turn data into useful knowledge". This is a moving target as both the underlying hardware and our ability to collect data evolve. In this class, we will discuss how to design data systems and algorithms for key data-driven areas, including relational systems, distributed systems, graph systems, noSQL, newSQL, machine learning and neural networks. We will see how they all rely on the same set of very basic concepts and we will learn how to synthesize efficient solutions for any problem across these areas using those basic concepts.

In each class we will read one recent research paper and the instructor will lead a discussion around the technical aspects of the work but crucially also on how the fundamental concepts of this work connect with other fields, applications and problems. Students will write two reviews (summary, critique, ideas) per week on the assigned papers for discussion. Each student is expected to present at least once during the semester: each class starts with a short presentation by a student on the paper of the day. In addition, each student will participate in a semester-long project (systems or research project). Research projects are integrated with the work at DASlab and are meant to lead to a research publication within the next 1-2 years.

Central Learning Outcomes: Understanding fundamental concepts in data storage and access; Learning to read and quickly understand research papers; Learning to prepare and deliver clear presentations on complex topics; Getting a feeling for what it means to do research.

Term

Spring 2025

Classes

Tuesday/Thursday 9:45-11:00 AM @ SEC 2.118

Join Class Live

Zoom Link

Level

Graduate (open to undergraduate students)

Office Hours and Labs

Stratos OH: Tue/Thu 11:00am-11:30am (outside classroom), Fri noon-12:30pm (Zoom link)

TF labs in-person: Tuesday 1-2pm, Thursday 3:30-4:30pm (SEC 4.435)

TF labs remote: Wednesday 10-11am, Saturday 10-11am (Zoom link)

Class Philosophy

CS265 has unlimited office hours and unlimited late days for project deliverables. It relies on the latest research papers instead of a standard textbook, lectures are based on interaction and discussion instead of just "lecturing", and many of the quizzes and problem sets are actually open research problems. Most of all, it is fun! The instructor and the TFs are here to help you every day and at all times throughout the semester. You may request as many meetings as you like and as much help as you want.

The class is also geared towards engaging creative thinking and problem solving to give students a feeling of how computer science research takes place. Many of our students in the past have successfully engaged in research projects with DASlab and published research papers.

Lectures

The class meets twice a week.

1. Introduction 6 lectures
Class 1: Introduction to Storage January 28
We will discuss the logistics and goals of the course. We will start by introducing the idea that all data-driven algorithms, models and systems rely on a small set of fundamental design concepts. Understanding those concepts in detail, and how to use them, allows one to optimize any data-driven process from individual algorithms, to SQL, NoSQL, Big Data systems, Neural Networks, Graph systems, and many more.
Download Slides
Preparation Readings:

Get familiar with the very basics of traditional database architectures: Architecture of a Database System. By J. Hellerstein, M. Stonebraker and J. Hamilton. Foundations and Trends in Databases, 2007

Get familiar with very basics of modern database architectures: The Design and Implementation of Modern Column-store Database Systems. By D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden. Foundations and Trends in Databases, 2013

Get familiar with the very basics of modern large scale systems: Massively Parallel Databases and MapReduce Systems. By Shivnath Babu and Herodotos Herodotou. Foundations and Trends in Databases, 2013
Class 2: Deriving Design Space of Storage January 30
We will discuss how to discover the fundamental design principles in major storage schemes. We will focus on modern write-optimized NoSQL systems and also show how to improve modern system designs by orders of magnitude by understanding their design space. We will then present the unified design space of key-value data structures, which reveals many more structures than have been invented in the past five decades.
Download Slides
Preparation Readings:

Get familiar with the very basics of traditional database architectures: Architecture of a Database System. By J. Hellerstein, M. Stonebraker and J. Hamilton. Foundations and Trends in Databases, 2007

Get familiar with very basics of modern database architectures: The Design and Implementation of Modern Column-store Database Systems. By D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden. Foundations and Trends in Databases, 2013

Get familiar with the very basics of modern large scale systems: Massively Parallel Databases and MapReduce Systems. By Shivnath Babu and Herodotos Herodotou. Foundations and Trends in Databases, 2013

Readings:

The Periodic Table of Data Structures. Stratos Idreos, Kostas Zoumpatianos, Manos Athanassoulis, Niv Dayan, Brian Hentschel, Michael S. Kester, Demi Guo, Lukas Maas, Wilson Qin, Abdul Wasay, Yiyou Sun. IEEE Data Engineering Bull. Sep, 2018
Class 3: NoSQL Advances Using the Design Space February 4
We will show how the methodology of discovering and investigating the fundamental design principles of data layouts leads not only to crafting the possible designs but also enables us to see fundamentally new results. We will show specific examples from the NoSQL key-value store domain with designs that tune the Bloom filters and merge policy of LSM-trees to optimize for a given workload, resulting in key-value store designs that are orders of magnitude faster than traditional designs.
Download Slides
Readings:

The Periodic Table of Data Structures. Stratos Idreos, Kostas Zoumpatianos, Manos Athanassoulis, Niv Dayan, Brian Hentschel, Michael S. Kester, Demi Guo, Lukas Maas, Wilson Qin, Abdul Wasay, Yiyou Sun. IEEE Data Engineering Bull. Sep, 2018

Monkey: Optimal Navigable Key-Value Store. Niv Dayan, Manos Athanassoulis, Stratos Idreos. ACM SIGMOD International Conference on Data Management, 2017

Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging. Niv Dayan and Stratos Idreos. ACM SIGMOD International Conference on Data Management, 2018.

Optimal Bloom Filters and Adaptive Merging for LSM-Trees. Niv Dayan, Manos Athanassoulis, Stratos Idreos. ACM Transactions on Database Systems, 2018

The Log-Structured Merge-Bush & the Wacky Continuum. Niv Dayan and Stratos Idreos. ACM SIGMOD International Conference on Data Management, 2019.
Class 4: The Periodic Table of Data Structures February 6
We will present in detail the complete design space of key-value data structures and show examples on how specific designs can be synthesized. We will then summarize everything by presenting the periodic table of data structures and start the discussion on how we can approach the challenge of automatically synthesizing the cost of arbitrary data structure designs without implementing them.
Download Slides
Readings:

Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe? Michael Kester, Manos Athanassoulis, Stratos Idreos. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2017
Class 5: Learned Cost Models February 11
We will discuss in detail how to perform accurate cost synthesis of arbitrary data structure designs using the concept of learned cost models to capture data, hardware, algorithmic and engineering properties that are otherwise very hard to model. We will show how to utilize learned cost models for accurate cost synthesis, as well as several open problems that arise as new hardware, such as non-volatile memory, becomes available.
Download Slides
Readings:

The Data Calculator: Data Structure Design and Cost Synthesis From First Principles, and Learned Cost Models. Stratos Idreos, Konstantinos Zoumpatianos, Brian Hentschel, Michael Kester, Demi Guo. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2018
Class 6: Data Structure Design Continuums February 13
We will show how the design space can reveal design continuums in between well-known data structures that are traditionally perceived as fundamentally different. In particular, we will show how LSM-trees and B-trees can be viewed as effectively the same data structure, opening exciting opportunities for a new class of self-designing key-value stores that blend performance properties from both, using machine learning, code generation and traditional modeling.
Download Slides
Readings:

Design Continuums and the Path Toward Self-Designing Key-Value Stores that Know and Learn. Stratos Idreos, Niv Dayan, Wilson Qin, Mali Akmanalp, Sophie Hilgard, Andrew Ross, James Lennon, Varun Jain, Harshita Gupta, David Li, and Zichen Zhu. Proceedings of CIDR Conference on Innovative Data Systems Research, 2019.
Class 7: Synthesizing Full Systems (NoSQL) Part 1 February 18

Download Slides

The Data Calculator: Data Structure Design and Cost Synthesis From First Principles, and Learned Cost Models. Stratos Idreos, Konstantinos Zoumpatianos, Brian Hentschel, Michael Kester, Demi Guo. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2018.

Design Continuums and the Path Toward Self-Designing Key-Value Stores that Know and Learn. Stratos Idreos, Niv Dayan, Wilson Qin, Mali Akmanalp, Sophie Hilgard, Andrew Ross, James Lennon, Varun Jain, Harshita Gupta, David Li, and Zichen Zhu. Proceedings of CIDR Conference on Innovative Data Systems Research, 2019.

Limousine: Blending Learned and Classical Indexes to Self-Design Larger-than-Memory Cloud Storage Engines. Subarna Chatterjee, Mark F. Pekala, Lev Kruglyak, and Stratos Idreos. In Proceedings of the ACM Management of Data 2, 1, Article 47 (February 2024), (SIGMOD), 2024.

Cosine: A Cloud-Cost Optimized Self-Designing Key-Value Storage Engine. Subarna Chatterjee, Meena Jagadeesan, Wilson Qin, and Stratos Idreos. In Proceedings of the Very Large Databases Endowment, (PVLDB), 2022.

The Image Calculator: 10x Faster Image-AI Inference by Replacing JPEG with Self-designing Storage Format. Utku Sirin and Stratos Idreos. In Proceedings of the ACM Management of Data 2, 1, Article 52 (February 2024), (SIGMOD), 2024.

μ-TWO: 3× Faster Multi-Model Training with Orchestration and Memory Optimization Storage, Scheduling, and Networking. Sanket Purandare, Abdul Wasay, Stratos Idreos, Animesh Jain. MLSys 2023, The Annual Conference on Machine Learning and Systems.
Class 8: Synthesizing Full Systems (NoSQL) Part 2 February 20

Download Slides

Class 9: Synthesizing Full Systems (NoSQL) Part 3 February 25

Download Slides

Class 10: Hardware Conscious Neural Network Training Part 1 February 27

We will learn about the basics of distributed neural network training from a systems point of view to understand the factors that affect performance, utilization, memory usage and communication. We will then dive deep into some performance bottlenecks, expose interesting research problems and discuss potential directions.

Class 11: Hardware Conscious Neural Network Training Part 2 March 4

We will learn about the basics of distributed neural network training from a systems point of view to understand the factors that affect performance, utilization, memory usage and communication. We will then dive deep into some performance bottlenecks, expose interesting research problems and discuss potential directions.

Class 12: Image Storage for AI Part 1 March 6

We will learn the basics of image storage and discuss its limitations for today’s artificial intelligence (AI) applications. We will then move into our image storage for AI project and discuss the research projects we propose.

Class 13: Image Storage for AI Part 2 March 11

We will learn the basics of image storage and discuss its limitations for today’s artificial intelligence (AI) applications. We will then move into our image storage for AI project and discuss the research projects we propose.

Class 14: Image Storage for AI Part 3 March 13

2. Discussion 8 lectures

No Class: University Spring Recess March 18

No Class: University Spring Recess March 20

D1: Paper 1 March 25

FASTER: A Concurrent Key-Value Store with In-Place Updates.
Badrish Chandramouli, Guna Prasaad, Donald Kossmann, Justin Levandoski, James Hunter, Mike Barnett

2018 ACM SIGMOD International Conference on Management of Data (SIGMOD '18), Houston, TX, USA

D2: Paper 2 March 27

The FastLanes Compression Layout: Decoding >100 Billion Integers per Second with Scalar Code.
Azim Afroozeh, Peter Boncz

In Proceedings of the VLDB Endowment, 2023

D3: Paper 3 April 1

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel.
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, Shen Li

Proc. VLDB Endow. 16(12): 3848-3860 (2023)

D4: Paper 4 April 3

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro

arXiv, 2020

D5: Paper 5 April 8

Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models.
Wenqi Jiang, Marco Zeller, Roger Waleffe, Torsten Hoefler, Gustavo Alonso

Proc. VLDB Endow. 18(1): 42-52 (2024)

D6: Paper 6 April 10

AquaPipe: A Quality-Aware Pipeline for Knowledge Retrieval and Large Language Models.
Runjie Yu, Weizhou Huang, Shuhan Bai, Jian Zhou, and Fei Wu

Proc. ACM Manag. Data 3, 1, Article 11 (February 2025), 26 pages.

D7: Paper 7 April 15

Crossbow: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers.
Alexandros Koliousis, Pijika Watcharapichat, Matthias Weidlich, Luo Mai, Paolo Costa, Peter R. Pietzuch

Proc. VLDB Endow. 12(11): 1399-1413 (2019)

D8: Paper 8 April 17

Nautilus: An Optimized System for Deep Transfer Learning over Evolving Training Datasets.
Supun Nakandala, Arun Kumar

SIGMOD Conference 2022: 506-520

No Class (Labs) April 22

Labs only.

No Class (Labs) April 24

Labs only.

Last class: Research project presentation April 29

Research project presentations by students.

FAQ

What is a Data System?

Data systems are literally everywhere. We use them directly or indirectly every day, all day long, for numerous basic or not so basic tasks, from buying coffee to booking airplane tickets. They provide the backbone of all modern businesses to manage their data and of course they provide the backbone of online businesses and environments such as social networks and search engines. They are also used increasingly in science as data analytics becomes more and more the fundamental barrier in generating knowledge.

What is this class not about?

This class is not a traditional introduction on how we use a database system and how to write SQL. Instead, this is a systems class about data system design. You will learn how big data systems work at their core and how to design new systems for emerging applications and hardware. By the way, if you know how systems work, you also become better at using them!

Why take this class?

Data is everywhere. Every year we create even more data. As it stands, every two days we create as much data as we created from the dawn of humanity up to 2003 [Eric Schmidt, Google]. Sciences, businesses, and everyday life are substantially affected. Data systems are in the middle of all this. Data systems are how we store and access data, i.e., they are the backbone of any data-driven application. It is a $100B industry, growing 10% every year [Economist, “Data, data everywhere”].

At the same time data systems research and the whole industry are going through a major and continuous transition; given that new data-driven scenarios and applications continuously pop up, there is a continuous need to redefine what is a good data system design in such a dynamic environment.

CS265 exposes students to the core internals of data systems making it possible to understand core trends in system design and to be one of the few who know how to design and evaluate systems. In addition, due to the way the course is taught (focus on interactive problem solving, open topics and the latest research results) this is also a great class for those who want to understand what CS research is all about and how to engage in doing research.

What is the expected learning outcome?

Learn state-of-the-art research and industry trends in big data systems.
Understand the tradeoffs in designing and implementing modern big data systems.
Be able to make design decisions in big data driven scenarios.
Understand the fundamental principles that govern all systems and how these apply across diverse areas: SQL, NoSQL, Neural Networks, Blockchain, Statistics, Data Science, Vision.
Develop basic research skills: reading, writing and understanding research papers.
Deepen C programming, debugging, and performance profiling skills.

Efficient data analytics and system design is all about how we store and access the data. In this class, you are going to see how the same concepts appear again and again in numerous data-driven scenarios from NoSQL to neural networks.

How does 265 compare to 165?

If you took and liked CS165, you will like CS265 as well. From a material point of view CS265 moves on to consider additional topics as a continuation of CS165 such as distributed processing, transaction processing, graph processing, machine learning and more. In terms of the way the class is taught, it is even more interactive, and even more research oriented. Semester projects are actually on open research problems with the potential to lead to a publication and every class is focused on a single research paper, and understanding it in detail.

How much work is it?

You may have heard stories about CS165 and be wondering whether CS265 is going to be equally hard, or you may have taken CS165 and be wondering whether this is going to be a similar amount of work. CS165 and CS265 are different styles of class. While CS165 is much more focused on implementation leading to a full system prototype, CS265 is more focused on ideas and design. In other words, you may have written 5-10K lines of code (some even more!) for CS165, but in CS265 you are more likely to write small amounts of code and mostly play with alternative ways to design a specific functionality, structure or algorithm to highlight the effect of different choices and to find new ways to solve a specific problem.

Who can take this class?

IF you have taken CS165, CS161, CS261 GOTO next question; ELSE see below;

Background: Naturally, the more background you have the smoother your experience in 265 will be. Prior knowledge of C programming and systems programming, as well as a good understanding of computer architecture and in particular the memory hierarchy (cache memories), is very important for this class. Courses providing systems background (like CS50 and in particular CS61 or equivalent) are essential. Good hacking, algorithm design, and data structures skills are also required.

If you are a graduate student and have taken a mix of systems (database, operating and distributed systems) classes in the past, then you will be OK and we will provide enough background so you can follow. CS265 does satisfy the systems requirement towards a PhD.

If you are a senior in college and this is your last chance to take this class: if you have taken CS61 but not CS161 or CS165, then talk to the instructor to evaluate how fit you are for the class. If you have not taken CS61 but do have significant systems programming experience you may still qualify.

In all other cases, it is a better idea to take CS165 first.

How can I do great in 265?

Just utilize all resources provided. Show up in class to participate in interactive sessions. There are also daily office hours and labs; show up as often as possible so we can help with anything you need! When you find yourself stuck with the project either with a design decision or just a bug, it is normal to struggle for a while — it is part of the learning process — but after some time grab your laptop and come by!

What can I do to prepare?

Especially if you have not taken CS165 it is a good idea to spend some time preparing before the semester starts and during the early weeks of the semester even if you consider yourself an expert systems student. The best approach is to browse some fundamental readings in data systems architectures. We propose that you take a look at the following texts from the CS165 readings:

Get familiar with the very basics of traditional database architectures: Architecture of a Database System. By J. Hellerstein, M. Stonebraker and J. Hamilton. Foundations and Trends in Databases, 2007
Get familiar with very basics of modern database architectures: The Design and Implementation of Modern Column-store Database Systems. By D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden. Foundations and Trends in Databases, 2013
Get familiar with the very basics of modern large scale systems: Massively Parallel Databases and MapReduce Systems. By Shivnath Babu and Herodotos Herodotou. Foundations and Trends in Databases, 2013

Test 0: We provide a Test 0 that is designed to 1) help you get an idea about how fit you are for the class and 2) bootstrap your C coding skills. Essentially Test 0 consists of an independent data structure design and implementation in C that will allow you to practice basic system design, coding and debugging skills. In addition, several fundamental section videos are posted on the class website about system coding and profiling to help you with that.

Plagiarism

You are responsible for understanding Harvard and Harvard Extension School policies on academic integrity and how to use sources responsibly. Not knowing the rules, misunderstanding the rules, running out of time, submitting "the wrong draft", or being overwhelmed with multiple demands are not acceptable excuses. There are no excuses for failure to uphold academic integrity. To support your learning about academic citation rules, please visit the Harvard College Guidelines to Avoid Plagiarism, where you'll find links to the Harvard Guide to Using Sources and two, free, online 15-minute tutorials to test your knowledge of academic citation policy. The tutorials are anonymous open-learning tools.

Accessibility

Harvard and the Extension School are committed to providing an accessible academic community. The Disability Services Office offers a variety of accommodations and services to students with documented disabilities. Please visit http://www.extension.harvard.edu/resources-policies/resources/disability-services-accessibility for more information and do not hesitate to contact Prof. Idreos directly, by email, with any questions or concerns you might have.

Sections

Sections are offered only online as pre-recorded videos. We will be adding section videos through the semester based on the needs of the students. If you have questions for any one of the sections come by one of our daily OH and Labs.

Background sections

This is a series of sections from CS165 that are useful for students who do not have prior experience with, or need to refresh their knowledge of, C and the development tools needed for C programming.

Introduction in C

In this section we will cover the basic knowledge of C required for this project. We will discuss the main features of the language and will focus on pointer arithmetic, a key challenge when programming in C.
Section Recording
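
As a small illustration of the pointer arithmetic mentioned above, here is a minimal sketch in C. The function names (sum_ints, stride_bytes, elements_between) are hypothetical and are not taken from the section material; they only show that adding 1 to an int pointer moves it by one element, not by one byte.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n ints by walking a pointer instead of indexing.
   Incrementing p advances it by sizeof(int) bytes, i.e., one element. */
long sum_ints(const int *data, size_t n) {
    long total = 0;
    for (const int *p = data; p != data + n; ++p) {
        total += *p;
    }
    return total;
}

/* Distance in bytes between two adjacent elements of an int array.
   Casting to char * makes the subtraction count bytes. */
size_t stride_bytes(const int *data) {
    return (size_t)((const char *)(data + 1) - (const char *)data);
}

/* Number of elements between two pointers into the same array.
   Subtracting int pointers counts elements, not bytes. */
ptrdiff_t elements_between(const int *first, const int *last) {
    return last - first;
}
```

For an array int a[4], stride_bytes(a) equals sizeof(int), while elements_between(a, a + 3) is 3: the same two addresses give different numbers depending on the pointer type used for the subtraction.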

Test 0 (warm-up exercise)

This section introduces Test 0 (warm-up exercise), a standalone programming exercise that has two goals. First, by implementing a hash table, it will help students understand the coding effort and skills needed to carry out the full project. Second, the hash table will be an integral part that can be used as-is in the full project.
Section Recording
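
To give a feeling for the kind of code involved, here is a minimal sketch of a hash table in C. It is not the Test 0 specification: the fixed capacity, the int32_t key and value types, linear probing, the absence of deletion, and all names (hash_table, ht_put, ht_get, ...) are assumptions made only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HT_BITS 6                    /* hypothetical: 2^6 = 64 slots */
#define HT_CAPACITY (1u << HT_BITS)

typedef struct {
    int32_t key;
    int32_t value;
    bool used;
} ht_slot;

typedef struct {
    ht_slot slots[HT_CAPACITY];
    size_t count;                    /* number of distinct keys stored */
} hash_table;

/* Multiplicative hashing: keep the top HT_BITS bits of the product. */
static size_t ht_hash(int32_t key) {
    uint32_t h = (uint32_t)key * 2654435761u;
    return (size_t)(h >> (32 - HT_BITS));
}

void ht_init(hash_table *t) {
    for (size_t i = 0; i < HT_CAPACITY; ++i) {
        t->slots[i].used = false;
    }
    t->count = 0;
}

/* Insert a key or overwrite its value; returns false if the table is full. */
bool ht_put(hash_table *t, int32_t key, int32_t value) {
    size_t i = ht_hash(key);
    for (size_t probes = 0; probes < HT_CAPACITY; ++probes) {
        ht_slot *s = &t->slots[i];
        if (!s->used) {
            s->key = key;
            s->value = value;
            s->used = true;
            t->count++;
            return true;
        }
        if (s->key == key) {
            s->value = value;
            return true;
        }
        i = (i + 1) & (HT_CAPACITY - 1);   /* linear probing with wrap-around */
    }
    return false;
}

/* Look up a key; on success stores the value in *out and returns true. */
bool ht_get(const hash_table *t, int32_t key, int32_t *out) {
    size_t i = ht_hash(key);
    for (size_t probes = 0; probes < HT_CAPACITY; ++probes) {
        const ht_slot *s = &t->slots[i];
        if (!s->used) {
            return false;            /* empty slot ends the probe sequence */
        }
        if (s->key == key) {
            *out = s->value;
            return true;
        }
        i = (i + 1) & (HT_CAPACITY - 1);
    }
    return false;
}
```

Because this sketch never deletes entries, an empty slot is enough to end a lookup. A real design would also have to decide how to grow the table and how to handle deletions.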

Memory Hierarchy and Caches

The section of this week is dedicated to memory hierarchy. Understanding memory hierarchy and how cache memory works is crucial for understanding how to build an efficient cache-aware data system. Hence, here, we will start from the basics of memory hierarchy, covering how caching works, what is an L3 and L2 shared cache, and what is an L1 private cache. We will discuss the differences between instruction and data caches and we will discuss how programs incur cache misses and how this affects performance.

Slides Code Section Recording
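
The effect of access patterns on cache misses can be sketched with two traversals of the same matrix. This is an illustrative example, not the section's code; the function names are hypothetical and the matrix is assumed to be stored in row-major order as a flat int array.

```c
#include <assert.h>
#include <stddef.h>

/* Row-by-row traversal: consecutive accesses touch adjacent addresses,
   so each cache line that is fetched tends to be fully used. */
long sum_row_order(const int *m, size_t rows, size_t cols) {
    long total = 0;
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            total += m[r * cols + c];
        }
    }
    return total;
}

/* Column-by-column traversal of the same row-major array: consecutive
   accesses are cols ints apart, so for large matrices each access may
   land on a different cache line. */
long sum_column_order(const int *m, size_t rows, size_t cols) {
    long total = 0;
    for (size_t c = 0; c < cols; ++c) {
        for (size_t r = 0; r < rows; ++r) {
            total += m[r * cols + c];
        }
    }
    return total;
}
```

Both functions return the same sum; only the order of memory accesses differs. How large the performance gap is depends on the matrix size, the cache sizes and the hardware prefetcher, so it is best measured rather than assumed.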

Development Tools

In this section we will discuss important development tools. We will talk about debugging tools including [c]gdb and valgrind, and the build tool GNU make. We will do so by example. The example code is available in the git repository listed below.

Last year's notes on git, valgrind, gcc, and gdb Section Recording
Section Git repository

Editor tutorials
A guided tour of Emacs
Learn Vim Progressively
Getting started with Sublime Text 3

Navigating Ctags
Ctags with Vim
Ctags with Emacs
Setting up Ctags in Sublime Text

Additional resources

Automatic variables in Make
Secondary expansion in Makefiles
Last year's notes on git, valgrind, gcc, and gdb

Cache-Conscious Design

After having a clear understanding of memory hierarchy, in this section, we will discuss techniques that allow us to build cache-conscious algorithms. We will discuss how to minimize cache misses and how to avoid branch mispredictions by removing branches altogether from our code.

Slides Code Section Recording
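
As a sketch of what removing branches can look like, the following C functions select the positions of values below a threshold, once with an if and once without. The names are hypothetical and this is not the section's code; it assumes the output buffer has room for n entries.

```c
#include <assert.h>
#include <stddef.h>

/* Branching version: the if is hard to predict when about half of the
   values qualify. Returns the number of positions written to out. */
size_t select_less_branching(const int *data, size_t n, int threshold,
                             size_t *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; ++i) {
        if (data[i] < threshold) {
            out[k] = i;
            k++;
        }
    }
    return k;
}

/* Branch-free (predicated) version: always write the position, and
   advance the output cursor by the 0/1 result of the comparison.
   A non-qualifying position is overwritten by the next iteration. */
size_t select_less_branchfree(const int *data, size_t n, int threshold,
                              size_t *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; ++i) {
        out[k] = i;
        k += (size_t)(data[i] < threshold);
    }
    return k;
}
```

The branch-free version does more writes but has no data-dependent control flow in the loop body. Compilers may transform either form, so whether it is actually faster should be checked by profiling on the target machine.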

System Performance (Profiling)

In this section we will address performance optimizations and techniques to build high-performance code in the context of the project. We will discuss performance monitoring tools (perf), which allow us to know exactly where the execution time goes and help us understand whether our implementation is efficient and where any performance bottlenecks are.

Handout Section Recording
Project sections

This is a series of sections to help you kick off your class projects.

Introduction to the ML Systems Project

In this section we will introduce the basic knowledge needed for the ML Systems project, i.e., how a neural network is trained and why we need activation checkpointing. We will also show you the starter codebase so you can kick off your own experiments.
Section Recording Section Slides

Introduction to the NoSQL Systems Project

In this section we will introduce the NoSQL Systems project and the example experimental designs. We will also show you the suggested toolchains with which you can kick off your own experiments.
Section Recording Section Slides

Class Structure

Interaction in Every Class

While the instructor will do a few lectures through the semester, the class is going to be primarily discussion based. Think of this as an extended brainstorming session, a round table discussion about a specific problem in each class. The goal is to create the maximum possible interaction.

Our discussion will aim at bringing up design trends and tradeoffs, as well as algorithmic issues. Another significant part of our discussions will focus on examining open problems and highlighting opportunities for innovation.

At the very beginning of the semester the instructor will do 4-5 lectures to provide the necessary background. After that, each class will be based on a student presentation about a recent research paper which will work as a trigger for the day’s brainstorming. Depending on the needs of the class, the instructor will do additional lectures during class time or during our extra research sessions.

Office hours and labs

Interaction does not stop at lecture time. CS265 is designed to maximize interaction as we truly believe this is the best way to learn; we offer daily office hours and labs.

Starting Week 1, Prof. Idreos will hold office hours during the week and additional OH will be offered periodically during the weekend. Labs are offered by the TFs. Rooms and slots: TBA. The goal of labs is to get hands-on help for the projects (coding). Bring your laptop and your questions about specific project parts you need help with. Labs are the place to go when you have a persistent bug, when you need help with a specific tool for the project (e.g., for debugging or performance testing) or to get feedback about the quality of your coding.

Attendance

Based on the philosophy of the course, attendance in lectures, labs and office hours is optional. The best way to learn, though, is through discussion and interaction with the instructor and the TFs. Our classes are not about "lecturing" – they are about interaction. We hope to see you there!

Class Recording

All classes and interactive sessions in class will be recorded and will be available online. So even if you miss a class it will be easy to catch up, and you can also use these recordings to revisit specific material throughout the semester (e.g., to prepare for midterms).

Sections

Another component of the course is sections. Sections are used to deliver material about the class, i.e., to go more deeply into some of the concepts discussed in class, to do additional quizzes, or to deliver background material that is needed to follow next week’s class or for the project. There will be no actual section meeting. Instead, all sections will be recorded by the teaching staff and videos will be posted online. The material posted will be tailored to present a step-by-step guide for any of the topics presented to make it easy to follow everything without having to be physically present in an actual section. However, if there are still questions about the material presented in sections, you will be able to ask those questions during the daily office hours and labs.

Research Sessions

Throughout the semester, on select days the instructor and DASlab PhDs and postdocs will discuss research! First, DASlab researchers will present their recent work on data systems research and connect it with the material you are learning in class. Then, you will get the chance to talk with them about their research and open problems, and be exposed to open research opportunities. Snacks and drinks will be provided.

Weekly reviews

Each student will provide two paper reviews per week. This prepares you for the discussion in class. Each review should have text for at least the following 9 points:

what is the problem?
why is it important?
why is it hard?
why existing solutions do not work?
what is the core intuition for the solution?
does the paper prove its claims?
what is the setup of analysis/experiments? is it sufficient?
are there any gaps in the logic/proof?
describe at least one possible next step.

Reviews should be no more than two pages long. PDF. Single column. 8pt font. 1 inch margins. Submission will be through Canvas. The deadline for each paper review is the starting time of the respective class. This is a hard deadline. The first four reviews will not be graded; we will use them only to provide feedback on the quality of the review and the grade that this review would get. Every second week we will have a special OH meeting to review the student reviews (we will do such meetings more frequently in the beginning of the semester to make everyone more familiar with the process early on).

Presentations

Each student will do at least one paper presentation during the semester. Presentations should follow similar guidelines as the guidelines for reviews. There should be 1-2 slides for each one of the nine core points in the review guidelines. In addition, there should be detailed slides that describe the core idea of the paper.

Your slides should not be multiple sheets of bullet lists; in fact, try to avoid bullet lists altogether. Your slides should follow the generic formatting you will see in the first four lectures, that is: make slides as simple as possible; avoid text unless absolutely needed; use no full phrases unless you need to give an exact definition of something; use figures and visual examples; and follow "one slide, one message", i.e., each slide should have a single goal that you should be able to describe within a single phrase.

Your slides should be reviewed by the instructor at least 24 hours before the class you are presenting. The final deck of slides should be available 30 minutes before class so we can upload it online. You are welcome to join OH for help while you prepare your slides!

Textbook

The class is about state-of-the-art data system design. There is no textbook for that. Thus, we use recent research papers and surveys which will be posted on the course website, which you will have access to through the Harvard network.

Feedback on progress

We provide feedback continuously. The main thing that you will need feedback on is your semester project and the paper reviews. The way to get feedback is to show up to our daily office hours and labs and share your design decisions, code, and test results with the staff. In this way, you will get hands-on help and feedback.

Specifically for reviews, we will hold a special session every second week to "review the reviews".

Guest Lectures

Every semester we arrange a few guest lectures by leaders in data system design from industry and academia. Past guest lecturers in our classes include: Guy Lohman from IBM Research, Erietta Liarou from EPFL Lausanne, Alkis Simitsis and Georgia Koutrika from HP Labs, Nikita Shamgunov from MemSQL, Laura Haas from IBM Research, Nga Tran from Vertica, Jignesh Patel from University of Wisconsin, Johannes Gehrke from Microsoft, Marcin Zukowski from Snowflake, Richard Hipp from SQLite, Ryan Johnson from LogicBlox.

You will get the opportunity to both attend a guest lecture and to actively participate in discussions with our guest speakers.

Logistics

Grading

Class discussions: 20%
Paper reviews: 15%
Paper presentation: 15%
Semester project: 35%
Midway check-in: 15%

Feedback

We welcome feedback and ideas about the course at any point during the semester. Just come and chat with us during office hours! Tell us how you are keeping up and how we can make it easier for you.

No Laptop/Phone Policy

CS265 is based on interaction. We want students actively participating in class and interactive sessions, asking and answering questions to maximize learning. In each class, we will bring a printed copy of the slides for each student so you can follow along and keep notes on paper. So you do not need your laptop or phone for notes or for looking up the slides online. In fact, recent studies show that even if you only use a laptop for note taking, it can have a negative impact on how well you understand the material in class.

[The Pen Is Mightier Than the Keyboard: Advantages of Longhand Over Laptop Note Taking. Pam A. Mueller and Daniel M. Oppenheimer. Psychological Science. 2014, Vol. 25(6) 1159–1168]

There are cases where having a phone or laptop during class is necessary such as when you expect an important call or message or when you need the laptop to better follow the slides due to any issues with your eyes or ears. Just let the instructor know and all such cases will be granted permission to use any tools necessary.

Online Discussions

We will use ED Discussions as a forum for online discussions. The links are posted on the class website. You are welcome to post any question that might help you understand the material better or help you with the project. Anonymous posting (to other students) will be enabled so that students feel more comfortable posting.

BASIC RULES: We only have a few basic rules so we can keep the forum functional and useful for the students as well as manageable for the staff.

We ask that you first search the forum well before posting a question so that we do not have duplicate entries.
Please make sure to stay on top of all staff posts (especially those that are pinned). Anything we post on the forum we consider “known”.
Do not use the forum to post code or ask for help with debugging. While it can work in some cases remote debugging is a pain and takes a lot of time. We have labs every day. Bring your laptop and we will help you on-site, or join remotely and we will help you via a shared screen mode.
Do not use the forum for anything that is not about a technical question or a question about class logistics. If you want to discuss any concerns about your progress, fit for the class, or anything else you should come to OH.

Extension School

This section supplements the basic syllabus with additional details that apply to extension school students.

CS265 is a heavily research-oriented course that is structured very differently from other classes, valuing and promoting critical thinking. For most students this requires a transition phase. Please check the syllabus and requirements carefully before committing to this course.

In addition, keep in mind that taking this course successfully will in practice require participation in OH and Lab sessions. Even if they are not mandatory, they are critical for students to understand how to think about the material and how to design solutions. Especially if you do not have all the background described in the syllabus (i.e., if you have not taken a research oriented systems course with a heavy systems project), you should budget time for frequent participation in both Labs and OH and many hours of additional work every week to build the foundations needed.

Lecture: Lectures will be broadcast live. Lectures will also be available for on-demand broadcast within 24 hours after each class. Students will be able to watch the live or recorded broadcast through their browser. The link to the broadcasts for CS265 will be available through the Canvas website for this class and will also be posted on the class website before the first lecture.

Participation: Extension school students will be able to participate live in classes, office hours and labs via web-conference tools (we will use Zoom). The course staff will be online with Zoom during each session that is marked as “remote” and you will be able to actively interact with the staff. Other than standard chatting and talking features Zoom also offers screen sharing features which can be used for when you need help with specific issues such as debugging.

Capturing Discussions: Given that a big portion of the class is based on interaction, extension school in cooperation with the class staff is always working to set-up a system with several microphones across the classroom so we can accurately and clearly capture brainstorming discussions and comments during class time. Microphones will “follow” the instructor.

Grading: Even though we encourage extension school students to utilize the opportunity to interact with the staff and participate in class live we know that for practical reasons this will not be possible for all remote students. For this reason for extension school students there will be no “class participation” grade. The rest of the course is exactly the same as what local students do.

For this reason the portion of the class participation grade (20%) will be distributed to other class components and the final grade breakdown is as follows: project (50%), presentation (20%) and reviews (20%). The 10% for the midway check-in completes the grade distribution for extension school.

Discussion Forum: Given that remote students usually have a different set of needs due to the distance, there is a separate forum tailored to extension school. Look at the class website for the forum link for extension school students.

Office Hours and Labs: Extension school OH and Labs take place remotely during the weekend, which is typically more convenient for most students. In this way, we have more flexibility to accommodate students with day jobs who cannot attend during the week. The schedule will be posted at the beginning of the semester on the class website and the forum.

Starting Date: Note that usually extension school shows the class starting date to be one day after the actual starting date. In fact, this is when the first video will be available. However, extension school students will still be able to stream live the first class on the first day of classes and participate live as normal. Check the class website for the exact schedule if you want to participate live.

Graduate Credit: Extension school students who take the course for graduate credit should provide a detailed literature review of NoSQL key-value stores. This is due at the end of the semester along with the project and will account for 30% of the project grade. There will be a separate announcement early in the semester with guidelines on how to complete the literature review. You are most welcome to ask for feedback along the way from the staff.

Staff

Prof. Stratos Idreos Instructor (Room: SEAS 4.411)
Utku Sirin Teaching Fellow (Room: SEAS 4.435)
Qitong Wang Teaching Fellow (Room: SEAS 4.435)

Important Links

Semester Project

Each student completes one semester project. There are two kinds of semester projects: 1) a systems project (ML System or NoSQL System), and 2) a research project.

ML Systems Project

The ML Systems Project for CS265 is designed to provide hands-on experience with state-of-the-art systems for deep learning. It includes understanding the system architecture of modern deep learning frameworks, analyzing the compute-memory trade-offs involved in training deep learning models, and implementing an algorithm that navigates this trade-off. Systems projects will be done individually: each student is required to work on their own. This is a focused project that should not necessarily result in many lines of code (like the CS165 project), but will exercise your understanding of modern deep learning systems. The goal of this project is to implement an activation checkpointing algorithm in PyTorch. The project is structured into three stages: the first stage involves creating a profiler to gather performance metrics during the training process, the second stage involves implementing an algorithm that determines which activations to checkpoint based on the profiler's statistics, and the final stage requires modifying the execution strategy to implement the decisions made by the algorithm.

ML Systems Project: Activation Checkpointing »
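
To make the second stage described above more concrete, here is a minimal sketch of a selection step in plain Python. It is not the algorithm the project asks for, and it does not use PyTorch. The `ActivationStat` record, the `choose_recompute` function, the example numbers, and the greedy "most memory freed per millisecond of recomputation" heuristic are all assumptions made for illustration, as is the simplification that peak activation memory equals the sum of all retained activations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivationStat:
    """Hypothetical per-activation record that a profiler might produce."""
    name: str
    memory_bytes: int    # size of the activation kept for the backward pass
    recompute_ms: float  # time needed to recompute it from its inputs


def choose_recompute(stats, budget_bytes):
    """Greedily pick activations to discard and recompute later.

    Simplifying assumption: peak activation memory is the sum of all
    retained activations. Returns (names_to_recompute, retained_bytes,
    extra_recompute_ms).
    """
    retained = sum(s.memory_bytes for s in stats)
    # Prefer activations that free the most memory per ms of recomputation.
    candidates = sorted(
        stats,
        key=lambda s: s.memory_bytes / max(s.recompute_ms, 1e-9),
        reverse=True,
    )
    chosen, extra_ms = [], 0.0
    for s in candidates:
        if retained <= budget_bytes:
            break  # the budget is already met; keep everything else
        chosen.append(s.name)
        retained -= s.memory_bytes
        extra_ms += s.recompute_ms
    return chosen, retained, extra_ms


# Made-up profiler output for three activations.
stats = [
    ActivationStat("conv1_out", 400, 2.0),
    ActivationStat("relu1_out", 400, 0.5),
    ActivationStat("fc_out", 100, 4.0),
]
chosen, retained, extra_ms = choose_recompute(stats, budget_bytes=500)
print(chosen, retained, extra_ms)
```

In this made-up example, discarding the one cheap-to-recompute activation is enough to meet the budget. A real solution has to reason about when each activation is alive during the forward and backward passes, which this sketch ignores.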

NoSQL Systems Project

The NoSQL Systems Project is tailored to provide background on state-of-the-art systems, data structures and algorithms. It includes a design component and an implementation component in C or C++, dealing with low-level systems issues such as memory management, hardware-conscious processing, parallel processing, managing read/write tradeoffs and scalability. This year’s NoSQL systems project is about designing and implementing a key-value store in the form of a Log-Structured Merge (LSM) tree that can accommodate fast reads and writes. The key-value store design we will follow will resemble the state-of-the-art design used as a basis in numerous modern key-value stores, such as those at Facebook and LinkedIn, Cassandra, and many more. The first part of the project is about designing the basic structure of an LSM tree for reads and writes, while the second part is about designing and implementing the same functionality in a parallel way so we can support multiple concurrent reads and writes. This is a focused project: while it is not extremely heavy in terms of how much code you have to write, it will confront you with basic modern system design issues and tradeoffs. We will upload a detailed description on the class website before the beginning of the semester. Systems projects will be done individually, i.e., each student will have to work on the project on their own.

NoSQL Systems Project: LSM Trees »
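
As a rough illustration of the basic read/write structure mentioned above, the following is a toy, single-threaded, in-memory sketch of the LSM idea. It is written in Python only for brevity (the project itself is in C or C++) and it is not the project design. The class name `TinyLSM`, the `buffer_capacity` parameter, and the single-step `compact` method are assumptions for illustration. The sketch has no levels, disk storage, Bloom filters, or concurrency.

```python
import bisect

TOMBSTONE = object()  # marker recording that a key was deleted


class TinyLSM:
    """Toy LSM-style key-value store: a write buffer plus sorted runs."""

    def __init__(self, buffer_capacity=4):
        self.buffer_capacity = buffer_capacity
        self.buffer = {}  # mutable in-memory buffer that absorbs writes
        self.runs = []    # immutable sorted runs, newest first

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_capacity:
            self._flush()

    def delete(self, key):
        # Deletes are writes of a tombstone; older values are shadowed.
        self.put(key, TOMBSTONE)

    def get(self, key, default=None):
        if key in self.buffer:
            value = self.buffer[key]
            return default if value is TOMBSTONE else value
        for keys, values in self.runs:  # newest run first
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return default if values[i] is TOMBSTONE else values[i]
        return default

    def _flush(self):
        # Turn the buffer into a new sorted run and start a fresh buffer.
        items = sorted(self.buffer.items(), key=lambda kv: kv[0])
        self.runs.insert(0, ([k for k, _ in items], [v for _, v in items]))
        self.buffer = {}

    def compact(self):
        # Merge all runs into one; newer values overwrite older ones.
        merged = {}
        for keys, values in reversed(self.runs):  # oldest run first
            merged.update(zip(keys, values))
        live = sorted(
            ((k, v) for k, v in merged.items() if v is not TOMBSTONE),
            key=lambda kv: kv[0],
        )
        self.runs = [([k for k, _ in live], [v for _, v in live])] if live else []


db = TinyLSM(buffer_capacity=2)
db.put("a", 1)
db.put("b", 2)    # buffer is full, so it is flushed into a sorted run
db.put("a", 10)   # newer value for "a"
db.delete("b")    # tombstone for "b"; triggers a second flush
print(db.get("a"), db.get("b"), len(db.runs))
db.compact()
print(db.get("a"), db.get("b"), len(db.runs))
```

Writes only touch the buffer, which is what makes them fast. Reads may have to probe several runs, which is the read/write tradeoff that merging runs is meant to manage.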

Research Projects

The research project, on the other hand, is focused much more on design and proof-of-concept implementations that try to solve open problems. Research projects are tailored to give students a taste of research and to lead to publications. When working on a research project, students will work closely with the instructor and members of DASlab on active research projects of the lab. Students will work in groups of three. Such projects are mainly about thinking, reading and writing, and much less about coding, although proof-of-concept implementations will be our end target in some cases.

This year we will be working on the following research projects:

Research projects will be offered to students who have taken CS165 in the past and students who already have significant systems background. This will be done in consultation with the instructor.

In early February we will hold a special class to introduce both the systems project and the research projects in detail and this will be followed by a series of OH for clarifications. In the meantime students may browse the daslab website and learn more about the projects going on: http://daslab.seas.harvard.edu/, and the class website for examples of projects from past years.

In special cases where a student wants to work on an alternative research project, i.e., a project which is inspired by existing research that the student is already doing (e.g., as part of a PhD for a grad student or a continuation of the CS165 project for an undergrad) we will work to accommodate such requests on a case by case basis. This will be done in consultation with the instructor and only if students probably would not benefit from doing a systems project as they know this material already. Assuming there is a strong plan and drive for a specific project, such requests will most likely be granted.

What is a successful project?

For systems projects we will give out specific functionality and performance metrics you have to achieve as part of the description of the project. For research projects we will give out specific questions you need to answer when we set-up each individual research project.

Evaluation

There are no midterms and no final exam. At the end of the semester each student will have a meeting with the instructor and another meeting with the TFs where students will demonstrate their projects and answer design questions about the project. [Tip: Past experience shows that frequent participation in office hours, brainstorming sessions and sections means that the instructor and the TFs are very well aware of your system and your progress, which makes the final evaluation a mere formality in these cases.]

Collaboration Policy

The systems project is an individual project: the final deliverable should be personal, you must write from scratch all the code of your system and all documentation and reports. Discussing the design and implementation problems with other students is allowed and encouraged! We will do so in the class and during office hours and brainstorming sessions. Research projects are going to be in groups of three and similar to the systems project we encourage discussions across teams but in the end each team should deliver a project that is clearly theirs.

Late Day Policy

All projects are due at the end of the semester and this is when they will be graded. The more input you give us throughout the semester, though, the more we can help you learn. In the systems project description you can find a detailed time schedule that we propose you follow. Similarly, we will set up specific timelines for each research project. All timelines represent an ideal plan and you have the freedom to adjust according to your schedule.

There are no late days for reviews. This is because reviews are essential for you to follow each class.

Note: Experience says that every year a number of students cannot handle the freedom to self-pace, and end up significantly deviating from the schedule. We will send you frequent reminders but you should know that deviating from the schedule by more than a couple of weeks will most likely mean that you will not be able to finish the whole project by the end of the semester (unless you are an experienced systems student).

Midway Check-in

The goal here is to demonstrate that you are making decent progress and mainly to avoid falling behind. By early March each student working on a systems project should deliver 1) a design document, 2) a 45-minute presentation that describes the intended design for the whole project and, 3) at least two performance experiments that demonstrate early results (10%). A template of the expected design document will be provided early in the semester.

Previous Research Projects

Below you can find some highlighted research projects from previous years that can serve as inspiration of what to expect from the research project.

Many of our students in the past have successfully engaged in research projects with DASlab and published research papers. So far 11 undergraduate DASlab teams have made it to the finals of the ACM SIGMOD Undergraduate Research Competition. In 2016 we won first place with the work on adaptive denormalization, in 2017 we won first place with the work on evolving trees, in 2018 we won first place with the work on Splaying LSM-Trees, in 2019 we won first place with the work on LSM-Trees and B-Trees: The Best of Both Worlds, in 2020 we won first place with the work on Accurate Latency Prediction for Key-Value Storage Engines, in 2021 we won third place with the work on Learning Algorithms for Automatic Data Structure Design, and in 2022 we won first place with the work on Workload-Adaptive Filtering in Storage Engines.

Talk to the instructor at any point if you are excited about pursuing independent research during or after the course.

Class Philosophy

It's all about research

Learn to question everything

Discuss and develop ideas

Read, understand, review & improve state-of-the-art research

Previous Research Projects

First place SIGMOD UG Competition 2022

Workload-Adaptive Filtering in Storage Engines

Third place SIGMOD UG Competition 2021

Learning Algorithms for Automatic Data Structure Design.

First place SIGMOD UG Competition 2020

From Worst-Case to Average-Case Analysis: Accurate Latency Predictions for Key-Value Storage Engines.

Second place SIGMOD UG Competition 2020

MemFlow: Memory Aware Distributed Deep Learning

First place SIGMOD UG Competition 2019

LSM-Trees and B-Trees: The Best of Both Worlds

First place SIGMOD UG Competition 2018

Splaying LSM-Trees

First place SIGMOD UG Competition 2017

Evolving Trees

First place SIGMOD UG Competition 2016

Adaptive Denormalization

Adaptive Data Skipping

One Loop Does Not Fit All

Near-Data Processing