CS265

Overview

Big data is everywhere. A fundamental goal across numerous modern businesses and sciences is to exploit as many machines as possible, to consume as much information as possible, as fast as possible. The big challenge is "how to turn data into useful knowledge". This is far from a simple task, and it is a moving target as both the underlying hardware and our ability to collect data evolve. In this class, we will discuss how to design data systems and algorithms that can "scale up" and "scale out". Scale up refers to the ability to use a single machine to its full potential, i.e., to properly exploit the memory hierarchy and the multiple CPU and GPU cores. Scale out refers to the ability to use more than one machine (typically 100s or 1000s) effectively. We will use examples from several areas, including relational systems and distributed databases, graph processing systems (e.g., for social networks), key-value stores, NoSQL and NewSQL systems, as well as mobile computing and interactive analytics. In a fast-moving industry and research environment, such skills are in high demand.

Term: Spring 2018
Classes: Wednesday/Friday 4:00-5:30 PM
Join Class and Remote OH Live: https://zoom.us/j/9063672373
Level: Graduate (open to undergraduate students)
Office Hours:
  Stratos: MD139 @ 3-4 PM (W/Th)
  Stratos: Remote @ 3-4 PM (Fr)
  Kostas: Remote @ 2 - 4 PM (Sun)
  Mike: MD136 @ 4 - 5 PM (Th)
  Mike: Remote @ 4 - 5 PM (Sat)
  Wasay: MD136 @ 5:45 - 6:45 PM (Tue, W)
  Brian: MD136 @ 7 - 8 PM (M)
Reviewing the Reviews: MD136, Friday: 5:30 - 6:30 PM (For Remote and Local Students)

Class Philosophy

CS265 has unlimited office hours and unlimited late days for deliverables, relies on the latest research papers instead of a standard textbook, and bases its lectures on interaction and discussion instead of just lecturing. Most of all, it is fun! The instructor and the TFs are here to help you on all days and at all times throughout the semester. You may request as many meetings as you like and as much help as you want.

The class is also geared towards engaging creative thinking and problem solving to give students a feeling of how computer science research takes place. Many of our students in the past have successfully engaged in research projects with DASlab and published research papers.

While the instructor will do a few lectures through the semester, the class is going to be primarily discussion based. Think of this as an extended brainstorming session, a round table discussion about a specific problem in each class. The goal is to create the maximum possible interaction. Our discussion will aim at bringing up design trends and tradeoffs, as well as algorithmic issues. Another significant part of our discussions will focus on examining open problems and highlighting opportunities for innovation.

Classes

  1. In this class we will discuss the basics of data systems and the goals and structure of the course.

    Download Slides

    In this class we will discuss modern data systems architectures to present the basics of modern relational systems, graph systems, key-value stores and stream systems.

    Download Slides

    In this class we will continue the discussion about modern data systems architectures to present the basics of modern relational systems, graph systems, key-value stores and stream systems.

    Download Slides

    In this class, students will be introduced to research and systems projects.

    Download Slides
  2. (P) Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation. Harald Lang, Tobias Mühlbauer, Florian Funke, Peter A. Boncz, Thomas Neumann, Alfons Kemper. ACM SIGMOD International Conference on Management of Data. 2016

    (B) Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age. Viktor Leis, Peter A. Boncz, Alfons Kemper, Thomas Neumann. ACM SIGMOD International Conference on Management of Data. 2014

    (B) MonetDB/X100: Hyper-Pipelining Query Execution. Peter A. Boncz, Marcin Zukowski, Niels Nes. Conference on Innovative Data Systems Research (CIDR), 2005

    Download Slides

    (P) VectorH: Taking SQL-on-Hadoop to the Next Level. Andrei Costea, Adrian Ionescu, Bogdan Raducanu, Michal Switakowski, Cristian Bârca, Juliusz Sompolski, Alicja Luszczak, Michal Szafranski, Giel de Nijs, Peter A. Boncz. ACM SIGMOD International Conference on Management of Data. 2016

    (B) SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures. Avrilia Floratou, Umar Farooq Minhas, Fatma Özcan. Proceedings of the Very Large Databases Endowment (PVLDB), 2014

    (B) HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Alexander Rasin, Avi Silberschatz. Proceedings of the Very Large Databases Endowment (PVLDB), 2009

    Download Slides

    (P) H2O: A Hands-free Adaptive Store. Ioannis Alagiannis, Stratos Idreos and Anastasia Ailamaki. ACM SIGMOD International Conference on Data Management, 2014

    (B) Efficiently Compiling Efficient Query Plans for Modern Hardware. Thomas Neumann. Proceedings of the Very Large Databases Endowment (PVLDB), 2011

    Download Slides

    (P) Fast Scans on Key-Value Stores. Markus Pilman, Kevin Bocksrocker, Lucas Braun, Renato Marroquín, Donald Kossmann. International Conference on Very Large Databases (VLDB), 2017

    (B) On the design and scalability of distributed shared-data databases. S. Loesing, M. Pilman, T. Etter, and D. Kossmann. ACM SIGMOD International Conference on Data Management, 2015

    Download Slides

    (P) The adaptive radix tree: ARTful indexing for main-memory databases. Viktor Leis, Alfons Kemper, Thomas Neumann. International Conference on Data Engineering (ICDE), 2013

    (B) A study of index structures for main memory database management systems. T. J. Lehman and M. J. Carey. International Conference on Very Large Databases (VLDB), 1986

    (B) Cache conscious indexing for decision-support in main memory. J. Rao and K. Ross. International Conference on Very Large Databases (VLDB), 1999

    Download Slides

    (P) The TileDB Array Data Storage Manager. Stavros Papadopoulos, Kushal Datta, Samuel Madden, Timothy Mattson. International Conference on Very Large Databases (VLDB), 2017

    (B) Overview of SciDB: Large Scale Array Storage, Processing and Analysis. P. G. Brown. ACM SIGMOD International Conference on Data Management, 2010

    Download Slides
  3. (P) Skipping-oriented Partitioning for Columnar Layouts. Liwen Sun, Michael Franklin, Jiannan Wang, Eugene Wu. International Conference on Very Large Databases (VLDB), 2016

    (B) Small Materialized Aggregates: A Light Weight Index for Data Warehousing. Guido Moerkotte. International Conference on Very Large Databases (VLDB), 1998

    Download Slides

    (P) The End of a Myth: Distributed Transactions Can Scale. Erfan Zamanian, Carsten Binnig, Tim Kraska, Tim Harris. Proceedings of the Very Large Databases Endowment (PVLDB), 2017

    (B) The end of slow networks: It’s time for a redesign. Carsten Binnig, Andrew Crotty, Alex Galakatos, Tim Kraska, Erfan Zamanian. ACM SIGMOD International Conference on Data Management, 2016

    Download Slides

    Niv Dayan

    (P) GeckoFTL: Scalable Flash Translation Techniques For Very Large Flash Devices. Niv Dayan, Philippe Bonnet, and Stratos Idreos. ACM SIGMOD International Conference on Data Management, 2016

    Download Slides

    (P) OrpheusDB: Bolt-on Versioning for Relational Databases. Silu Huang, Liqi Xu, Jialin Liu, Aaron J. Elmore, Aditya Parameswaran. Proceedings of the Very Large Databases Endowment (PVLDB), 2017

    (B) Decibel: The relational dataset branching system. Michael Maddox, David Goehring, Aaron J. Elmore, Sam Madden, Aditya Parameswaran, Amol Deshpande. ACM SIGMOD International Conference on Data Management, 2016

    Download Slides

    (P) MaSM: Efficient Online Updates in Data Warehouses. Manos Athanassoulis, Shimin Chen, Anastasia Ailamaki, Phillip B. Gibbons, and Radu Stoica. ACM SIGMOD International Conference on Data Management, 2011

    Download Slides

    (P) Ground: A Data Context Service. Joseph M. Hellerstein, Vikram Sreekanti, Joseph E. Gonzalez, Sudhanshu Arora, Arka Bhattacharyya, Shirshanka Das, Akon Dey, Mark Donsky, Gabriel Fierro, Sreyashi Nag, Krishna Ramachandran, Chang She, Eric Sun, Carl Steinbach, Venkat Subramanian. Conference on Innovative Data Systems Research (CIDR), 2017

    (B) ProvDB: Lifecycle Management of Collaborative Analysis Workflows. Hui Miao, Amit Chavan, Amol Deshpande. Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics (HILDA), 2017

    Download Slides

    (P) MacroBase: Prioritizing Attention in Fast Data. Peter Bailis, Edward Gan, Samuel Madden, Deepak Narayanan, Kexin Rong, and Sahaana Suri. ACM SIGMOD International Conference on Data Management, 2017

    (B) Scorpion: Explaining away outliers in aggregate queries. Eugene Wu and Samuel Madden. International Conference on Very Large Databases (VLDB), 2013

    Download Slides

    (P) RUMA has it: Rewired User-space Memory Access is Possible! Felix M. Schuhknecht, Jens Dittrich, and Ankur Sharma. Proceedings of the Very Large Databases Endowment (PVLDB), 2017

    (Read as necessary) What Every Programmer Should Know About Memory. Ulrich Drepper. 2007

    Download Slides

FAQ

Data systems are literally everywhere. We use them directly or indirectly every day, all day long, for numerous basic or not so basic tasks, from buying coffee to booking airplane tickets. They provide the backbone of all modern businesses to manage their data and of course they provide the backbone of online businesses and environments such as social networks and search engines. They are also used increasingly in science as data analytics becomes more and more the fundamental barrier in generating knowledge.

This class is not a traditional introduction on how we use a database system and how to write SQL. Instead, this is a systems class about data system design. You will learn how big data systems work at their core and how to design new systems for emerging applications and hardware. By the way, if you know how systems work, you also become better at using them!

Data is everywhere. Every year we create even more data. As it stands, every two days we create as much data as we created from the dawn of humanity up to 2003 [Eric Schmidt, Google]. Sciences, businesses, and everyday life are substantially affected. Data systems are in the middle of all this. Data systems are how we store and access data, i.e., they are the backbone of any data-driven application. It is a $100B industry, growing 10% every year [Economist, “Data, data everywhere”].

At the same time data systems research and the whole industry are going through a major and continuous transition; given that new data-driven scenarios and applications continuously pop up, there is a continuous need to redefine what is a good data system design in such a dynamic environment.

CS265 exposes students to the core internals of data systems making it possible to understand core trends in system design and to be one of the few who know how to design and evaluate systems. In addition, due to the way the course is taught (focus on interactive problem solving, open topics and the latest research results) this is also a great class for those who want to understand what CS research is all about and how to engage in doing research.

  1. Learn state-of-the-art research and industry trends in big data systems.
  2. Understand the tradeoffs in designing and implementing modern big data systems.
  3. Be able to make design decisions in big data driven scenarios.
  4. Develop basic research skills: reading, writing and understanding research papers.
  5. Deepen C programming, debugging, and performance profiling skills.

Efficient data analytics and system design is all about how we store and access the data. In this class, you are going to see how the same concepts appear again and again in numerous data-driven scenarios.

If you took and liked CS165, you will like CS265 as well. From a material point of view CS265 moves on to consider additional topics as a continuation of CS165 such as distributed processing, transaction processing, graph processing and more. In terms of the way the class is taught, it is even more interactive, and even more research oriented. That is because semester projects are actually on open research problems with the potential to lead to a publication and every class is focused on a single research paper, and understanding it in detail.

You may have heard stories about CS165 and be wondering whether CS265 is going to be equally hard, or you may have taken CS165 and be wondering whether this is going to be a similar amount of work. CS165 and CS265 are different styles of class. While CS165 is much more focused on implementation leading to a full system prototype, CS265 is more focused on ideas and design. In other words, you may have written 5-10K lines of code (some even more!) for CS165, but in CS265 you are more likely to write small amounts of code and mostly play with alternative ways to design a specific functionality, structure or algorithm, both to highlight the effect of different choices and to find new ways to solve a specific problem.


IF you have taken CS165, CS161, CS261
  GOTO next question;
ELSE
  see below;

Background: Naturally, the more background you have the smoother your experience in 265 will be. Prior knowledge of C programming and systems programming, as well as a good understanding of computer architecture and in particular the memory hierarchy (cache memories) is very important for this class. Courses providing systems background (like CS50 and in particular CS61 or equivalent) are essential. Good hacking, algorithm designing, and data structures skills are also required.

If you are a graduate student and have taken a mix of systems (database, operating and distributed systems) classes in the past, then you will be OK and we will provide enough background so you can follow. CS265 does satisfy the systems requirement towards a PhD.

If you are a senior in college and this is your last chance to take this class: if you have taken CS61 but not CS161 or CS165, then talk to the instructor to evaluate how fit you are for the class. If you have not taken CS61 but do have significant systems programming experience, you may still qualify.

In all other cases, it is a better idea to take CS165 first.

Just utilize all resources provided. Show up in class to participate in interactive sessions. There are also daily office hours and labs; show up as often as possible so we can help with anything you need! When you find yourself stuck with the project either with a design decision or just a bug, it is normal to struggle for a while — it is part of the learning process — but after some time grab your laptop and come by!

Especially if you have not taken CS165 it is a good idea to spend some time preparing before the semester starts and during the early weeks of the semester even if you consider yourself an expert systems student. The best approach is to browse some fundamental readings in data systems architectures. We propose that you take a look at the following texts from the CS165 readings:

  1. Get familiar with the very basics of traditional database architectures: Architecture of a Database System. By J. Hellerstein, M. Stonebraker and J. Hamilton. Foundations and Trends in Databases, 2007
  2. Get familiar with the very basics of modern database architectures: The Design and Implementation of Modern Column-store Database Systems. By D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden. Foundations and Trends in Databases, 2013
  3. Get familiar with the very basics of modern large scale systems: Massively Parallel Databases and MapReduce Systems. By Shivnath Babu and Herodotos Herodotou. Foundations and Trends in Databases, 2013

Project 0: We provide a Project 0 that is designed to 1) help you get an idea about how fit you are for the class and 2) bootstrap your C coding skills. Essentially Project 0 consists of an independent data structure design and implementation in C that will allow you to practice basic system design, coding and debugging skills. In addition, several fundamental section videos are posted on the class website about system coding and profiling to help you with that.

Lectures will be broadcast live Wednesdays/Fridays 4-5:30pm. Lectures will also be available for on-demand viewing within 24 hours after each class. Students will be able to watch the live or recorded broadcast through their browser using the Matterhorn player. The link to the broadcasts for CS265 will be available through the Canvas website for this class and will also be posted on the class website before the first lecture.

Extension school students will be able to participate live in classes, office hours and labs via web-conference tools (we will use Zoom). The course staff will be online with Zoom during each session and you will be able to actively interact with the staff. Other than standard chatting and talking features Zoom also offers screen sharing features which can be used for when you need help with specific issues such as debugging. Here is a link that explains how to install and use Zoom during class, Labs and OH.

Capturing Discussions: Given that a big portion of the class is based on interaction, extension school in cooperation with the class staff is working to set up a system with several microphones across the classroom so we can accurately and clearly capture brainstorming discussions and comments during class time. Microphones will “follow” the instructor.

Grading: Even though we encourage extension school students to utilize the opportunity to interact with the staff and participate in class live, we know that for practical reasons this will not be possible for all remote students. For this reason, for extension school students there will be no “class participation” grade, and the portion of this grade (20%) will be redistributed across the project (50%), presentation (20%) and reviews (20%). The midway check-in (10%) completes the grade distribution for extension school.

Piazza: To participate in piazza you need a Harvard email address. If you do not have one you can create one by clicking here.

Office Hours and Labs: If none of the existing slots for office hours and labs work for you (e.g., due to time differences), we will include additional slots; just let us know.

Starting Date: Note that usually extension school shows the class starting date to be one day after the actual starting date (which is Wednesday January 24, at 4pm). In fact, this is when the first video will be available. However, extension school students will still be able to stream live the first class on January 24 and participate live as normal.

You are responsible for understanding Harvard and Harvard Extension School policies on academic integrity and how to use sources responsibly. Not knowing the rules, misunderstanding the rules, running out of time, submitting "the wrong draft", or being overwhelmed with multiple demands are not acceptable excuses. There are no excuses for failure to uphold academic integrity. To support your learning about academic citation rules, please visit the Harvard Extension School Tips to Avoid Plagiarism, where you'll find links to the Harvard Guide to Using Sources and two, free, online 15-minute tutorials to test your knowledge of academic citation policy. The tutorials are anonymous open-learning tools.

Harvard and the Extension School are committed to providing an accessible academic community. The Disability Services Office offers a variety of accommodations and services to students with documented disabilities. Please visit http://www.extension.harvard.edu/resources-policies/resources/disability-servicesaccessibility for more information and do not hesitate to contact Prof. Idreos directly, by email, with any questions or concerns you might have.

Sections

Sections are offered only online as pre-recorded videos. We will be adding section videos through the semester based on the needs of the students. If you have questions for any one of the sections come by one of our daily OH and Labs.

  1. These are a series of sections from CS165 which are useful for students who do not have prior experience or need to refresh their knowledge in C, and development tools needed for C programming.

    In this section we will cover the basic knowledge of C required for this project. We will discuss the main features of the language and will focus on pointer arithmetic, a key challenge when programming in C.

    Section Recording
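
    As a small warm-up for the pointer arithmetic topic of this section, here is a minimal sketch. It is not taken from the section material; the function names sum_with_pointers and element_distance are hypothetical and exist only for illustration. It shows that adding 1 to an int pointer advances by one element (sizeof(int) bytes), and that subtracting two pointers into the same array yields a count of elements.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Illustrative only: these helper names are assumptions, not course code. */

    /* Sum n ints by walking a pointer instead of indexing.
     * ++p advances by one element (sizeof(int) bytes), not by one byte. */
    long sum_with_pointers(const int *values, size_t n) {
        assert(values != NULL);
        long total = 0;
        const int *end = values + n;      /* one past the last element */
        for (const int *p = values; p != end; ++p) {
            total += *p;
        }
        return total;
    }

    /* Number of elements (not bytes) between two pointers
     * that point into the same array. */
    ptrdiff_t element_distance(const int *first, const int *last) {
        assert(first != NULL && last != NULL);
        return last - first;
    }
    ```

    Note that values[i] and *(values + i) are the same expression in C, which is why both styles can be mixed freely.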

    This section introduces Project 0 (warm-up exercise), a standalone programming exercise that has two goals. First, it will help the students to understand the coding effort and skills needed to carry on the full project, by implementing a hash table. Second, it will be an integral part that can be used as-is in the full project.

    Section Recording

    The section of this week is dedicated to memory hierarchy. Understanding memory hierarchy and how cache memory works is crucial for understanding how to build an efficient cache-aware data system. Hence, here, we will start from the basics of memory hierarchy, covering how caching works, what is an L3 and L2 shared cache, and what is an L1 private cache. We will discuss the differences between instruction and data caches and we will discuss how programs incur cache misses and how this affects performance.

    Slides Code Section Recording
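
    To make the effect of cache misses concrete, here is a minimal sketch, assuming a matrix stored as one flat row-major array (the function names are hypothetical, not from the section code). Both functions compute the same sum; the first touches memory sequentially, while the second jumps by a full row on every access, which on large matrices typically causes many more cache misses. The actual difference depends on the machine and the matrix size, so it is best measured rather than assumed.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Illustrative only: a rows x cols matrix stored row by row in one array. */

    /* Sequential access: consecutive iterations touch adjacent addresses,
     * so each cache line that is fetched is fully used. */
    long sum_row_major(const int *m, size_t rows, size_t cols) {
        assert(m != NULL);
        long total = 0;
        for (size_t r = 0; r < rows; ++r) {
            for (size_t c = 0; c < cols; ++c) {
                total += m[r * cols + c];
            }
        }
        return total;
    }

    /* Strided access: consecutive iterations are cols elements apart,
     * so a large matrix may use only one int per fetched cache line. */
    long sum_column_major(const int *m, size_t rows, size_t cols) {
        assert(m != NULL);
        long total = 0;
        for (size_t c = 0; c < cols; ++c) {
            for (size_t r = 0; r < rows; ++r) {
                total += m[r * cols + c];
            }
        }
        return total;
    }
    ```

    Timing the two functions on a matrix much larger than the last-level cache is one way to observe the behavior discussed in this section.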

    Development Tools

    In this section we will discuss important development tools. We will talk about debugging tools including [c]gdb and valgrind, and the build tool GNU Make. We will do so by example. The example code is available in the git repository listed below.

    Last year's notes on git, valgrind, gcc, and gdb Section Recording

    Section Git repository
     
    Editor tutorials
    A guided tour of Emacs
    Learn Vim Progressively
    Getting started with Sublime Text 3
     
    Navigating Ctags
    Ctags with Vim
    Ctags with Emacs
    Setting up Ctags in Sublime Text
     
    Additional resources

    Automatic variables in Make
    Secondary expansion in Makefiles
    Last year's notes on git, valgrind, gcc, and gdb

    After having a clear understanding of memory hierarchy, in this section, we will discuss techniques that allow us to build cache-conscious algorithms. We will discuss how to minimize cache misses and how to avoid branch mispredictions by removing branches altogether from our code.

    Slides Code Section Recording
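
    One common way to remove a branch, sketched here under assumptions, is predication in a selection loop. The function names and the choice of a "less than" predicate are hypothetical and not taken from the section code. The branching version writes a position only when the predicate holds; the branch-free version always writes and then advances the output cursor by the 0/1 result of the comparison.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Illustrative only. Both functions store the positions of all values
     * smaller than threshold into out (capacity of at least n entries)
     * and return how many positions qualified. */

    /* Branching version: the if is hard to predict when about half
     * of the values qualify. */
    size_t select_less_than_branching(const int *values, size_t n,
                                      int threshold, size_t *out) {
        assert(values != NULL && out != NULL);
        size_t count = 0;
        for (size_t i = 0; i < n; ++i) {
            if (values[i] < threshold) {
                out[count++] = i;
            }
        }
        return count;
    }

    /* Branch-free version: always write the position, then advance the
     * cursor by 1 only if the predicate is true. Since count <= i,
     * the write never goes past index n - 1. */
    size_t select_less_than_branch_free(const int *values, size_t n,
                                        int threshold, size_t *out) {
        assert(values != NULL && out != NULL);
        size_t count = 0;
        for (size_t i = 0; i < n; ++i) {
            out[count] = i;
            count += (size_t)(values[i] < threshold);
        }
        return count;
    }
    ```

    The branch-free version does more writes, so it is a tradeoff rather than a guaranteed win, and an optimizing compiler may transform either loop; checking the generated code and measuring is the safest way to confirm what actually runs.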

    In this section we will address performance optimizations and techniques to build high performance code in the context of the project. We will discuss performance monitoring tools (perf) which allow us to know exactly where the execution time goes and help us understand whether our implementation is efficient, and where any possible performance bottlenecks are.

    Handout Section Recording

Class Structure

Interaction in every class
While the instructor will do a few lectures through the semester, the class is going to be primarily discussion based. Think of this as an extended brainstorming session, a round table discussion about a specific problem in each class. The goal is to create the maximum possible interaction.

Our discussion will aim at bringing up design trends and tradeoffs, as well as algorithmic issues. Another significant part of our discussions will focus on examining open problems and highlighting opportunities for innovation.

At the very beginning of the semester the instructor will do 4-5 lectures to provide the necessary background. After that, each class will be based on a student presentation about a recent research paper which will work as a trigger for the day’s brainstorming. Depending on the needs of the class, the instructor will do additional lectures during class time or during our extra research sessions.

Attendance
Based on the philosophy of the course, attendance in lectures, labs and office hours is optional. The best way to learn, though, is through discussion and interaction with the instructor and the TFs. Our classes are not about “lecturing” – they are about interaction. We hope to see you there!

Class recordings
All classes and interactive sessions in class will be recorded and will be available online. So even if you miss a class it will be easy to catch up, and you can also use these recordings to revisit specific material throughout the semester (e.g., to prepare for midterms).

No laptop/phone policy
CS265 is based on interaction. We want students actively participating in class and interactive sessions, asking and answering questions to maximize learning. In each class, we will bring a printed copy of the slides for each student so you can follow along and keep notes on paper. So you do not need your laptop or phone for notes or for looking up the slides online. In fact, recent studies show that even if you only use a laptop for note taking, it can have a negative impact on how well you understand the material in class [1].

Guest lectures
Every semester we arrange a few guest lectures by leaders in data system design from industry and academia. Past guest lecturers in our classes include: Guy Lohman from IBM Research, Erietta Liarou from EPFL Lausanne, Alkis Simitsis and Georgia Koutrika from HP Labs, Nikita Shamgunov from MemSQL, Laura Haas from IBM Research, Nga Tran from Vertica, Jignesh Patel from University of Wisconsin, Johannes Gehrke from Microsoft and Marcin Zukowski from Snowflake.

You will get the opportunity to both attend a guest lecture and to actively participate in discussions with our guest speakers.

1. [The Pen Is Mightier Than the Keyboard: Advantages of Longhand Over Laptop Note Taking. Pam A. Mueller and Daniel M. Oppenheimer. Psychological Science. 2014, Vol. 25(6) 1159–1168]

Another component of the course is sections. Sections are used to deliver material about the class, i.e., to go more deeply into some of the concepts discussed in class, to do additional quizzes, or to deliver background material that is needed to follow next week’s class or for the project. There will be no actual section meeting. Instead, all sections will be recorded by the teaching staff and videos will be posted online. The material posted will be tailored to present a step by step guide for any of the topics presented to make it easy to follow everything without having to be physically present in an actual section. However, if there are still questions about the material presented in sections, you will be able to ask those questions during the daily office hours and labs.

Throughout the semester, on select Tuesday evenings, the instructor and DASlab PhDs and postdocs will discuss research! First, DASlab researchers will present their recent work on data systems research and connect it with the material you are learning in class. Then, you will get the chance to talk with them about their research and open problems, and be exposed to open research opportunities. Snacks and drinks will be provided.

The class is about state-of-the-art data system design. There is no textbook for that. Thus, we use recent research papers and surveys which will be posted on the course website, which you will have access to through the Harvard network.

We provide feedback continuously. The main thing that you will need feedback on is your semester project and the paper reviews. The way to get feedback is to show up to our daily office hours and labs and share your design decisions, code, and test results with the staff. In this way, you will get hands-on help and feedback.

Specifically for reviews, we will hold a special session every second week to “review the reviews”.

Each student will provide two paper reviews per week. This prepares you to be ready for the discussion in class. Each review should have text for at least the following 9 points (please make sure you follow the question answer format for the reviews):

  1. what is the problem?
  2. why is it important?
  3. why is it hard?
  4. why do existing solutions not work?
  5. what is the core intuition for the solution?
  6. does the paper prove its claims?
  7. what is the setup of analysis/experiments? is it sufficient?
  8. are there any gaps in the logic/proof?
  9. describe at least one possible next step.

Reviews should be no more than two pages long. PDF. Single column. 8pt font. 1 inch margins. Submission will be through Canvas. The deadline for each paper review is the starting time of the respective class. This is a hard deadline. The first four reviews will not be graded; we will use them only to provide feedback on the quality of the review and the grade that this review would get. Every second week we will have a special OH meeting to review the student reviews.

Each student will do at least one paper presentation during the semester. Presentations should follow similar guidelines as the guidelines for reviews. There should be 1-2 slides for each one of the nine core points in the review guidelines. In addition, there should be detailed slides that describe the core idea of the paper.

Your slides should not be multiple sheets of bullet lists; in fact, try to avoid bullet lists altogether. Your slides should follow the generic formatting you will see in the first four lectures, that is: make slides as simple as possible; avoid text unless absolutely needed; use no full phrases unless you need to give an exact definition of something; use figures and visual examples; and follow the rule of one slide, one message, i.e., each slide should have a single goal that you should be able to describe within a single phrase.

Your slides should be reviewed by the instructor during OH at least 24 hours before the class in which you are presenting. The final deck of slides should be available 30 minutes before class so we can upload it online. You are welcome to join OH for help while you prepare your slides!

Register for a slot here.

Logistics

The class meets twice a week: Wednesdays and Fridays 4:00-5:30pm. Room MDG125. Class starts at 4:10pm. You might be thinking that this is a weird day and time-slot. The idea is to minimize conflicts with other classes.

Interaction does not stop at lecture time. CS265 is designed to maximize interaction as we truly believe this is the best way to learn; we offer daily office hours and labs.

Starting Week 1, Prof. Idreos will hold office hours every W/T/F 3-4pm in his office, MD139. Additional OH will be offered periodically during the weekend. Labs are offered by the TFs (rooms and slots TBA). The goal of labs is to get hands-on help for the projects (coding). Bring your laptop and your questions about specific project parts you need help with. Labs are the place to go when you have a persistent bug, when you need help with a specific tool for the project (e.g., for debugging or performance testing) or to get feedback about the quality of your coding.

We will also offer extra weekend office hours and labs as needed.

  • Class discussions: 20%
  • Paper reviews: 15%
  • Paper presentation: 15%
  • Semester project: 40%
  • Midway check-in: 10%

We will use Piazza for online discussions. The links are posted in the Quicklinks menu.

We continuously monitor Piazza and will be answering your questions promptly. In past offerings the average response time was in the order of a few minutes. So you basically have access to the teaching staff all day long. You are welcome to post any question that might help you understand the material better or help you with the project. Anonymous posting (to the other students) will be enabled so that students feel more comfortable posting questions.

BASIC RULES FOR PIAZZA: We only have a few basic rules so we can keep the forum functional and useful for the students as well as manageable for the staff.

  1. We ask that you first search the forum well before posting a question so that we do not have duplicate entries.
  2. Please make sure to stay on top of all staff posts (especially those that are pinned). Anything we post in Piazza we consider “known”.
  3. Do not use Piazza to post code or to ask for help with debugging. While it can work in some cases, remote debugging is a pain and takes a lot of time. We have labs every day. Bring your laptop and we will help you on site, or join remotely and we will help you via a shared screen.
  4. Do not use Piazza for anything other than a technical question or a question about class logistics. If you want to discuss any concerns about your progress, fitness for the class or anything else, you should come to OH.

Projects

Every student is expected to complete and present a substantial class project during the semester. Class projects can take the form of either a systems or a research project. We outline the differences between the two categories below.

Systems projects are tailored to provide background on state-of-the-art systems, data structures and algorithms. They include a design component and an implementation component in C or C++, dealing with low-level systems issues such as memory management, hardware-conscious processing, parallel processing, managing read/write tradeoffs and scalability. This year’s systems project is about designing and implementing a key-value store in the form of a Log-Structured Merge (LSM) tree that can accommodate fast reads and writes. The key-value store design will resemble the state-of-the-art design used as a basis in numerous modern key-value stores, such as those used at Facebook and LinkedIn, Cassandra, and many more. The first part of the project is about designing the basic structure of an LSM tree for reads and writes, while the second part is about designing and implementing the same functionality in a parallel way so we can support multiple concurrent reads and writes. This is a focused project: it is not extremely heavy in terms of how much code you have to write, but it will confront you with basic modern system design issues and tradeoffs. We will upload a detailed description to the class website before the beginning of the semester. Systems projects will be done individually, i.e., each student will have to work on the project on their own.

Systems Project: LSM Trees »
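
To make the read/write path described above concrete, here is a minimal in-memory sketch of the LSM idea, assuming C++17. It is an illustration only and not the interface or design from the project description: the class name TinyLSM, the integer keys and values, the buffer capacity parameter and the single-step compact() are all hypothetical choices. Writes land in a small sorted buffer, a full buffer is flushed as an immutable sorted run, and reads consult the buffer first and then the runs from newest to oldest. Persistence to disk, deletes, multiple levels and concurrency are deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical toy LSM tree: an in-memory buffer plus a list of sorted runs.
class TinyLSM {
public:
    explicit TinyLSM(std::size_t buffer_capacity) : capacity_(buffer_capacity) {
        assert(buffer_capacity > 0);
    }

    // Writes go to the buffer; once it is full it is flushed as a sorted run.
    void put(int key, int value) {
        buffer_[key] = value;
        if (buffer_.size() >= capacity_) flush();
    }

    // Reads check the buffer first, then the runs from newest to oldest,
    // so the most recently written value for a key wins.
    std::optional<int> get(int key) const {
        if (auto it = buffer_.find(key); it != buffer_.end()) return it->second;
        for (auto run = runs_.rbegin(); run != runs_.rend(); ++run) {
            auto pos = std::lower_bound(
                run->begin(), run->end(), key,
                [](const std::pair<int, int>& entry, int k) { return entry.first < k; });
            if (pos != run->end() && pos->first == key) return pos->second;
        }
        return std::nullopt;
    }

    // Merge all runs into a single run, keeping only the newest value per key.
    void compact() {
        std::map<int, int> merged;
        for (const auto& run : runs_)                        // oldest run first
            for (const auto& [k, v] : run) merged[k] = v;    // newer values overwrite older ones
        runs_.clear();
        if (!merged.empty()) runs_.emplace_back(merged.begin(), merged.end());
    }

    std::size_t run_count() const { return runs_.size(); }

private:
    using Run = std::vector<std::pair<int, int>>;

    // std::map iterates in key order, so the new run is already sorted.
    void flush() {
        runs_.emplace_back(buffer_.begin(), buffer_.end());
        buffer_.clear();
    }

    std::size_t capacity_;
    std::map<int, int> buffer_;
    std::vector<Run> runs_;
};
```

With a buffer capacity of 2, inserting keys 1 and 2 triggers a flush, and a later put(1, 11) shadows the older value of key 1 in the flushed run until compact() merges everything. The tradeoff this sketch hints at is the one the project explores: more runs make writes cheap but reads slower, while merging runs does the opposite.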

The research project, on the other hand, is geared much more toward design and proof-of-concept implementations that try to solve open problems. Research projects are tailored to give students a taste of research and can lead to publications. When working on a research project, students will work closely with the instructor and members of DASlab on active research projects of the lab. Students will work in groups of three. Such projects are mainly about thinking, reading and writing and much less about coding, although proof-of-concept implementations will be our end target.

This year we will be working on the following research projects:

  1. Self-designing Data Systems
  2. Designing Access Methods and Balancing Tradeoffs
  3. CrimsonDB: A Self-Designing Key-Value Store

Research projects will be offered to students who have taken CS165 in the past and students who already have significant systems background. This will be done in consultation with the instructor.

In mid-February we will hold a special class to introduce both the systems project and the research projects in detail, and this will be followed by a series of OH for clarifications. In the meantime students may browse the websites of the research projects to get an idea of the work involved:

  1. http://daslab.seas.harvard.edu/evolution/
  2. http://daslab.seas.harvard.edu/rum-conjecture/
  3. http://daslab.seas.harvard.edu/crimsondb/

and the class website for examples of projects from past years.

For systems projects we will give out specific functionality and performance metrics you have to achieve as part of the description of the project. For research projects we will give out specific questions you need to answer when we set up each individual research project.

In special cases where a student wants to work on an alternative research project, i.e., a project which is inspired by existing research that the student is already doing (e.g., as part of a PhD for a grad student or a continuation of the CS165 project for an undergrad) we will work to accommodate such requests on a case by case basis. This will be done in consultation with the instructor. Assuming there is a strong plan and drive for a specific project, such requests will most likely be granted.

There are no final or midterm exams. At the end of the semester each student will have a meeting with the instructor and another meeting with the TFs where students will demonstrate their projects and answer design questions about the project. [Tip: Past experience shows that frequent participation in office hours, brainstorming sessions and sections means that the instructor and the TFs are well aware of your system and your progress, which makes the final evaluation a mere formality in these cases.]

The systems project is an individual project: the final deliverable should be personal, and you must write from scratch all the code of your system and all documentation and reports. Discussing the design and implementation problems with other students is allowed and encouraged! We will do so in class and during office hours and brainstorming sessions. Research projects are going to be in groups of three and, as with the systems project, we encourage discussions across teams, but in the end each team should deliver a project that is clearly theirs.

All projects are due at the end of the semester and this is when they will be graded. The more input you give us throughout the semester, though, the more we can help you learn. In the systems project description you can find a detailed schedule that we propose you follow. Similarly, we will set up specific timelines for each research project. All timelines represent an ideal plan and you have the freedom to adjust them according to your schedule.

There are no late days for reviews. This is because reviews are essential for you to follow each class.

Note: Experience says that every year a number of students cannot handle the freedom to self-pace, and end up significantly deviating from the schedule. We will send you frequent reminders but you should know that deviating from the schedule by more than a couple of weeks will most likely mean that you will not be able to finish the whole project by the end of the semester (unless you are an experienced systems student).

The goal of the midway check-in is to demonstrate that you are making decent progress and, mainly, to avoid falling behind. By March 9 each student should deliver a design document and a 10-minute presentation that describe the intended design for the whole project, along with at least one performance experiment that demonstrates an early result (10%). A template of the expected design document will be provided early in the semester.

Previous Research Projects

Below you can find some highlighted research projects from previous years that can serve as inspiration of what to expect from the research project.

Many of our past students have successfully engaged in research projects with DASlab and published research papers based on their CS165/265 project. So far, five students have made it to the finals of the ACM SIGMOD Undergrad Research Competition (2015 through 2017). In both 2016 and 2017, we won first place with the work on adaptive denormalization and evolving trees, respectively. In addition, one of the projects won a CRA honorable mention in 2016. The project and the classes will give you plenty of ideas for new problems to work on.