CS165
Have fun learning to design and build modern data systems
Class: Mon/Wed 4-5:30 pm, Pierce 301  Office hours by Stratos: Mon/Wed 5:30-6:30pm, Tue/Thu/Fri 3-4pm, MD 139 
Labs: Mon 6:30-7:30 pm, MD 123; Tue 6-7 pm, MD 123; Wed 6:30-7:30 pm, MD 223;
Thu 6-7 pm, MD 123; Fri 4-5pm, MD 221 (local only); Sat 4-5pm (zoom only)
Research/Brainstorming Sessions: TBA


  1. Introduction

    What is this class about?
    We are in the big data era and data systems sit in the critical path of everything we do. We are going through major transformations in businesses, sciences, as well as everyday life - collecting and analyzing data changes everything and data systems provide the means to store and analyze a massive amount of data. This course is a comprehensive introduction to modern data systems. The primary focus of the course is on the modern trends that are shaping the data management industry right now: column-store and hybrid systems, shared-nothing architectures, cache-conscious algorithms, hardware/software co-design, main-memory systems, adaptive indexing, stream processing, scientific data management, and key-value stores. We also study the history of data systems, traditional and seminal concepts and ideas such as the relational model, row-store database systems, optimization, indexing, concurrency control, recovery and SQL. In this way, we discuss both how and why data systems evolved over the years, as well as how these concepts apply today and how data systems might evolve in the future. We focus on understanding concepts and trends rather than specific techniques that will soon be outdated - as such the class relies largely on recent research material and on a semi-flipped class model with a lot of hands-on interaction in each class.

    What is this class not about?
    This class is not a traditional introduction on how we use a database system and how to write SQL. Instead, this is a systems class about data system design. You will learn how data systems work at their core and how to design new systems for emerging applications and hardware. By the way, if you know how systems work, you also become better at using them!

    Why take this class?
    Data is everywhere. Every year we create even more data. As it stands, every two days we create as much data as we created from the dawn of humanity up to 2003 [Eric Schmidt, Google]. Sciences, businesses, and everyday life are substantially affected. Data systems are in the middle of all this. Data systems are how we store and access data, i.e., they are the backbone of any data-driven application. It is a $100B industry, growing 10% every year [Economist, “Data, data everywhere”].

    At the same time data systems research and the whole industry are going through a major and continuous transition; given that new data-driven scenarios and applications continuously pop up, there is a continuous need to redefine what is a good data system design in such a dynamic environment.

    CS165 exposes students to the core internals of data systems making it possible to understand core trends in system design and to be one of the few who know how to design and evaluate systems. In addition, due to the way the course is taught (focus on interactive problem solving, open topics and the latest research results) this is also a great class for those who want to understand what CS research is all about and how to engage in doing research.

    What are the learning outcomes?
    1. To become familiar with the history and evolution of data systems design over the past 4-5 decades.
    2. To understand the basic tradeoffs in designing and implementing modern data systems and access methods through a step-by-step hands-on experience.
    3. To be able to design a new data system given a data-driven scenario and build a functional prototype.
    4. To be able to understand which data system is a good fit given the needs of an application.
    5. To deepen C programming, debugging, and performance profiling skills.
  2. Class Philosophy
    CS165 has unlimited office hours and unlimited late days for deliverables. It relies on the latest research papers instead of a standard textbook, lectures are based on interaction and discussion instead of just “lecturing”, and many of the quizzes and problem sets are actually open research problems. Most of all, it is fun! The instructor and the TFs are here to help you every day and at all times throughout the semester. You may request as many meetings as you like and as much help as you want. The course is also geared towards engaging creative thinking and problem solving to give students a feeling of how computer science research takes place. Many of our students in the past have successfully engaged in research projects with DASlab and published research papers.

    From your side, you should be aware that this is a demanding class that combines knowledge about system design, algorithm design, and data structures, and includes a non-trivial systems project. You are going to learn state-of-the-art techniques that are being applied in the real world right now. Following the material of the class and performing a successful project requires serious weekly commitment throughout the semester.
  3. Target Audience
    Who can take this class: You probably heard stories that this is a very heavy class and that the project will consume a ton of your time. While this is true, it is also true that you will have a lot of help! So fear not.

    Background: Naturally, the more background you have the smoother your experience in 165 will be. Prior knowledge of C programming and systems programming, as well as a good understanding of computer architecture and in particular the memory hierarchy (cache memories), is very important for this class. Courses providing systems background (like CS50 and in particular CS61 or equivalent) are essential. Good hacking, algorithm design, and data structures skills are also required.

    A self-evaluation guide is posted on the class website to help you understand if you qualify for the course and how much material you might need to cover. The course (lectures, sections, labs, and office hours) is designed so you can acquire the necessary background even if you are missing some essential knowledge at the beginning of the semester. So we have you covered. However, you should be aware that if you did not breeze through the self-evaluation guide you will have to put in more hours to successfully complete the course. Talk to the instructor if you have not taken CS61 or if you do not feel completely comfortable with the self-test but you still think you are ready for CS165.

    Project 0: We provide a Project 0 that is designed to 1) help you get an idea about how fit you are for the class and 2) bootstrap your semester project. Essentially Project 0 is part of a self-evaluation test and consists of an independent data structure design and implementation in C that you can later on use as is for the first milestone of your semester project.

    If you are reading this text a few weeks or even months before the semester starts, you can use the guidelines on the class website to prepare for the course. There you will find specific study material and programming exercises.

    How can I do great in 165? Just utilize all resources provided. Show up in class to participate in interactive sessions. There are also daily office hours and labs; show up as often as possible so we can help with anything you need! When you find yourself stuck with the project either with a design decision or just a bug, it is normal to struggle for a while — it is part of the learning process — but after some time grab your laptop and come by!
  4. Interaction in Every Class
    Interaction in every class: In every class there will be a 30-40 minute session where students will work on problems in groups of 3-4 students. The instructor and the TFs will be walking around in the classroom to participate in the discussions and brainstorm with the students. The problems will be based on material that has been presented in class and these discussions will be used to either solve open problems or to introduce new ideas. The topics in our midterms will resemble the topics and expectations during those interactive sessions and we will also use those sessions to brainstorm about the milestones of the semester project.

  5. Class Logistics
    Lectures: The class meets twice a week: Mondays and Wednesdays 4:00-5:30pm. Room TBA. Class starts at 4:10pm. Classes are designed to be discussion-based and slides will be used mainly to drive discussions as opposed to delivering the material.

    Office hours & Labs: Interaction does not stop when lecture ends. CS165 is designed to maximize interaction as we truly believe this is the best way to learn; we offer daily office hours and labs.

    Starting Week 1, Prof. Idreos will hold office hours every weekday in his office, MD 139. Labs are also offered every day of the week as of Week 2. Labs are offered by the TFs. Check the class website to get the exact time slots for both OH and labs.

    The goal of OH is to provide any kind of feedback on the class material. You should come to OH to ask questions about past classes and quizzes. You should also come to OH to discuss the design of your project and to get feedback on your design documents. You are also welcome to come to OH for any other general question regarding classes, careers in industry/academia, PhDs, etc.

    Labs: Labs can help with discussions similar to those in OH, but the main goal of Labs is to provide hands-on help for the project. So bring your laptop and your questions about specific project parts you need help with. Labs are the place to go when you have a persistent bug, when you need help with a specific tool for the project (e.g., for debugging or performance testing) or to get feedback about the quality of your coding. Finding and fixing bugs can be very difficult and time consuming. As such, we want to make the time you spend in Labs as useful as possible. We want to teach you the process of finding and fixing bugs, not just solve a bug for you. We expect that before coming to labs you have spent several hours “fighting” a bug. Then if you cannot make any more progress on your own, you should come by and by then you will have enough experience to really understand the solution and the process. Do not feel like something is wrong if you find yourself stuck with a bug for a day or two. This is normal and part of the learning process. It will and should happen several times through the semester. Before coming to discuss a bug you should work through several questions on your own; check the class website for exact instructions.

    We will also offer extra weekend office hours and labs as needed.

    Attendance and Simultaneous Enrollment: Based on the philosophy of the course, attendance in lectures, labs and office hours is optional. The best way to learn, though, is through discussion and interaction with the instructor and the TFs. Our classes are not about "lecturing" - they are semi-flipped and all about interaction. We hope to see you there! If you are a college student considering simultaneous enrollment, come to OH to discuss whether this is a good idea in your exact situation.

    Class Recordings: All classes and interactive sessions in class will be recorded and will be available online. So even if you miss a class it will be easy to catch up and you can also use these recordings to review specific material throughout the semester (e.g., to prepare for midterms).

    Sections: Another component of the course is sections. Sections are used to deliver material about the class, i.e., to go more deeply into some of the concepts discussed in class, to do additional quizzes, or to deliver background material that is needed to follow next week’s class or for the project. There will be no actual section meeting. Instead, all sections will be recorded by the TFs and videos will be posted online. The material posted will be tailored to present a step-by-step guide for each topic, to make it easy to follow everything without having to be physically present in an actual section. However, if there are still questions about the material presented in sections, you will be able to ask those questions either during the daily office hours or during the daily labs.

    Research Tuesdays: Throughout the semester, on select Tuesday evenings the instructor and DASlab PhD students and postdocs will discuss research! First, DASlab researchers will present their recent work on data systems research and connect it with the material you are learning in class. Then, you will get the chance to talk with them about their research and open problems, and be exposed to open research opportunities. Snacks and drinks will be provided.

    Discussion Sessions: It is a tradition in CS165 and CS265 to schedule several discussion sessions throughout the semester. Typically we bring food and drinks and have a relaxed time discussing projects, open research topics, careers in industry and academia, grad school and anything else you may have in mind.

    Feedback: We welcome feedback and ideas about the course at any point during the semester. Just come and chat with us during office hours! Tell us how you are keeping up and how we can make it easier for you.

    No Laptop/Phone Policy: CS165 is based on interaction. We want students actively participating in class and interactive sessions, asking and answering questions to maximize learning. In each class, we will bring a printed copy of the slides for every student so you can follow along and keep notes on paper. So you do not need your laptop or phone for taking notes or looking up the slides online. In fact, recent studies show that even if you only use a laptop for note taking, it can have a negative impact on how well you understand the material in class. [The Pen Is Mightier Than the Keyboard: Advantages of Longhand Over Laptop Note Taking. Pam A. Mueller and Daniel M. Oppenheimer. Psychological Science. 2014, Vol. 25(6) 1159–1168] (NOTE: There are cases where having a phone or laptop during class is necessary, such as when you expect an important call or message or when you need the laptop to better follow the slides due to any issues with your eyes or ears. Just let the instructor know and all such cases will be granted permission to use any tools necessary.)

    Guest lectures: Every semester we arrange a few guest lectures by leaders in data system design from industry and academia. Past guest lecturers in our 2014/2015 classes include: Guy Lohman from IBM Research, Erietta Liarou from EPFL Lausanne, Alkis Simitsis and Georgia Koutrika from HP Labs, Nikita Shamgunov from MemSQL, Laura Haas from IBM Research, Nga Tran from Vertica, Jignesh Patel from University of Wisconsin, Magda Balazinska from University of Washington, Johannes Gehrke from Microsoft, Goetz Graefe from Google, Marcin Zukowski from Snowflake, and Justin Levandoski from Microsoft Research.

    You will get the opportunity to both hear a guest lecture and to actively participate in discussions with our guest speakers.

    Required textbook: The class is about state-of-the-art data system design. There is no textbook for that. Thus, we use recent research papers and surveys which will be posted on the course website, which you will have access to through the Harvard network. We also use the following textbook: Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke. This textbook is a great source for all the seminal and traditional topics that we will cover.

    Slides/Notes: The slides used during the course will be available online before each class. We will also print slides for you and bring them to each class. If there is material that we want to communicate to you only after class, this will be available shortly after each class.

    SLIDES ARE NOT NOTES! You should not expect the slides to cover the material in detail. The class is based on discussion and problem solving; the slides are tailored to drive the discussion as opposed to serving the material.

    In each class one or more students will be assigned to take notes. After class these students will populate a collaborative notes document and then all students are welcome to jump in and enrich the notes further. Collaborative note taking and editing will be part of your class participation grade and a great way to review the material and also see how your fellow students perceive it.

    The link to the collaborative notes is available on the top right of the class website.

  6. Online Discussions
    We will use Piazza for online discussions. The link for the class is https://piazza.com/harvard/fall2017/cs165/home for extension, and piazza.com/harvard/fall2017/cs165l/home for college.

    We continuously monitor Piazza and will answer your questions promptly. In past offerings the average response time was on the order of a few minutes. So you basically have access to the teaching staff all day long. You are welcome to post any question that might help you understand the material better or help you with the project. Anonymous posting (to the other students) will be enabled so that students feel more comfortable posting questions.

    BASIC RULES FOR PIAZZA: We only have a few basic rules so we can keep the forum functional and useful for the students as well as manageable for the staff.
    1. We ask that you first search the forum well before posting a question so that we do not have duplicate entries.
    2. Please make sure to stay on top of all staff posts (especially those that are pinned). Anything we post in Piazza we consider “known.”
    3. Do not use Piazza to post code or ask for help with debugging. While it can work in some cases, remote debugging is a pain and takes a lot of time. We have labs every day. Bring your laptop and we will help you on site, or join remotely and we will help you via a shared screen mode.
  7. Grading
    • Class participation and quizzes: 20%
    • Midterm 1: 15%
    • Midterm 2: 15%
    • Project milestones 1-5: 40%
    • Midway Check-in: 10%

    • Bonus: Extra project tasks: up to 5%
    • Bonus: Speed prize: up to 5%
    With the two bonuses this adds up to more than 100% (up to 110%); however, grades are judged on a 100% scale.
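
    As a rough illustration of how the weights above combine, the following C sketch computes a weighted total from per-component scores in the range 0 to 1. The struct and function names are hypothetical, and the sketch assumes that bonus points are simply added on top of the base total; how the staff maps the result onto the 100% scale is not specified here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for per-component scores, each in [0.0, 1.0]. */
struct scores {
    double participation;  /* class participation and quizzes */
    double midterm1;
    double midterm2;
    double milestones;     /* project milestones 1-5 */
    double checkin;        /* midway check-in */
    double bonus_tasks;    /* extra project tasks */
    double speed_prize;
};

/* Weighted total in percentage points, using the weights listed above.
 * Assumption: bonus points are added on top of the base total. */
double weighted_total(const struct scores *s)
{
    assert(s != NULL);
    double base = 20.0 * s->participation
                + 15.0 * s->midterm1
                + 15.0 * s->midterm2
                + 40.0 * s->milestones
                + 10.0 * s->checkin;
    double bonus = 5.0 * s->bonus_tasks + 5.0 * s->speed_prize;
    return base + bonus;
}
```

    With full marks on every component and no bonus, the function returns 100; the two bonuses can add at most 10 more points.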

    PASS FAIL? We do not allow pass fail in CS165. Due to the interactive nature of the course, for every student that takes it, the teaching staff need to invest a lot of time during class, OH and labs. We expect students to fully commit and we are here to help you all the way through every single day.

    AUDITING: We may allow a couple of audit slots depending on the number of students. Contact Stratos.

    Midterms: We hold two midterms. Books and notes may be open during midterms. Laptops, phones or any other electronic devices are not allowed.

    Midterms are not designed to test how much you can remember from the content. Instead, they stress your ability to come up with new solutions, think through all design decisions and side effects of any solution you choose, and communicate your design. The best way to prepare for midterms is to have an excellent handle on all the topics we work on during our interactive in-class sessions. In particular, the midterm questions require the same kind of thinking as the interactive sessions. As a result, following the class and the in-class quizzes will naturally help you practice for the midterms.

    You do not have to study for midterms alone. In addition to office hours and labs, before each midterm the instructor will hold special weekend-long meetings to help you go over the current material and past in-class quizzes. You may stay for as long as you need until you feel you are well prepared.

    Feedback on Progress: We provide feedback continuously. The main thing that you will need feedback on is your semester project. The way to get feedback is to show up to our daily office hours and labs and share your design decisions, code, and test results with the staff. In this way, you will get hands-on help and feedback.

    Feedback on midterms will be provided within one week and you are welcome to come by during office hours to discuss any one of the tasks. We will also cover the midterm topics during class 1-2 weeks after each midterm.
  8. Semester Project
    Project Website: http://daslab.seas.harvard.edu/classes/cs165/project.html
    The class has a running project throughout the semester. The project is about designing and implementing a prototype of a modern main-memory optimized column-store data system. By the end of the project you will have designed, implemented, and evaluated several key elements of a modern data system and you will have experienced several design tradeoffs in the same way they are experienced in industry labs.
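
    To give a feeling for what "column-store" means at the level of memory layout, here is a minimal C sketch that contrasts a row-oriented and a column-oriented representation of the same three-attribute table. All type and function names are hypothetical and chosen for illustration only; this is not the project's starter code or required design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Row-oriented layout: all attributes of one tuple are adjacent in memory. */
struct row {
    int32_t a;
    int32_t b;
    int32_t c;
};

/* Column-oriented layout: each attribute lives in its own dense array. */
struct column_table {
    size_t   num_rows;
    int32_t *a;
    int32_t *b;
    int32_t *c;
};

/* Sum attribute `a` over the row layout. The scan strides over b and c,
 * which share cache lines with a and are fetched although unused. */
int64_t sum_a_rows(const struct row *rows, size_t n)
{
    assert(rows != NULL || n == 0);
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += rows[i].a;
    return sum;
}

/* Same query on the column layout: only the array for `a` is read. */
int64_t sum_a_columns(const struct column_table *t)
{
    assert(t != NULL);
    int64_t sum = 0;
    for (size_t i = 0; i < t->num_rows; i++)
        sum += t->a[i];
    return sum;
}
```

    Both functions return the same answer; the difference is how much memory each one has to move through the cache to compute it, which is the kind of tradeoff the project asks you to reason about.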

    This is a challenging but fun project! We will also point to several open research problems throughout the semester that may be studied on top of the class project and that you may decide to take on as a research project. The project has a total of five milestones with specific expected deliverables. The submission of each deliverable includes two parts: source code and a document detailing the major design decisions and why you made them (design document).

    The five deliverables are: 1) basic storage layer, 2) indexing methods optimized for main-memory, 3) shared scans methods, 4) joins, and 5) updates.
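
    As one possible reading of "shared scans", the C sketch below evaluates several range predicates in a single pass over a column, so each value is loaded once and reused across queries. The `range_query` struct, the half-open range convention, and the function name are assumptions made for this illustration; the exact milestone specification is on the project website.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range predicate: lo <= value < hi. */
struct range_query {
    int32_t lo;
    int32_t hi;
    size_t  count;   /* output: number of qualifying values */
};

/* Evaluate all queries in one pass over the column, so each value is
 * loaded from memory once and tested against every predicate. */
void shared_scan(const int32_t *col, size_t n,
                 struct range_query *queries, size_t num_queries)
{
    assert(col != NULL || n == 0);
    assert(queries != NULL || num_queries == 0);
    for (size_t q = 0; q < num_queries; q++)
        queries[q].count = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t v = col[i];
        for (size_t q = 0; q < num_queries; q++)
            queries[q].count += (v >= queries[q].lo && v < queries[q].hi);
    }
}
```

    The alternative, running one full scan per query, reads the column from memory once per query; sharing the scan trades that for a little extra work per value.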

    The deliverables will be tested using predefined automated unit tests for functionality and, as extra credit, for performance.

    Automated Testing Infrastructure: We have an automated testing infrastructure. We provide a series of tests (using both fixed and randomized data) to automatically test your code for each project milestone. You are able to submit your code daily and get results by automated emails overnight. Tests run against an in-house Linux server at DASlab. You will be able to find the exact specifications of the machine and tests on the project website. Once you pass all the tests in the testing infrastructure your project is complete!

    Leaderboard: We will have a running competition and an anonymous leaderboard so you can continuously compare your system’s performance against the rest of the class. Essentially this means that we provide additional tests that increase the amount of test data so performance differences between projects will be highlighted. You will be able to run these tests daily as well, so you can improve throughout the semester. We will also provide a "benchmark" entry in the leaderboard which represents what we consider good performance for each milestone based on an in-house implementation from the lab.

    We will give you starting code that implements the basic client-server functionality (i.e., communication) so you can focus on building the server side code, that is, the essential core data processing algorithms and data structures of a database system. In addition, whenever applicable we will let you know if there are existing libraries that are OK to use.

    Evaluation: Individual deliverables should pass all provided tests on the testing infrastructure. However, you will not be judged only on how well your system works; it should be clear that you have designed and implemented the whole system, i.e., you should be able to perform changes on-the-fly and explain design details.

    At the end of the semester each student will have a 1-hour session with the instructor and another 1-hour session with the TFs where the student will demonstrate the system, and answer questions about the design and about supporting alternative functionality. [Tip: From past experience we found that frequent participation in office hours, brainstorming sessions and labs implies that the instructor and the TFs are very well aware of your system and your progress which makes the final evaluation a mere formality in these cases.]

    Collaboration Policy: The project is an individual project. The final deliverable should be personal. You must write from scratch all the code of your system and all documentation and reports. Discussing the design and implementation problems with other students is allowed and encouraged! We will do so in the class as well and during office hours, labs and brainstorming sessions.

    All students who have collaborated with other students in whatever capacity should provide a collaboration statement with their final deliverable to properly acknowledge any ideas that were taken from or influenced by discussions with other students.

    Late Days Policy & Schedule: We allow for 1000 late days or until Harvard requires us to upload your grade! The more input you give us, the more we can help you learn. On the project website and in the project description you can find a detailed time-schedule that we propose you follow. With the exception of the midway check-in (which is a hard deadline), the rest is a “suggested schedule” that will allow you to spread the work throughout the semester and to have sufficient time for each milestone based on the complexity and the work required at each phase of the project. This is an involved project that requires commitment through the entire semester and cannot be done in 2-3 weeks at the end. Not submitting the project milestones on time will have no side effects on your grade but at the same time, we will not be able to provide you with any feedback on your progress until we have your design documents and your code.

    Note: Experience says that every year a number of students cannot handle the freedom to self-pace, and end up significantly deviating from the schedule. We will send you frequent reminders but you should know that deviating from the schedule by more than a couple of weeks will most likely mean that you will not be able to finish the whole project by the end of the semester (unless you are already an experienced systems student).

    Midway Check-in: The goal here is to demonstrate that you are making decent progress and mainly to avoid falling behind. By October 10 midnight (hard deadline) each student should 1) deliver a design document that describes the intended design for the first two milestones and a description of the rest of the milestones (5%) and 2) have implemented a project that passes at least the first three tests of the first milestone in the automated testing infrastructure (5%). A template of the expected design document will be provided early in the semester.

    Speed Prize: The three fastest projects (top 3 in the leaderboard by the end of the testing period) will gain extra points (5%). The competition will terminate the last day before we need to upload grades so you will have plenty of time to improve (until mid December).

    Extra Points for Bonus Tasks: We will regularly assign extra tasks or you can come up with your own extra tasks for the various components of the project. With these extra tasks you gain extra points (up to 5%).

    What is a Successful Project? A successful project passes all the predefined tests we provide on the testing infrastructure and the student successfully passes the final face-to-face evaluation. A successful final evaluation is one where the student is able: (1) to fully explain every detail of the design, and (2) to propose efficient designs for new functionality on the spot. A month before the final evaluation you will find on the class website a step-by-step guide that will help you prepare for the evaluation meeting.
  9. Extension School

    Lectures will be broadcast live Mondays/Wednesdays 4-5:30pm. Lectures will also be available on demand within 24 hours after each class. Students will be able to watch the live or recorded broadcast through their browser using the Matterhorn player. The link to the broadcasts for CS165 will be available through the Canvas website for this class and will also be posted on the class website before the first lecture.

    Extension school students will be able to participate live in classes, office hours and labs via web-conference tools (we will use Zoom). The course staff will be online with Zoom during each session and you will be able to actively interact with the staff. Other than standard chatting and talking features, Zoom also offers screen sharing, which can be used when you need help with specific issues such as debugging.

    Capturing Discussions: Given that a big portion of the class is based on interaction, extension school in cooperation with the class staff is working to set up a system with several microphones across the classroom so we can accurately and clearly capture brainstorming discussions and comments during class time. Microphones will “follow” the instructor.

    Grading: Even though we encourage extension school students to utilize the opportunity to interact with the staff and participate in class live, we know that for practical reasons this will not be possible for all remote students. For this reason, for extension school students there will be no “class participation” grade, and that portion of the grade will be distributed equally between the project (60%) and the midterms (40%).

    Midterms: Extension School will contact students directly regarding administrative preparations and options for midterms. Midterms are proctored and we also allow the new option to take the midterm directly through Canvas with a camera. Local extension school students should come to campus and take the midterm on midterm day (we usually book a slot at ~6pm so it is easy to attend after work).

    Piazza: To participate in Piazza you need a Harvard email address. If you do not have one you can create one here: http://www.extension.harvard.edu/resources-policies/resources/computer-e-mail-services

    Office Hours and Labs: If none of the existing slots for office hours and labs work for you (e.g., due to time differences), we will add slots; just let us know.

    Starting Date: Note that the extension school usually shows the class starting date as one day after the actual starting date (which is Wednesday August 31, at 4pm). In fact, this is when the first video will be available. However, extension school students will still be able to stream the first class live on August 31 and participate live as normal.

    Accessibility: Harvard and the Extension School are committed to providing an accessible academic community. The Disability Services Office offers a variety of accommodations and services to students with documented disabilities. Please visit www.extension.harvard.edu/resources-policies/resources/disability-services-accessibility for more information and do not hesitate to contact Prof. Idreos directly, by email, with any questions or concerns you might have.

  10. Joining Class Remotely

    Interacting using Zoom: We will be using Zoom for in-class communication [http://zoom.us/]. The class will be recorded through a separate service of Harvard Extension School, so Zoom will not be used for recordings.

    Install and try Zoom: Please navigate to http://zoom.us/ and create an account so we can see your name during class and OH discussion. Download the client (for your computer and/or your tablet and phone).

    Equipment: For the discussion you will need a USB headset with a microphone in order to ensure that the audio will be clear when you speak and that there won’t be feedback that distracts everyone in the class. Make sure that you plug in your headset before you log into Zoom or your audio may not work. It is a good practice to test your audio before class begins. If you forget to plug in your mic before class, restart Zoom. If all else fails, you can also join by telephone. (Go to Joining a Session, which you see after you log on, and then click on the Join by Phone option.) If you have any technical issues during class, please mute yourself and immediately call the HELP Desk at 617-998-8571. They will be there to help you with any technical problem so that you can rejoin class as quickly as possible.

    Using Zoom in class and OH: When class starts all remote students will be muted, but you can un-mute and/or raise your hand when you have questions. Keep your camera on, especially if you are interacting with the class. Always keep your camera on during OH.

    Speaker View & Gallery View: Zoom supports speaker view & gallery view: Speaker view highlights one person with four in miniature; gallery view allows the whole class to be visible. Speaker view is for talking/lecturing; gallery view is for whole-class discussion and OH. In some cases, we may share material from the computer; the video will still be available, and the view will be focused on the shared screen.

    Interaction During Class and OH: To get our attention during class: (1) use chat, or (2) click on the hand icon in Participants. In OH you can typically just speak up and jump in directly when you have questions.

    Meeting Room: The meeting URL will be https://zoom.us/j/9063672373 for sections and https://zoom.us/j/462802443 for class. These links will only be active during the time of office hours and class respectively.

    Using Zoom Guide

  11. Plagiarism
  12. How to Read Research Papers

Schedule

*Read: Read carefully the whole thing
*Browse: Read carefully the introduction and just go quickly over the rest of the text

In our first class, we introduce the concept of a data system, which is responsible for storing data and providing declarative access to it. We further discuss the usefulness of data systems and their ubiquitous operation in modern data-driven science, business, and everyday life. Finally, we discuss in detail the goal of CS165 and the course logistics.

Slides

In our second class we go deeper into the discussion of what a data system is and what data system design means. We begin with a high-level walk-through of various data system designs and classes of data systems, such as relational systems, NoSQL systems, Map-Reduce systems, distributed systems and more. We distill the core principles that drive data system design and we discuss how these change over the years as application needs evolve.

Slides

Readings:

Textbook: chapters 1, 3 (-3.5), 5 (-5.8,-5.9) -- (Intro + relational model + SQL) [Read]

Take a look at The Fourth Paradigm. [Browse]


The following two readings include material up to Class 8. We expect you to work through them gradually until Class 8.

Architecture of a Database System (Sections 1,2,3,4)
by J. Hellerstein, M. Stonebraker and J. Hamilton [Read]

The Design and Implementation of Modern Column-store Database Systems
D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden [Read]

Friday, 9/8

1-2pm, MD121. About the guest.
This lecture is part of the IACS invited lectures.
A video recording will also be available a few days later and the link will appear here.

In this class we discuss the importance of designing data models and query languages to generate the right abstractions. In particular, we will give examples using the relational model, the RDF model and languages such as SQL.

Slides

Readings:


Textbook: chapters 1, 3 (-3.5), 5 (-5.8,-5.9) -- (Intro + relational model + SQL) [Read]

Take a look at The Fourth Paradigm. [Browse]

Architecture of a Database System (Sections 1,2,3,4)
by J. Hellerstein, M. Stonebraker and J. Hamilton [Read]

The Design and Implementation of Modern Column-store Database Systems
D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden [Read]

In this class we go deeper into what this class is really about: What do the internals of a data system look like, and what is the lifecycle of a query? A query is submitted to a query parser and is then optimized by a query optimizer. The optimization process dictates to the query execution engine the query plan, which is the combination of algorithms that computes the answer to the query by properly utilizing the data structures that hold data and the available hardware. The query plan implements the way to access the data from their physical location through the available access methods of the storage layer. Data that are accessed are temporarily stored in the main-memory bufferpool to avoid repeated accesses to slower storage levels.

Slides

Readings:


Architecture of a Database System (Sections 1,2,3,4)
by J. Hellerstein, M. Stonebraker and J. Hamilton [Read]


The Design and Implementation of Modern Column-store Database Systems
D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden [Read]

It all starts with how data is physically stored! In this class we discuss the importance of the physical storage of data and its side-effects on data system design and performance. Row-major storage stores all attributes of a relation contiguously, column-major storage stores each attribute separately, while hybrid systems store groups of attributes (columns). This seemingly simple change in physical storage has significant implications for performance, query engine design, and system design. We discuss why column-major storage came into play in the past few years with all major vendors adopting column-store technology, as well as the relevant hardware trends that play a very significant role. We conclude with a brief introduction to the memory hierarchy and how it affects the design of access methods and physical storage.
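To make the layout difference concrete, here is a minimal C sketch. The three-attribute relation and all type and function names are illustrative assumptions, not part of the course code base. It shows the same aggregate over one attribute against a row-major and a column-major layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Row-major: all attributes of one tuple are stored contiguously. */
typedef struct {
    int32_t a;
    int32_t b;
    int32_t c;
} row_t;

/* Column-major: each attribute lives in its own array of length n. */
typedef struct {
    int32_t *a;
    int32_t *b;
    int32_t *c;
    size_t   n;
} column_table_t;

/* Summing b over rows also pulls a and c through the cache. */
int64_t sum_b_rows(const row_t *rows, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += rows[i].b;
    }
    return sum;
}

/* Summing b over a column touches only the bytes of b. */
int64_t sum_b_columns(const column_table_t *t) {
    int64_t sum = 0;
    for (size_t i = 0; i < t->n; i++) {
        sum += t->b[i];
    }
    return sum;
}
```

Both functions return the same answer; they differ only in how many bytes must move through the memory hierarchy to compute it.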

Slides

Readings:


The Design and Implementation of Modern Column-store Database Systems (Sections: all -4.6 and 4.8)
D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden [Read]


IEEE Data Engineering Bulletin, 35(1), March 2012
Special Issue on Column-stores (9 short overview papers) [Read]

Database Architecture Optimized for the New Bottleneck: Memory Access
Peter Boncz, Stefan Manegold, Martin Kersten
In Proc. of the Very Large Databases Conference (VLDB), 1999 [Read]

MonetDB/X100: Hyper-Pipelining Query Execution
Peter A. Boncz, Marcin Zukowski, Niels Nes
In Proc. of the Inter. Conference on Innovative Data Systems Research (CIDR), 2005 [Browse]

Materialization Strategies in a Column-Oriented DBMS
Daniel Abadi, Daniel Myers, David DeWitt, Samuel Madden
In Proc. of the Inter. Conference on Data Engineering (ICDE), 2007 [Browse]

Self-organizing tuple reconstruction in column-stores
Stratos Idreos, Martin Kersten, Stefan Manegold
In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2009 [Browse]

The most fundamental operation in a column-store or hybrid system is tuple reconstruction. This is the action of stitching one or more columns or column-groups back together because a query needs access to all those attributes. Accessing multiple attributes translates to accessing multiple files, at different locations of the storage (memory or disk). How tuple reconstruction is performed is a crucial part of modern system design. In this class we discuss in detail these design options and how data flows through column-store query plans to minimize cache misses and data movement while performing tuple reconstruction, selections, projections and aggregations.

In addition, we discuss how data flows through plans of modern systems discussing all three alternatives: row-at-a-time, column-at-a-time and vector-at-a-time. We focus on hardware tailored optimizations by discussing vectorized execution in detail. Traditional execution processes one row at a time. This involves long code paths and metadata manipulation in the inner loop of execution. On the other hand, vectorized execution streamlines operations by processing a block of several rows at a time. Each column is stored and treated as a vector (that is, an array of a primitive data type). This approach leads to better compiled code, fewer function calls and conditional branches, leading in general to better processor efficiency.
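As a rough illustration of vector-at-a-time execution, the sketch below evaluates a query of the form SELECT SUM(b) WHERE a < bound in blocks of rows. The vector size of 1024, the two primitives, and all names are assumptions made for this example only.

```c
#include <stddef.h>
#include <stdint.h>

#define VECTOR_SIZE 1024  /* assumed block size for this sketch */

/* Primitive 1: store in sel the positions i in [0, n) where col[i] < bound.
   Returns the number of qualifying positions. */
static size_t select_less_than(const int32_t *col, size_t n, int32_t bound,
                               uint32_t *sel) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (col[i] < bound) {
            sel[k++] = (uint32_t)i;
        }
    }
    return k;
}

/* Primitive 2: sum the values of col at the selected positions. */
static int64_t sum_selected(const int32_t *col, const uint32_t *sel, size_t k) {
    int64_t sum = 0;
    for (size_t j = 0; j < k; j++) {
        sum += col[sel[j]];
    }
    return sum;
}

/* SUM(b) WHERE a < bound, processed one vector of rows at a time. */
int64_t sum_b_where_a_less_than(const int32_t *a, const int32_t *b, size_t n,
                                int32_t bound) {
    uint32_t sel[VECTOR_SIZE];
    int64_t total = 0;
    for (size_t start = 0; start < n; start += VECTOR_SIZE) {
        size_t len = (n - start < VECTOR_SIZE) ? (n - start) : VECTOR_SIZE;
        size_t k = select_less_than(a + start, len, bound, sel);
        total += sum_selected(b + start, sel, k);
    }
    return total;
}
```

Each primitive is a tight loop over an array of a primitive type, and the per-operator overhead is paid once per vector instead of once per row.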

Slides

Readings:

DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing
Marcin Zukowski, Niels Nes, Peter A. Boncz
International Workshop on Data Management on New Hardware (DaMoN) 2008 [Read]


Column-stores vs. row-stores: how different are they really?
D. Abadi, S. Madden, and N. Hachem
In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2008 [Browse]

Positional update handling in column stores
Sándor Héman, Marcin Zukowski, Niels J. Nes, Lefteris Sidirourgos, Peter A. Boncz
In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2010 [Browse]

Updating a cracked database
Stratos Idreos, Martin Kersten, Stefan Manegold
In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2007 [Browse]

Integrating compression and execution in column-oriented database systems
Daniel J. Abadi, Samuel Madden, Miguel Ferreira
In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2006 [Browse]

In this class, we discuss more details about basic column-store design. We first focus on updates, compression and join query plans. We discuss lazy updates, compressing data with dictionary compression and working over compressed data as well as late tuple reconstruction join plans with binary joins. We also introduce the concept of column-store projections and partial adaptive projections as well as optimizations such as on-the-fly switching from n-ary execution to columnar execution. At the end of this class we will have covered all the basic properties of modern column-store systems design.
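A small sketch of dictionary compression and of working directly over compressed data follows. It assumes a string column with at most 256 distinct values so that each code fits in one byte; the linear-scan dictionary and all names are simplifications chosen for brevity, not a recommended design.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_DICT 256  /* assumed cap so that codes fit in a uint8_t */

typedef struct {
    const char *values[MAX_DICT];  /* code -> original string */
    size_t      size;
} dictionary_t;

/* Return the code of value, or -1 if it is not in the dictionary. */
int dict_lookup(const dictionary_t *d, const char *value) {
    for (size_t i = 0; i < d->size; i++) {
        if (strcmp(d->values[i], value) == 0) {
            return (int)i;
        }
    }
    return -1;
}

/* Return the code of value, adding it if it is new; -1 if the dictionary is full. */
int dict_encode(dictionary_t *d, const char *value) {
    int code = dict_lookup(d, value);
    if (code >= 0) {
        return code;
    }
    if (d->size == MAX_DICT) {
        return -1;
    }
    d->values[d->size] = value;
    return (int)d->size++;
}

/* Count rows equal to value by comparing one-byte codes, never strings. */
size_t count_equal(const dictionary_t *d, const uint8_t *codes, size_t n,
                   const char *value) {
    int code = dict_lookup(d, value);
    if (code < 0) {
        return 0;  /* value never appears in the column */
    }
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += (codes[i] == (uint8_t)code);
    }
    return count;
}
```

The predicate is translated to a code once, after which the scan compares small fixed-width integers instead of variable-length strings.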

Slides

Readings:

Cache-Conscious Radix Decluster Projections
S. Manegold, P. Boncz, N. Nes, and M. Kersten
Very Large Databases Conference, 2004 [Browse]


H2O: A Hands-free Adaptive Store
Ioannis Alagiannis, Stratos Idreos, and Anastassia Ailamaki
In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2014 [Browse]


BitWeaving: fast scans for main memory data processing
Yinan Li, Jignesh M. Patel
In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2013 [Browse]

In this class we introduce the concept of indexing. Indexes are data structures that allow us to locate specific values of a column fast and are particularly useful if we are interested in a small fraction of the overall data. We initially focus on the simplest form of indexing, i.e., just keeping data sorted. We discuss how we can use sorted columns as indexes in column-stores and the side-effects in terms of tuple reconstruction costs. We then introduce column-store projections as well as the optimization and tuning problem of picking the right projections. We continue by discussing how we can efficiently sort data even if the data does not fit in memory.
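The simplest index described above, a sorted column, can be sketched in C as a binary search that turns a range predicate into a contiguous range of positions. The function names are assumptions for this example, and the range select assumes low <= high.

```c
#include <stddef.h>
#include <stdint.h>

/* First position whose value is >= key, or n if there is none. */
size_t lower_bound(const int32_t *sorted, size_t n, int32_t key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] < key) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}

/* After the call, positions [*first, *last) hold exactly the values v
   with low <= v < high. Assumes low <= high. */
void range_select_sorted(const int32_t *sorted, size_t n,
                         int32_t low, int32_t high,
                         size_t *first, size_t *last) {
    *first = lower_bound(sorted, n, low);
    *last  = lower_bound(sorted, n, high);
}
```

The select costs two logarithmic searches instead of a full scan. The positions it returns refer to the sorted order, which is why tuple reconstruction against columns stored in a different order becomes more expensive.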

Slides

Readings:
Textbook: Chapter 13 [Read]

Self-organizing Tuple Reconstruction in Column-Stores
Stratos Idreos, Martin Kersten, Stefan Manegold
In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2009 [Browse]

At this point we start diving into more details about access methods. We begin with scans, the primary access method in modern systems. A scan is a sequential access of an entire relation or an entire column. We discuss how modern data systems perform ultra-fast scans by exploiting an array of techniques. In this class we will focus on tight for-loops, cache-conscious designs, and utilizing modern multi-cores. Modern CPUs come with extensive capabilities for parallelization. Thus, one of the major challenges in data system design is keeping CPUs 100% busy. We discuss how data system tasks and operators can utilize modern processors and how algorithms and data structures need to be adapted to achieve this. We discuss multi-threaded execution in a modern symmetric multiprocessor system. We also introduce the non-uniform memory access (NUMA) challenge in a modern multi-socket machine. Finally, we discuss how modern data systems share scans between multiple queries in order to avoid redundant data accesses.

Slides

Readings:

Textbook: Chapter 9 [Read]

Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS.
by Marcin Zukowski, Sándor Héman, Niels Nes, and Peter A. Boncz
Proceedings of the International Conference on Very Large Databases (VLDB), 2007 [Read]

Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age.
by Viktor Leis, Peter A. Boncz, Alfons Kemper, and Thomas Neumann
Proceedings of the ACM SIGMOD International Conference on Management of Data, 2014 [Browse]

In this class we continue our discussion about how modern data systems can scan data fast. In particular, we look into compression and working over compressed data to minimize data movement throughout the memory hierarchy, and into how we can fully utilize modern CPU capabilities for parallelization, e.g., SIMD (Single Instruction Multiple Data) instructions. We also discuss why if statements (branches) are bad for performance and ways to minimize them in system design.
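One common way to remove an if statement from a selection loop is predication. The sketch below contrasts a branching and a branch-free select over an integer column; the function names are assumptions for this example, and out must have room for n positions in both versions.

```c
#include <stddef.h>
#include <stdint.h>

/* Branching select: the CPU must predict the outcome of every comparison. */
size_t select_branching(const int32_t *col, size_t n, int32_t bound,
                        uint32_t *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (col[i] < bound) {
            out[k++] = (uint32_t)i;
        }
    }
    return k;
}

/* Predicated select: always write the position, then advance the output
   cursor by 0 or 1 depending on the comparison result. */
size_t select_predicated(const int32_t *col, size_t n, int32_t bound,
                         uint32_t *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        out[k] = (uint32_t)i;
        k += (col[i] < bound);
    }
    return k;
}
```

The predicated version does more writes but has no data-dependent branch in the loop body, so its cost does not depend on how predictable the data is. Which version wins depends on selectivity and hardware, so it is worth measuring both.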

Slides

Readings:

Selection conditions in main memory.
by Kenneth A. Ross
ACM Transactions on Database Systems (TODS), 2004 [Read]


Implementing database operations using SIMD instructions.
by Jingren Zhou, and Kenneth A. Ross
Proceedings of the ACM SIGMOD International Conference on Management of Data, 2002 [Browse]

Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture.
by Jatin Chhugani et al.
Proceedings of the International Conference on Very Large Databases (VLDB), 2008 [Browse]

In this class we continue the discussion about indexing by introducing B-tree indexes, a tree structure with a wide fanout in order to minimize random accesses while searching for a value or a range of values. We discuss in detail the design of B-trees, how to build, search and update such indexes. We also differentiate between clustered indexes and unclustered indexes in database systems. Their fundamental difference is that clustered indexes are built on top of sorted data on the primary physical layer while unclustered indexes include a new sorted version of the indexed columns.

Slides

Readings:
Textbook: Chapter 8, 9 [Browse], 10 [Read]

Modern B-Tree Techniques (Sections: 1,2,3,5)
by Goetz Graefe
Foundations and Trends in Databases, 2011 [Read]

Making B+trees Cache Conscious in Main Memory
Jun Rao and Ken Ross
ACM SIGMOD International Conference on Management of Data, 2000 [Read]

A big part of the design of a modern data system is optimizing access methods and code paths for the memory hierarchy. In this class we discuss the design of cache conscious B+ trees, focusing on minimizing cache misses by packing the tree structure in a cache-aware way. In addition, we introduce lightweight indexing methods that allow good data skipping performance without heavy index structures.

Slides

Readings:

Modern B-Tree Techniques
by Goetz Graefe
Foundations and Trends in Databases, 2011


We discuss common problems that arise with the use of indexing in modern data systems. A typical such problem is that using an index can actually cause performance degradation in some cases which is hard to predict a priori (unless we have up to date statistics). We introduce Smooth Scan, discussing how we can combine scans and indexes into a single access method and always get good performance even in the absence of current statistics.

Slides

Readings:

Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe?
by Michael S. Kester, Manos Athanassoulis and Stratos Idreos
In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2017 [Read]


Smooth Scan: Statistics Oblivious Access Paths
by Renata Borovica, Stratos Idreos, Anastasia Ailamaki, Marcin Zukowski, and Campbell Fraser
In Proc. of the Inter. Conference on Data Engineering (ICDE), 2014 [Browse]


Efficient Mid-Query Re-Optimization of Sub-Optimal Query Execution Plans
Navin Kabra, David J. DeWitt
In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 1998 [Extra]


Ryan is a senior researcher with LogicBlox. He got his PhD from CMU in 2010 and he was awarded the ACM SIGMOD Jim Gray Dissertation award for his work on data systems architectures for modern hardware.

Slides

Kate Vredenburgh will give a guest lecture on ethics and system/algorithm design. Data systems, algorithms and data analytics are increasingly part of our everyday life. This happens both directly, e.g., through the data-driven applications we use, and indirectly, e.g., through decisions that are made about our lives via data-driven systems. There are numerous emerging ethics issues across the whole stack of system design and usage. In this lecture, we will discuss such issues and highlight some of the ways to think about these emerging problems and potentially resolve them.

Slides

During exploratory statistical analysis, data scientists repeatedly compute statistics on data sets to infer knowledge. Moreover, statistics form the building blocks of core machine learning classification and filtering algorithms. Modern data systems, software libraries, and domain-specific tools provide support to compute statistics but lack a cohesive framework for storing, organizing, and reusing them. This creates a significant problem for exploratory statistical analysis as data grows: Despite existing overlap in exploratory workloads (which are repetitive in nature), statistics are always computed from scratch. This leads to repeated data movement and recomputation, hindering interactive data exploration.

We address this challenge in Data Canopy, where descriptive and dependence statistics are synthesized from a library of basic aggregates. These basic aggregates are stored within an in-memory data structure, and are reused for overlapping data parts and for various statistical measures. What this means for exploratory statistical analysis is that repeated requests to compute different statistics do not trigger a full pass over the data. We discuss in detail the basic design elements in Data Canopy, which address multiple challenges: (1) How to decompose statistics into basic aggregates for maximal reuse? (2) How to represent, store, maintain, and access these basic aggregates? (3) Under different scenarios, which basic aggregates to maintain? (4) How to tune Data Canopy in a hardware conscious way for maximum performance and how to maintain good performance as data grows and memory pressure increases?

We demonstrate experimentally that Data Canopy results in an average speed-up of at least 10× after just 100 exploratory queries when compared with state-of-the-art systems used for exploratory statistical analysis.
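The idea of synthesizing statistics from stored basic aggregates can be illustrated with a toy C sketch. This is not the Data Canopy implementation: the choice of aggregates (count, sum, sum of squares), the chunking, and all names are assumptions made for this example.

```c
#include <stddef.h>

/* Basic aggregates kept per chunk of a column. */
typedef struct {
    double sum;
    double sum_sq;
    size_t count;
} basic_aggregates_t;

/* One pass over a chunk of the base data. */
basic_aggregates_t compute_chunk(const double *values, size_t n) {
    basic_aggregates_t agg = {0.0, 0.0, 0};
    for (size_t i = 0; i < n; i++) {
        agg.sum += values[i];
        agg.sum_sq += values[i] * values[i];
    }
    agg.count = n;
    return agg;
}

/* Combine the aggregates of several chunks without touching the base data. */
basic_aggregates_t combine(const basic_aggregates_t *chunks, size_t num_chunks) {
    basic_aggregates_t total = {0.0, 0.0, 0};
    for (size_t i = 0; i < num_chunks; i++) {
        total.sum += chunks[i].sum;
        total.sum_sq += chunks[i].sum_sq;
        total.count += chunks[i].count;
    }
    return total;
}

/* Mean from aggregates; assumes count > 0. */
double mean_from(const basic_aggregates_t *agg) {
    return agg->sum / (double)agg->count;
}

/* Population variance as E[x^2] - (E[x])^2; assumes count > 0. */
double variance_from(const basic_aggregates_t *agg) {
    double m = mean_from(agg);
    return agg->sum_sq / (double)agg->count - m * m;
}
```

Once the per-chunk aggregates exist, a later request for the mean or variance over any range of whole chunks is answered from the aggregates alone. Note that this variance formula is numerically naive and can lose precision when the mean is large relative to the spread.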

Readings
Abdul Wasay, Xinding Wei, Niv Dayan, and Stratos Idreos. Data Canopy: Accelerating Exploratory Statistical Analysis. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17). ACM, New York, NY, USA, 557-572.

Slides

Richard Hipp is the founder and lead programmer of SQLite.

Slides

After we have covered access methods we move on to query processing and additional relational operators. In this class, we introduce joins, one of the most fundamental operators in modern systems. Joining two relations based on two join attributes creates the subset of the Cartesian product of the two, where the join attribute in the two relations is equal. We introduce the basic join algorithms: nested-loop join, block nested-loop join, and their optimization zig-zag join. We further discuss sort-merge join and index-joins.
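The two simplest algorithms above can be sketched in C for an equi-join on two integer columns that outputs pairs of matching positions. The block size, the output format, and all names are assumptions for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 4096  /* assumed number of left values processed per block */

typedef struct {
    uint32_t left_pos;
    uint32_t right_pos;
} join_pair_t;

/* Nested-loop join: compare every left value with every right value.
   Returns the number of matches; writes at most capacity pairs to out. */
size_t nested_loop_join(const int32_t *left, size_t n_left,
                        const int32_t *right, size_t n_right,
                        join_pair_t *out, size_t capacity) {
    size_t k = 0;
    for (size_t i = 0; i < n_left; i++) {
        for (size_t j = 0; j < n_right; j++) {
            if (left[i] == right[j]) {
                if (k < capacity) {
                    out[k].left_pos = (uint32_t)i;
                    out[k].right_pos = (uint32_t)j;
                }
                k++;
            }
        }
    }
    return k;
}

/* Block nested-loop join: scan the right input once per block of left
   values rather than once per left value. */
size_t block_nested_loop_join(const int32_t *left, size_t n_left,
                              const int32_t *right, size_t n_right,
                              join_pair_t *out, size_t capacity) {
    size_t k = 0;
    for (size_t b = 0; b < n_left; b += BLOCK) {
        size_t end = (n_left - b < BLOCK) ? n_left : b + BLOCK;
        for (size_t j = 0; j < n_right; j++) {
            for (size_t i = b; i < end; i++) {
                if (left[i] == right[j]) {
                    if (k < capacity) {
                        out[k].left_pos = (uint32_t)i;
                        out[k].right_pos = (uint32_t)j;
                    }
                    k++;
                }
            }
        }
    }
    return k;
}
```

Both variants perform the same number of comparisons and produce the same set of pairs, possibly in a different order. The block variant reduces how many times the right input has to be brought in from slower levels of the memory hierarchy.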

Slides

Readings:


Textbook: Chapters 4 & 14

Having seen the basics of joins, we now move on to the join algorithm that is the basis of how joins are most commonly performed in modern data systems. We discuss hash join and some of its major variations in detail. We first introduce the concept of a hash table, a data structure that supports search in expected constant time. We discuss in detail how we can perform efficient joins that minimize cache misses in main memory as well as how to fully utilize modern multi-cores during a join.
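A minimal single-threaded hash join can be sketched in C as a build phase followed by a probe phase. The open-addressing table with linear probing, the multiplicative hash function, and all names are assumptions chosen to keep the example short; they are one of many possible designs.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int32_t  key;
    uint32_t pos;
    uint8_t  used;
} slot_t;

typedef struct {
    uint32_t build_pos;
    uint32_t probe_pos;
} join_pair_t;

static size_t next_pow2(size_t x) {
    size_t p = 1;
    while (p < x) {
        p <<= 1;
    }
    return p;
}

/* Multiplicative hash reduced to the table size (mask = size - 1). */
static size_t hash_key(int32_t key, size_t mask) {
    return ((uint32_t)key * 2654435761u) & mask;
}

/* Returns the number of matches, or (size_t)-1 if allocation fails.
   Writes at most capacity pairs to out. */
size_t hash_join(const int32_t *build, size_t n_build,
                 const int32_t *probe, size_t n_probe,
                 join_pair_t *out, size_t capacity) {
    /* More than twice as many slots as keys, so an empty slot always exists. */
    size_t table_size = next_pow2(2 * n_build + 1);
    size_t mask = table_size - 1;
    slot_t *table = calloc(table_size, sizeof *table);
    if (table == NULL) {
        return (size_t)-1;
    }

    /* Build phase: insert every key of the (ideally smaller) input. */
    for (size_t i = 0; i < n_build; i++) {
        size_t h = hash_key(build[i], mask);
        while (table[h].used) {
            h = (h + 1) & mask;
        }
        table[h].key = build[i];
        table[h].pos = (uint32_t)i;
        table[h].used = 1;
    }

    /* Probe phase: walk the probe sequence until an empty slot is reached,
       emitting every slot with an equal key (this handles duplicates). */
    size_t k = 0;
    for (size_t j = 0; j < n_probe; j++) {
        size_t h = hash_key(probe[j], mask);
        while (table[h].used) {
            if (table[h].key == probe[j]) {
                if (k < capacity) {
                    out[k].build_pos = table[h].pos;
                    out[k].probe_pos = (uint32_t)j;
                }
                k++;
            }
            h = (h + 1) & mask;
        }
    }

    free(table);
    return k;
}
```

Each input is read once, compared with the quadratic number of comparisons in a nested-loop join. When the hash table does not fit in cache, every probe can become a cache miss, which motivates the cache-conscious variations discussed in class.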

Slides

Readings

In this class, we specialize our discussion on hashing by discussing dynamic and extendible hashing, as well as linear and two-phase hashing tailored for modern hardware. We also look at the join operator in the context of a query plan. We discuss how the join operator is fed with data, how it feeds data to the next operator in a query plan, and what the side-effects are in terms of tuple reconstruction depending on the join algorithm. In this context we also discuss how a join operator can be pipelined.

Slides

Readings
  • Database Management Systems (Textbook): Chapter 11, by Raghu Ramakrishnan and Johannes Gehrke [Read]
  • Cache-Conscious Radix Decluster Projections, by S. Manegold, P. Boncz, N. Nes, and M. Kersten. Proceedings of the International Conference on Very Large Databases (VLDB), 2004

For college students and graduate students the midterm will take place during the normal class time and in the normal room (4:00pm, Pierce 301).

The midterm starts at 4pm. You will have two and a half hours, until 6:30pm, to finish it. You may use any notes and books during midterms. No laptops, tablets or phones are allowed.

The weekend before the midterm the teaching staff will hold additional office hours to help with preparation.

For local extension school students the midterm will take place in late afternoon of the same day. Exact slot and room: (6:30pm, Northwest B101).

Remote extension school students need to arrange a proctor through extension school and take the midterm within 24 hours (from 4:00pm).

One of the most common operations in modern systems is to update data, that is, to add new data or alter existing data. Think of posting an update on Facebook, a new post on Twitter or even paying for your coffee with your debit card, which includes updates to your bank account. In this class we discuss how to update a database efficiently. We show how a bufferpool can help not only with read performance but also to absorb writes and achieve good write performance. We also introduce write-ahead logging, which helps us achieve consistency and durability of the data. Finally, we discuss various hardware performance tradeoffs when the storage layer of the system consists of technologies other than HDD (e.g., SSD), and the tradeoffs of updating and inserting data in a tree index structure or a column.

Slides

Readings
  • Database Management Systems (Textbook): Chapters 16, 17 and 18, by Raghu Ramakrishnan and Johannes Gehrke
  • Positional Update Handling in Column Stores, by Sándor Héman, Marcin Zukowski, Niels J. Nes, Lefteris Sidirourgos, Peter A. Boncz. In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2010
  • Updating a Cracked Database, by Stratos Idreos, Martin Kersten, Stefan Manegold. In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2007

In this class we continue the discussion on updates and we focus on transactions. A transaction is a group of operations that reads and/or updates data in the database in an atomic and isolated fashion, while we maintain consistency and durability of the underlying database. For example, think of thousands of users world-wide accessing their bank accounts, removing and adding various amounts. We discuss the basic mechanisms to achieve Atomicity, Consistency, Isolation and Durability. In addition, we discuss locking schemes that protect objects of the database from being updated concurrently, which would create an inconsistent state of the data. The locking scheme includes a hierarchy of locks, referring to different portions of the data (lock a relation, a page, a tuple). Finally, we clarify the distinction between a logical database lock (a conceptual construct to protect an object from unwanted updates or reads) and a latch (a physical structure that protects a memory location from being updated or read), and discuss how they are both utilized in data system design.

Slides

Readings

This is an extra research class. It will take place towards the end of the semester in a three-hour slot. We will introduce the concept of adaptive indexing. We present a design of a database kernel where indexes do not have to be created manually; instead they are created automatically and on-the-fly as new queries arrive. At any point in time the system contains just enough indexes to satisfy the hot workload. As more queries arrive, more indexes are created automatically. We discuss how to design such a system and the side-effects of adaptive indexing in core design areas, that is, tuple reconstruction and updates.
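To give a flavor of how an index can emerge as a side-effect of queries, the sketch below shows a single "crack" step in C: a select with a bound reorders the column in place so that the qualifying values become contiguous. This is a simplified illustration under assumed names; a full design also has to remember the split positions so that later queries can reuse them, and has to deal with tuple reconstruction and updates.

```c
#include <stddef.h>
#include <stdint.h>

/* Reorder col[lo, hi) in place so that all values < pivot come before all
   values >= pivot. Returns the split position p such that
   col[lo, p) < pivot and col[p, hi) >= pivot. */
size_t crack_in_two(int32_t *col, size_t lo, size_t hi, int32_t pivot) {
    size_t i = lo;
    size_t j = hi;
    while (i < j) {
        if (col[i] < pivot) {
            i++;
        } else {
            j--;
            int32_t tmp = col[i];
            col[i] = col[j];
            col[j] = tmp;
        }
    }
    return i;
}
```

After a query with predicate value < pivot has cracked the column, a later query with a different bound only needs to look at, and further crack, one of the two pieces.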

In this section we will cover the basic knowledge of C required for this project. We will discuss the main features of the language and will focus on pointer arithmetic, a key challenge when programming in C.
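As a small taste of pointer arithmetic, the following C sketch walks an array with a pointer and contrasts element distance with byte distance. The function names are assumptions for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum n values by advancing a pointer instead of indexing. */
int64_t sum_with_pointers(const int32_t *begin, size_t n) {
    const int32_t *end = begin + n;  /* n elements past begin, not n bytes */
    int64_t sum = 0;
    for (const int32_t *p = begin; p != end; p++) {
        sum += *p;
    }
    return sum;
}

/* Subtracting two pointers into the same array yields a count of elements. */
ptrdiff_t element_distance(const int32_t *from, const int32_t *to) {
    return to - from;
}

/* Casting to char * first yields the distance in bytes. */
ptrdiff_t byte_distance(const int32_t *from, const int32_t *to) {
    return (const char *)to - (const char *)from;
}
```

For an array v of int32_t, the expression *(v + 2) is the same as v[2], and v + 3 is 3 elements (12 bytes) past v. Confusing the two units is a frequent source of bugs.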

Section Recording

This section introduces Project 0, a standalone programming exercise that has two goals. First, by implementing a hash table, it will help students understand the coding effort and skills needed to carry out the full project. Second, the result will be an integral part that can be used as-is in the full project.

Section Recording

Semester Project

In this section we will introduce the semester project. We will discuss the scope of the whole project and introduce all milestones. We will then focus on the first software and design deliverable. We will also cover frequent pains faced throughout the project.

Section Recording

Background (Relational Model, Relational Algebra, and SQL)

The discussion of the relational model and algebra provides important understanding in modeling data and data manipulation operations that we discuss throughout the class. We will introduce the operators of the relational algebra and we will give examples both in relational algebra and SQL queries.

Section Slides Section Recording Section Recording (short demo)

The section of this week is dedicated to memory hierarchy. Understanding memory hierarchy and how cache memory works is crucial for understanding how to build an efficient cache-aware data system. Hence, here, we will start from the basics of memory hierarchy, covering how caching works, what is an L3 and L2 shared cache, and what is an L1 private cache. We will discuss the differences between instruction and data caches and we will discuss how programs incur cache misses and how this affects performance.

Slides Code Section Recording

Development Tools

In this section we will discuss important development tools. We will talk about debugging tools including [c]gdb and valgrind, and the build tool GNU make. We will do so by example. The example code is available in the git repository listed below.

Handout (system dev tools) Section Recording

Section Git repository
 
Editor tutorials
A guided tour of Emacs
Learn Vim Progressively
Getting started with Sublime Text 3
 
Navigating Ctags
Ctags with Vim
Ctags with Emacs
Setting up Ctags in Sublime Text
 
Additional resources

Automatic variables in Make
Secondary expansion in Makefiles
Last year's notes on git, valgrind, gcc, and gdb

After having a clear understanding of memory hierarchy, in this section, we will discuss techniques that allow us to build cache-conscious algorithms. We will discuss how to minimize cache misses and how to avoid branch mispredictions by removing branches altogether from our code.

Slides Code Section Recording

In this section we will address performance optimizations and techniques to build high-performance code in the context of the project. We will discuss performance monitoring tools (perf), which show us exactly where the execution time goes and help us understand whether our implementation is efficient and where any performance bottlenecks are.

Handout Section Recording

This section gives a brief example of a midterm question and answer. It is meant to give you an idea of how to answer questions on both midterms 1 and 2.

Sample Question and Answer

In this section we cover techniques for parallel programming. We will talk about synchronization primitives, such as mutex locks and atomic operations, as well as design patterns used to parallelize database workloads. We discuss this in the context of both the class and the project.

Slides Section Recording

In this section we will introduce the concept of ACID transactions and the manifestation of this concept throughout the evolution of database systems.

Slides Section Recording

In this section we will cover locking and how it ensures consistency and isolation.

Slides Section Recording

This section covers the principles of maintaining correctness and consistency in a database. We will discuss how atomicity and durability are ensured through logging.

Slides Section Recording

M. Athanassoulis (presenter), Z. Yan, and S. Idreos, "UpBit: Scalable In-Memory Updatable Bitmap Indexing", in ACM SIGMOD International Conference on Management of Data, 2016. [video]

Niv will discuss modern NoSQL key-value stores and present our recent work from this year that shows how all designs in industry are suboptimal and what to change to get a serious boost in performance. This discussion will help relate the concepts you learn in this class in the context of the relational model to the simpler, but today widely used, key-value store model. It will also give you more exposure to cost modeling and why it can be so powerful, and it will be a nice glimpse into CS265, where NoSQL key-value stores are the focus.
Reading.
SIGMOD Presentation (from minute 52 onward).
TBA

Open discussion with the instructor about academia & industry, job hunting, research, the class, your project, or anything else you would like to bring up.

TBA

Open discussion with the instructor about academia & industry, job hunting, research, the class, your project, or anything else you would like to bring up.

Research

Doing awesome research!

The class is geared towards engaging creative thinking and problem solving to give students a feeling of how computer science research takes place. Many of our students in the past have successfully engaged in research projects with DASlab and published research papers. So far five students have made it to the finals of the ACM SIGMOD Undergrad Research Competition. In 2016 we won first place with the work on adaptive denormalization and in 2017 we won first place with the work on evolving trees. The project and the classes will give you plenty of triggers on new problems to work on.

Talk to the instructor at any point if you are excited about pursuing independent research during or after the course.

Past SIGMOD Undergraduate Research Competition Finalists

CS165 Wicked Awesome Semester Project

Design and build a main-memory optimized column-store

Self-Evaluation

Do you have what it takes?