DASlab  ·  Harvard University

Data Calculator

What if we could reason about the design space of data structures?
Computing — instead of inventing — data structures from first principles.

SIGMOD 2018  ·  IEEE Bulletin 2019

Overview

Data structures are everywhere — and designing the right one is hard.

Data structures are the backbone of every data-driven system, from databases and file systems to machine learning pipelines and blockchains. Every year, approximately 100 new data structures are proposed in the research literature alone, driven by continuous changes in hardware and workload requirements.

Yet, designing a new data structure today remains a slow, manual, and error-prone process. It takes years of expert effort to bring a new data structure from concept to production. Over those years, hardware and workload assumptions change, threatening to reduce the benefit of any new design before it is even deployed.

The root challenge is that the design space of data structures is vast — there are more possible data structure designs than stars in the observable universe. With traditional methods, we explore only a tiny fraction of what is possible, relying on intuition and experience rather than systematic reasoning.

Our thesis: we can compute data structures from first principles. By capturing the fundamental design primitives of data layout and combining them with learned cost models, the Data Calculator enables interactive and semi-automated design of data structures — inventing and evaluating trillions of designs in seconds.

Highlights

Reasoning about all possible data structure designs —
from first principles to learned cost synthesis.

1048
Designs Explored
Candidate data structure designs
Seconds
To Invent
From specification to new design
Trillions
Unknown Designs
Previously undiscovered variants
3
Core Steps
Design · Learn · Synthesize
<5%
Cost Estimation Error
Accurate synthesis for complex designs
3
Use Modes
What-if · Auto-design · Self-design

The Data Calculator: Design from First Principles

The Data Calculator enables interactive and semi-automated design of data structures. It captures the first principles of data layout design and uses learned cost models to compute the performance of arbitrary designs.

The Data Calculator is a data structure design engine. It takes a workload specification, hardware profile, and performance targets as input and outputs an optimal data structure design. Unlike manual design — which relies on years of expert intuition — the Data Calculator reasons systematically over the entire design space.

It works in three steps. First, Design Synthesis from First Principles: nearly all data structures are combinations of a small set of fundamental layout primitives — how nodes organize data and how they relate to each other. The Data Calculator captures these primitives and combines them to define a massive design space of possible structures.

Second, Learned Primitive Access Models: a library of machine-learned models describes fundamental access patterns on blocks of data, capturing algorithmic, engineering, and hardware properties. Third, Algorithm and Cost Synthesis: for each operation in a workload, the Data Calculator synthesizes the exact algorithm and its cost using an expert system, deciding the best access pattern for each node in the data path and computing the total expected cost from the learned models.

First Principles

Fine-grained design primitives capturing fundamental data layout concepts.

Learned Models

ML models trained on diverse hardware and data profiles for accurate cost prediction.

Cost Synthesis

Expert system synthesizing algorithms and costs for arbitrary designs on target hardware.

What-If Analysis

Iteratively test design, workload, and hardware combinations in seconds.

The Periodic Table of Data Structures

A unifying framework for understanding the design space — just as the periodic table organizes elements, we organize the fundamental building blocks of data structures.

Design Primitives

Fundamental concepts from which all data structures can be composed, like elements in chemistry.

Design Continuums

Smooth transitions between data structures — B-trees, LSM-trees, hash tables, and trillions of hybrids.

Cross Design Spaces

Navigate across fundamentally different design families to find optimal hybrid structures.

Auto-Design

Automatically identify the best possible design to match a given workload and hardware target.

The periodic table of data structures provides a unifying map of the vast design space. Just as chemistry's periodic table reveals relationships between elements and predicts new compounds, our periodic table reveals relationships between data structure designs and predicts new, previously unexplored structures.

At its core, the table captures design continuums — smooth transitions from one design family to another. For example, there exists a continuum from pure B-trees to pure LSM-trees, with an infinite number of hybrid designs in between. Each point on this continuum represents a valid data structure with different trade-offs between read cost, write cost, and memory usage. The Data Calculator can evaluate any point on any continuum, enabling self-designing systems that adapt their data structure to changing workloads and hardware in real time.

Publications

Stratos Idreos, Kostas Zoumpatianos, Brian Hentschel, Michael S. Kester, Demi Guo
SIGMOD 2018 ACM SIGMOD International Conference on Management of Data, 2018
Stratos Idreos, Kostas Zoumpatianos, Subarna Chatterjee, Wilson Qin, Abdul Wasay, Brian Hentschel, Mike Kester, Niv Dayan, Demi Guo, Minseo Kang, Yiyou Sun
IEEE Bulletin 2019 IEEE Data Engineering Bulletin, 2019

People