DASlab  ·  Harvard University

Data Canopy

Accelerating exploratory statistical analysis
by reusing aggregate structure instead of rescanning base data.

Overview

Statistics are everywhere. They power data science, accelerate scientific discovery, and form core building blocks of many machine learning algorithms. During exploratory analysis, data scientists repeatedly compute related statistics on overlapping parts of the same data set.

Despite that repetition, modern systems typically recompute statistics from scratch every time. The result is redundant data access, repeated scans over base data, and unnecessarily slow exploratory analysis.

Data Canopy addresses this inefficiency by synthesizing statistics from a reusable library of basic aggregates. Instead of throwing away past work, it stores and reuses intermediate aggregate structure so future statistical queries can be answered at the right resolution without repeatedly returning to raw data.
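To make "synthesizing statistics from basic aggregates" concrete, here is a minimal illustrative sketch (not Data Canopy's actual implementation): once a single pass has produced a count, a sum, and a sum of squares, several statistics can be assembled from those building blocks without touching the data again.

```python
# Illustrative sketch (not the paper's implementation): many statistics
# can be synthesized from a few basic aggregates -- here count, sum,
# and sum of squares -- without rescanning the base data.

def basic_aggregates(values):
    """One pass over the data yields reusable building blocks."""
    n = len(values)
    s = sum(values)
    sq = sum(v * v for v in values)
    return n, s, sq

def mean(n, s, sq):
    return s / n

def variance(n, s, sq):
    # E[x^2] - E[x]^2, assembled purely from the stored aggregates.
    return sq / n - (s / n) ** 2

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
agg = basic_aggregates(data)   # computed once, reused for every statistic
print(mean(*agg))              # 5.0
print(variance(*agg))          # 4.0
```

The key point is that `basic_aggregates` reads the data once, while every later statistic is pure arithmetic over the stored triple.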

Why "Canopy"?

Like a canopy layered above the underlying data, Data Canopy builds a reusable aggregate structure over the base data set so many future statistical questions can be answered more efficiently.

Core Idea

Repeated analysis should not trigger repeated scans. Data Canopy keeps useful aggregate building blocks around and composes them to answer new questions quickly.

[Figure: Data Canopy overview]

Highlights

Data Canopy turns repeated exploratory analysis into a progressively cheaper workload.

Reusable Basic Aggregates

Statistics are synthesized from shared building blocks instead of being recomputed from scratch.

Logarithmic Query Assembly

Queries combine aggregates at the right resolution in logarithmic time.

Improving Future Queries

Performance improves as new requests reuse past computations.

The Data Canopy Approach

Data Canopy computes and organizes basic aggregates so repeated statistical analysis can avoid redundant base-data access.

Aggregate Library

Statistics are built from a reusable set of basic aggregates rather than direct rescans.

Chunk Resolution

Aggregates are maintained at chunk granularity to support flexible combinations later.

Structured Reuse

The data structure stores aggregate information so future requests can reuse earlier work.

Exploratory Workloads

Especially useful when analysts repeatedly ask overlapping questions over the same data.

Data Canopy computes basic aggregates over chunks of the data set and keeps them in an efficient structure that supports later composition. When a new statistical query arrives, the system combines already available aggregates at the appropriate resolution instead of re-reading the full base data.
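As a sketch of how chunk-level aggregates can be combined in logarithmic time, the following uses a standard segment tree over per-chunk sums. This is a stand-in chosen for illustration, not Data Canopy's actual data structure: a range query touches O(log n) stored nodes instead of rescanning the base data.

```python
# Hedged sketch: a segment tree over chunk-level sums, used here as a
# stand-in for Data Canopy's aggregate structure. A range query combines
# O(log n) stored nodes instead of rescanning the base data.

class SegmentTree:
    def __init__(self, chunk_sums):
        self.n = len(chunk_sums)
        self.tree = [0.0] * (2 * self.n)
        self.tree[self.n:] = chunk_sums        # leaves: per-chunk sums
        for i in range(self.n - 1, 0, -1):     # internal nodes: merged sums
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def range_sum(self, lo, hi):
        """Sum over chunks [lo, hi) by combining O(log n) nodes."""
        res = 0.0
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:            # lo is a right child: take it, move right
                res += self.tree[lo]
                lo += 1
            if hi & 1:            # hi is a right child: step left, take it
                hi -= 1
                res += self.tree[hi]
            lo //= 2
            hi //= 2
        return res

chunk_sums = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
t = SegmentTree(chunk_sums)
print(t.range_sum(2, 6))   # 4 + 1 + 5 + 9 = 19.0
```

The same composition idea extends to other decomposable aggregates (counts, sums of squares, min/max), so one structure can serve many statistics.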

This changes the behavior of exploratory analysis fundamentally. Repeated requests to compute statistics no longer imply repeated passes over the original data set. Instead, future queries benefit from past work and past data access.

Compared to state-of-the-art tools that repeatedly touch base data, and whose performance therefore stays flat and often slow, Data Canopy gets faster over time as workloads accumulate reusable aggregate structure.

[Figure: Data Canopy graph]

Impact on Exploratory Analysis

In exploratory settings, analysts repeatedly probe similar parts of a data set while refining hypotheses. That repetition is exactly where Data Canopy pays off: work performed for one query becomes a useful component for the next.

Instead of treating each statistical question as an isolated one-off computation, Data Canopy turns the full session into a cumulative process. The more overlap there is among queries, the more the system can avoid expensive redundant computation and redundant I/O.
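The cumulative effect can be seen in a toy model (an illustration only, with hypothetical names like `chunk_cache`, not Data Canopy's code): if per-chunk sums are cached, each chunk of base data is scanned at most once across the whole session, so overlapping queries get progressively cheaper.

```python
# Toy model of cumulative reuse (an illustration, not Data Canopy's code):
# a cache of per-chunk sums. Each chunk is scanned at most once per
# session, so overlapping queries get progressively cheaper.

CHUNK = 4
data = list(range(32))
chunk_cache = {}        # chunk index -> cached sum
scans = 0               # how many chunks were actually read from base data

def chunk_sum(c):
    global scans
    if c not in chunk_cache:
        scans += 1      # only a cache miss touches the base data
        chunk_cache[c] = sum(data[c * CHUNK:(c + 1) * CHUNK])
    return chunk_cache[c]

def range_sum(lo_chunk, hi_chunk):
    return sum(chunk_sum(c) for c in range(lo_chunk, hi_chunk))

range_sum(0, 6)         # first query: scans chunks 0-5
print(scans)            # 6
range_sum(2, 8)         # overlapping query: only chunks 6 and 7 are new
print(scans)            # 8
```

The second query overlaps the first on four of its six chunks, so it triggers only two new scans; a third query over already-covered chunks would trigger none.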

Key Benefit: No Repeated Scans

Future queries can leverage prior aggregate computation instead of returning to raw data.

Publications

Data Canopy: Accelerating Exploratory Statistical Analysis
Abdul Wasay, Xinding Wei, Niv Dayan, Stratos Idreos
Proceedings of the ACM SIGMOD International Conference on Management of Data, 2017

People