Statistics are everywhere! Not only do they power data science and accelerate scientific discovery, but they also comprise the building blocks of many core machine learning algorithms. During data exploration, when data scientists are trying to understand different properties of a data set, many statistics are repeatedly computed over overlapping parts of that data set. Despite this repetition, modern data systems compute each statistic from scratch, redundantly accessing the base data and slowing down statistical analysis.
We address this problem with Data Canopy, which synthesizes statistics from a library of building blocks that we call basic aggregates. Basic aggregates are computed at the resolution of a chunk of the data set, and Data Canopy maintains them in an efficient data structure that answers statistical queries in time logarithmic in the number of basic aggregates. As a result, future queries avoid repeatedly going back to the base data: they combine existing basic aggregates at the appropriate resolution instead.
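The idea above can be sketched as follows. This is an illustrative toy, not the paper's actual implementation: per-chunk sums, sums of squares, and counts serve as the basic aggregates, a binary segment tree over them answers any chunk-aligned range query by combining O(log n) nodes, and statistics such as mean and variance are then synthesized from the combined aggregates (variance via E[x²] − E[x]²). The class name, chunk size, and method names are assumptions.

```python
CHUNK = 4  # resolution of the basic aggregates (assumed chunk size)

class DataCanopy:
    """Toy sketch: segment tree over per-chunk basic aggregates."""

    def __init__(self, data):
        # One pass over the base data: compute per-chunk basic aggregates
        # (sum, sum of squares, count).
        chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
        leaves = [(sum(c), sum(x * x for x in c), len(c)) for c in chunks]
        self.n = len(leaves)
        # Standard iterative segment tree: leaves at positions n..2n-1.
        self.tree = [None] * self.n + leaves
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = self._merge(self.tree[2 * i], self.tree[2 * i + 1])

    @staticmethod
    def _merge(a, b):
        # Basic aggregates compose by simple addition.
        return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

    def _range(self, lo, hi):
        # Combine the O(log n) tree nodes covering chunks [lo, hi).
        acc = (0.0, 0.0, 0)
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                acc = self._merge(acc, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                acc = self._merge(acc, self.tree[hi])
            lo //= 2
            hi //= 2
        return acc

    def mean(self, chunk_lo, chunk_hi):
        s, _, cnt = self._range(chunk_lo, chunk_hi)
        return s / cnt

    def variance(self, chunk_lo, chunk_hi):
        # Variance synthesized from basic aggregates: E[x^2] - E[x]^2.
        s, ss, cnt = self._range(chunk_lo, chunk_hi)
        return ss / cnt - (s / cnt) ** 2
```

For example, `DataCanopy(list(range(16)))` builds four chunk-level aggregates in one pass; `mean(0, 4)` and `variance(0, 4)` then combine tree nodes without rereading the base data.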
What this means for exploratory analysis is that repeated requests to compute statistics do not trigger multiple passes over the data set, substantially reducing query execution time. Compared to state-of-the-art tools that repeatedly access the base data and whose performance therefore stays flat, the performance of Data Canopy keeps improving: each new query can reuse the computation and data access of past queries.
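This reuse effect can be demonstrated with a minimal lazy-caching sketch (an assumed design for illustration, not the paper's code): chunk aggregates are computed the first time a query touches a chunk and cached thereafter, so overlapping queries read less and less base data. The class and counter names are hypothetical.

```python
class LazyChunkCache:
    """Toy sketch: chunk aggregates materialized on first access, then reused."""

    def __init__(self, data, chunk=4):
        self.data = data
        self.chunk = chunk
        self.cache = {}          # chunk id -> (sum, count)
        self.base_accesses = 0   # values read from the base data so far

    def _chunk_agg(self, cid):
        if cid not in self.cache:
            part = self.data[cid * self.chunk:(cid + 1) * self.chunk]
            self.base_accesses += len(part)   # only uncached chunks hit base data
            self.cache[cid] = (sum(part), len(part))
        return self.cache[cid]

    def mean(self, chunk_lo, chunk_hi):
        aggs = [self._chunk_agg(c) for c in range(chunk_lo, chunk_hi)]
        return sum(a[0] for a in aggs) / sum(a[1] for a in aggs)
```

A first query over chunks 0..3 reads 16 base values; a second, overlapping query over chunks 2..5 reads only the 8 values of the two new chunks, reusing the cached aggregates for the rest.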