Deriving knowledge from data is central to how we live, learn, and decide. Machine learning and data science pipelines are applied extensively to extract knowledge from an ever-increasing amount of data across all fields, including high-energy physics, astronomy, and genetics. These pipelines consist of multiple stages, from data exploration to model design, training, and deployment. Each stage has its own set of algorithms and techniques, yet they all share a common challenge: they involve repeated computation on huge data sets. This bottleneck slows down machine learning pipelines, which is problematic for latency-sensitive applications (such as self-driving cars and medical diagnosis). It also means that only a fraction of the generated data can be processed, leading to lower-quality models, fewer decisions per unit of time, and, overall, limited applicability of machine learning. To address this, Queriosity is a new system that tackles two of the fundamental problems in this area, making data science more intuitive and more interactive.

Smart Synthesis

The first problem is response time: given the complexity of the computations involved and the growing amounts of data in typical data science pipelines, performance quickly becomes a major bottleneck. Queriosity accelerates data science pipelines by smartly synthesizing results out of basic primitives, as opposed to recomputing from scratch over the raw data every time. This accelerates current data science and ML algorithms as well as future algorithms that, although different, are expected to rely on the very same primitives.

Insights

The second key aspect of Queriosity is that it provides hints about interesting data areas and patterns, directing the attention of data scientists to promising parts of the data. This accelerates the process of discovery, since human understanding and decisions are typically the major bottlenecks.

Overall, Queriosity accelerates data science, making it more interactive and more intuitive. Queriosity is currently being built in C++, and it also includes a virtual reality front-end. The first critical component, Data Canopy, is described in our SIGMOD 2017 paper; it synthesizes statistics out of basic primitives as opposed to recomputing from scratch with every request, bringing a speedup of several orders of magnitude to any task that involves statistical computations. The Deep Collider paper in ICLR 2021 reconsiders conventional model design wisdom and enables drastically better model design by simultaneously balancing accuracy, training time, deployment time, and memory resources. Finally, the MotherNets paper in MLSys 2020 enables fast and accurate training and deployment of ensembles of deep neural networks (a 2 to 3 percent reduction in absolute test error rate and up to 35 percent faster training compared to state-of-the-art approaches). MotherNets also establishes a new and navigable Pareto frontier for the accuracy-training cost tradeoff of deep neural network ensembles. Stay tuned for more!

Publications

Deep Learning: Systems and Responsibility. Abdul Wasay, Subarna Chatterjee, Stratos Idreos. Proceedings of the ACM SIGMOD International Conference on Management of Data, 2021. [Paper] [website] [video]

More or Less: When and How to Build Convolutional Neural Network Ensembles. Abdul Wasay, Stratos Idreos. International Conference on Learning Representations, 2021. [Paper]

MotherNets: Rapid Deep Ensemble Learning. Abdul Wasay, Brian Hentschel, Yuze Liao, Sanyuan Chen, Stratos Idreos. Proceedings of the Conference on Machine Learning and Systems, 2020. [Paper]

Data Canopy: Accelerating Exploratory Statistical Analysis. Abdul Wasay, Xinding Wei, Niv Dayan, Stratos Idreos. Proceedings of the ACM SIGMOD International Conference on Management of Data, 2017. [Paper] [website] [video]

Queriosity: Automated Data Exploration. Abdul Wasay, Manos Athanassoulis, Stratos Idreos. Proceedings of the IEEE International Congress on Big Data, 2015. [Paper] [Poster]