Column Sketches - DASlab @ Harvard University

Overview

Column Sketches are a secondary indexing technique designed to provide consistent and efficient predicate evaluation over column, row, and mixed data layouts.

Each Column Sketch has two components: a compression map and a sketched column. The compression map is a lossy function from values in the base domain to numerical codes, specifically designed to help predicate evaluation. Applying that map value-by-value yields a compact sketched column with fixed-width codes.

When queries arrive, the system first evaluates them affirmatively or negatively for the vast majority of tuples using only the sketched column. Only a small remaining fraction needs to consult the base data, which keeps predicate evaluation fast while remaining robust.

Compression Map

A carefully designed lossy mapping turns original values into compact codes that remain useful for predicate evaluation.

Sketched Column

The compressed codes form a small auxiliary column that filters most tuples before touching the base layout.

Highlights

Column Sketches reduce data access during predicate evaluation while preserving robust performance guarantees.

Core Components

Compression map plus sketched column

1 / 2^b

Base Data Access

Average access rate when using b-bit sketch codes

Robust

Predicate Evaluation

Efficient over column, row, and mixed data layouts

Research

The key design goal is to compress values in a way that preserves predicate usefulness instead of optimizing only generic compression ratio.

Heavy Hitters

Frequent values receive unique codes so predicate evaluation can often decide directly in the sketch.

Uniform Codes

The code distribution is kept close to uniform so no single code dominates the sketch and weakens filtering.

Mostly Sketch-Only

Most tuples can be answered positively or negatively using the sketch before consulting base data.

Less Base Access

Larger code budgets reduce how often the base layout must be read during predicate evaluation.

Column Sketch predicate evaluation example

The compression map is the mechanism that guarantees Column Sketch performance. It first identifies the heavy hitters of the input domain and assigns each one a unique code. It then shapes the remaining mapping so the resulting code distribution is approximately uniform.

Because of these properties, a Column Sketch rarely needs to touch many base tuples. If a sketch uses b bits per code, then on average the base data is accessed once out of every 2^b values. As the code budget grows, the sketch skips more base data and the resulting predicate evaluation reads less than a scan over the original layout.

Overview

Compression Map

Sketched Column

Highlights

Research

Heavy Hitters

Uniform Codes

Mostly Sketch-Only

Less Base Access

Paper

People

Brian Hentschel

Michael Kester

Stratos Idreos