Column Sketches are a secondary indexing technique designed to provide consistent and efficient predicate evaluation over column, row, and mixed data layouts.
Each Column Sketch has two components: a compression map and a sketched column. The compression map is a lossy function from values in the base domain to numerical codes, specifically designed to help predicate evaluation. Applying that map value-by-value yields a compact sketched column with fixed-width codes.
When queries arrive, the system first evaluates them affirmatively or negatively for the vast majority of tuples using only the sketched column. Only a small remaining fraction needs to consult the base data, which keeps predicate evaluation fast while remaining robust.
A carefully designed lossy mapping turns original values into compact codes that remain useful for predicate evaluation.
The compressed codes form a small auxiliary column that filters most tuples before touching the base layout.
Column Sketches reduce data access during predicate evaluation while preserving robust performance guarantees.
The key design goal is to compress values in a way that preserves predicate usefulness instead of optimizing only generic compression ratio.
Frequent values receive unique codes so predicate evaluation can often decide directly in the sketch.
The code distribution is kept close to uniform so no single code dominates the sketch and weakens filtering.
Most tuples can be answered positively or negatively using the sketch before consulting base data.
Larger code budgets reduce how often the base layout must be read during predicate evaluation.
The compression map is the mechanism that guarantees Column Sketch performance. It first identifies the heavy hitters of the input domain and assigns each one a unique code. It then shapes the remaining mapping so the resulting code distribution is approximately uniform.
Because of these properties, a Column Sketch rarely needs to touch many base tuples. If a sketch uses b bits per code, then on average the base data is accessed once out of every 2b values. As the code budget grows, the sketch skips more base data and the resulting predicate evaluation reads less than a scan over the original layout.