Over the last two decades data analytics systems have undergone tremendous architectural changes: (i) oftentimes they execute solely from memory, (ii) they use columnar or column-group storage, and (iii) they employ performance optimizations for scans including (a) scan sharing, (b) vectorization, c) working over compressed data, (d) SIMD execution, and (e) multi-threaded execution.
These design decisions put increasing pressure on index scans to the point where many systems opt not to use secondary indexes and rely only on a single access path: a fully optimized scan. From a research and design point of view the problem becomes especially intriguing when considering shared scans; if we can process >> 1 queries with a single scan, it could be that indexes are not needed any more. The two research questions we answer in this work are the following:
Given all advancements in data system design and scans, are there still important cases where we need to consider two access paths?
And if so, how to perform access path selection?
To answer these fundamental access path questions we embarked on developing a detailed model and a fine tuned implementation that capture the behavior of fast scans and memory optimized secondary B+-tree Indexes given all modern design techniques. Our analysis takes into account (i) the hardware characteristics, (ii) the dataset characteristics (data size, tuple size, physical layout), (iii) the workload characteristics (query concurrency, query selectivity) and (iv) any access method tuning (branching factor, level of parallelism for scans).
Access Path Selection happens based on the ratio between the expected cost of using concurrent index accesses and the expected cost of using a shared scan for a batch of queries. This ratio is comprised of seven key components, shown in the figure below. First, the index cost depends on (i) traversing the tree, (ii) traversing the leaves, (iii) writing the result, (iv) sorting the result (with the rowID order). The last step is needed to ensure a drop-in replacement for the sequential scan operator (and as a result, clustered accesses in other columns). Second, the scan cost, depends on (i) running a base scan of the data, (ii) evaluating the predicate, and (iii) writing the result.
For the past forty years access path selection uses a fixed selectivity threshold typically taking into account hardware parameters as well. Our results show that it is time to reconsider this practice. There is no fixed access path selection point any more. Instead it varies based on concurrency. As concurrency increases, the usefulness of the index drops as more queries can be processed with a single scan. A system should capture these properties on-the-fly in order to make accurate decisions. The lightweight access path selection model shown above can be incorporated in any modern analytical system without any extra complications as all new variables needed can be captured easily and accurately online.
In order to understand whether a data analytics system needs to add indexing capabilities we first need to have a good estimate of the performance gain it can offer. Our interactive calculator offers just that!
For multiple concurrent selections on the same column the model provides a detailed comparison between shared scans and concurrent index traversals. Access Path Selection is performed every time the value of any parameter is altered in the text-boxes below!
Michael S. Kester, Manos Athanassoulis, Stratos Idreos
Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe?
In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2017.