Near-memory computing has held great promise for decades. While early failures can be attributed to immature technology, the advent of 3D stacking, the slowdown of traditional technology scaling, the rise of specialized hardware, and the increased density of emerging memory technologies mean that near-memory computing is finally ready for prime time. The missing element is the architecture and software systems to take advantage of it. We believe that in order to unlock the promise of near-memory computing we must facilitate rapid co-design of data systems and hardware architectures.
Our long-term vision is to build the technology for easy and fast hardware/software co-design. Our ability to collect data is growing at an exponential rate, and as a result there is a rapidly growing number of data-driven applications whose main bottleneck is reading and analyzing data. State-of-the-art data systems and hardware must push all data through the memory hierarchy to perform a series of complex operations. For example, a data system that stores and analyzes click logs in a massive-scale web service has to continuously absorb and analyze those logs to perform filtering, aggregations, correlations, and other operations. Ideally, applications would instead process data right at its source and avoid the main cost of moving it through the memory hierarchy. To explore near-data processing (NDP) for data systems and to best utilize all compute resources, we propose to build software and hardware tailored for such processing. Moreover, we focus on designing such hardware and software quickly and efficiently, so that they scale at the same rate as data and application growth.
As it stands today, data system design and hardware design are two independent research fields. Even though their uses are strongly interconnected, data systems and hardware designed in isolation forgo complementary opportunities for dramatic gains in performance and energy costs. As the explosion of data production continues, current software and hardware approaches are inefficient and exhibit poor scalability. We plan to explore radically new near-memory computing technologies, which offer opportunities to rethink both the software and hardware layers of such systems and exploit their synergies. There are numerous open and challenging questions: What exactly should a logic unit near a storage device contain? How should a software system schedule its processing tasks across the various computational units (NDP, CPU) to keep all of them utilized at the maximum rate allowed by the memory bus? How does NDP interface with algorithmic advances in data systems that minimize data movement, such as indexing and scan sharing?
Instead of always moving data up the memory hierarchy toward the CPU core, which increases latency and energy consumption as in traditional data systems and hardware, a hardware/software co-designed system processes data locally near the memory whenever possible and in the CPU for more complex operations. This approach better utilizes both processing units and minimizes the amount of data that must traverse the memory hierarchy. For example, the initial filtering operations, which in a traditional system require moving all data to the CPU, can be performed directly in or near memory in a co-designed data system, leaving only the more complex aggregation operations to the CPU, now over a smaller subset of the initial data (since filtering has already been applied).
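To make the division of labor concrete, the following sketch (our illustration, not the authors' system; the function names and data are hypothetical) models an NDP unit that filters click-log tuples near memory while the CPU aggregates only the survivors:

```python
# Illustrative sketch: push the cheap filter to a hypothetical NDP unit,
# keep the more complex aggregation on the CPU.

def ndp_filter(rows, predicate):
    """Stands in for logic near memory: only qualifying tuples
    would cross the memory bus."""
    return [r for r in rows if predicate(r)]

def cpu_aggregate(rows, key_field, value_field):
    """Complex aggregation stays on the CPU, over the reduced input."""
    totals = {}
    for r in rows:
        totals[r[key_field]] = totals.get(r[key_field], 0) + r[value_field]
    return totals

clicks = [{"page": "a", "ms": 10}, {"page": "b", "ms": 200},
          {"page": "a", "ms": 300}]
hot = ndp_filter(clicks, lambda r: r["ms"] > 100)  # 2 of 3 tuples survive
print(cpu_aggregate(hot, "page", "ms"))  # {'b': 200, 'a': 300}
```

In a real co-designed system the filter would run in the memory controller or logic layer of a 3D stack; the point of the sketch is only that the CPU-side operator sees a strictly smaller input.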
We study the potential of near-data processing hardware accelerators for modern data systems and the associated side-effects for data system design. With an increasing number of database applications keeping all data, or at least the hot data, in large main memories, the memory wall becomes the primary bottleneck. Several database operators, such as selection, projection, and aggregation, produce output no larger than their input, making them amenable to data-movement optimization. Executing these operators directly in memory and transporting only the necessary data (i.e., qualifying tuples, qualifying columns, or aggregates) through the memory subsystem leaves the CPU free to perform other tasks, reduces cache pollution, and alleviates memory bus pressure. On the other hand, for operators that may produce results larger than their input, such as joins, NDP cannot always guarantee a performance improvement.
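A back-of-the-envelope calculation shows why output-bounded operators are attractive NDP candidates; the row width, table size, and selectivity below are hypothetical, chosen only to make the arithmetic visible:

```python
# Hypothetical numbers: bytes crossing the memory bus for a selective
# filter, with and without near-data processing.

ROW_BYTES = 100            # assumed tuple width
N_ROWS = 10_000_000        # assumed table size
SELECTIVITY = 0.01         # assumed fraction of qualifying tuples

# Traditional system: every tuple travels up the hierarchy to the CPU.
baseline_bytes = N_ROWS * ROW_BYTES

# NDP: the select runs in memory; only qualifying tuples cross the bus.
ndp_bytes = int(N_ROWS * SELECTIVITY) * ROW_BYTES

print(baseline_bytes // ndp_bytes)  # 100x less bus traffic at 1% selectivity
```

The ratio is simply the inverse of the selectivity, which is why highly selective scans benefit the most; a join whose output exceeds its input would invert this arithmetic.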
As a first case study, we consider select operators, which have improved significantly in recent years through techniques such as working over compressed data, vectorization, and multicore execution. However, these techniques help only when the system is not memory bound. Designing NDP solutions for select operators, by contrast, allows us to avoid moving data entirely.
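For contrast, here is a minimal sketch of a conventional vectorized select (our illustration, not the paper's implementation): even when the predicate is evaluated one fixed-size vector at a time, every value of the column still streams through the CPU caches, which is exactly the traffic an in-memory select avoids:

```python
import array

def vectorized_select(col, threshold, vector_size=1024):
    """Evaluate the predicate one fixed-size vector at a time and
    collect qualifying positions, as vectorized engines do. The whole
    column is still read by the CPU, regardless of selectivity."""
    out = array.array("q")  # 64-bit positions of qualifying values
    for base in range(0, len(col), vector_size):
        for i, v in enumerate(col[base:base + vector_size]):
            if v > threshold:
                out.append(base + i)
    return list(out)

print(vectorized_select([5, 42, 7, 99, 1], 10))  # positions 1 and 3 qualify
```

An NDP select would instead produce only the qualifying positions (or tuples) near memory, so the bus traffic scales with the result size rather than the column size.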
This project is fostered by collaboration with the Harvard Architecture, Circuits, and Compilers Group.
Sam Xi, Oreoluwa Babarinsa, Manos Athanassoulis, Stratos Idreos
Beyond the Wall: Near-Data Processing for Databases
In Proceedings of the International Workshop on Data Management on New Hardware (DaMoN), 2015.