Fundamentals of Deep Neural Networks

Deep neural networks today consist of dozens of layers built from simple but non-linear modules. In this tutorial, we draw examples from convolutional neural networks. This class of networks was first introduced for computer vision tasks but is now utilized in many other applications such as drug discovery, natural language processing, and query optimization. However, many of the principles we discuss, such as the tradeoff between accuracy and training resources, apply to a wide range of deep learning paradigms.

Every layer in a neural network transforms its input into a progressively more abstract representation, one that better captures aspects of the data set that are meaningful for a classification or detection task. Once a deep neural network architecture has been specified, the training process consists of alternating forward and backward passes that repeat until a specified metric (usually training accuracy) converges.

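As a minimal sketch of these alternating passes, the PyTorch loop below trains a small fully connected network until its training accuracy stops improving. The architecture, synthetic data, and stopping threshold are illustrative placeholders, not recommendations.

```python
import torch
import torch.nn as nn

# Placeholder network: a stack of simple but non-linear modules.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Synthetic stand-in data; in practice this would be a real labeled data set.
x = torch.randn(512, 784)
y = torch.randint(0, 10, (512,))

prev_acc = 0.0
for epoch in range(100):
    logits = model(x)               # forward pass
    loss = loss_fn(logits, y)
    optimizer.zero_grad()
    loss.backward()                 # backward pass
    optimizer.step()

    acc = (logits.argmax(dim=1) == y).float().mean().item()
    if abs(acc - prev_acc) < 1e-4:  # stop once training accuracy converges
        break
    prev_acc = acc
```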

A Query Processing Analogy

We can draw an analogy here between a query processing pipeline and a deep neural network. Every layer in a neural network functions as a semantic filter, letting only information (i.e., patterns) relevant to the task at hand pass through to the next layer (Zeiler2014, Lecun2015). In addition to its logic, every layer in the neural network has its own set of data (or parameters) that is tuned during the training phase. One way of thinking about the training process is as getting a query execution pipeline ready. Once this pipeline is ready, i.e., the neural network parameters are trained, deployment proceeds by passing data sets through the neural network and obtaining the corresponding results or decisions, i.e., a probability distribution over labels.


Computation and Data Movement

Training and deploying deep neural networks trigger a large amount of computation on huge data sets. The large data sets needed to train neural networks, the parameters at every layer, and the intermediate results all contribute to this data movement and computation. For instance, Wide ResNets, a state-of-the-art class of neural network architectures, can have up to a million weights per layer and over $40$ layers in a single network (Zagoruyko2016).
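To make the scale concrete, the back-of-the-envelope calculation below estimates the parameter memory of a hypothetical network with the dimensions quoted above (one million weights per layer, 40 layers, 32-bit floats). The numbers are illustrative, not measurements of any specific Wide ResNet.

```python
# Illustrative parameter-memory estimate (not a measurement of a real model).
weights_per_layer = 1_000_000
num_layers = 40
bytes_per_weight = 4  # 32-bit floating point

param_bytes = weights_per_layer * num_layers * bytes_per_weight
print(f"Parameters alone: {param_bytes / 2**20:.0f} MiB")
# Training roughly multiplies this by gradients and optimizer state, and the
# intermediate results for every mini-batch add further data movement on top.
```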




Metrics

There are two categories of metrics: quality-related metrics and resource-related metrics. Quality-related metrics include training accuracy, generalization accuracy, and robustness, and they quantify how good a deep neural network is at performing the task it is trained for. On the other hand, resource-related metrics such as training time, inference speed, and memory usage quantify how efficiently a network can achieve the desired result (Goodfellow2016).


Accuracy vs. Time Efficiency

Compression reduces the overall footprint of a neural network and is a prevalent technique to reduce training time, memory usage, and inference time. The accuracy of the compressed network might be affected depending on whether the technique is lossy or lossless. Compression techniques in deep learning fall into three distinct categories: (i) quantization, (ii) parameter pruning, and (iii) knowledge distillation.

Quantization approaches reduce the size of neural networks by decreasing the precision of network parameters and intermediate results. Both scalar and vector quantization techniques have been explored; these replace the original data with a set of quantization codes and a codebook. The codebook may be constructed in a lossless manner (e.g., Huffman encoding) or in a lossy manner (e.g., low-bit fixed-point representations or k-means); how lossy the codebook is determines whether the quantization affects accuracy. There are proposals to quantize weights, intermediate results, or both down to precisions ranging from 8 bits to just one or two bits, e.g., in the case of Binary Neural Networks (Gong2014, Talwalker2019). Similarly, there are proposals to replace floating point numbers in the network with integers that are more efficient to operate on (Jacob2018, Wu2018). Finally, there are approaches that dynamically learn how to quantize based on the given architecture, data set, and hardware (Choi2019, Jain2020).
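The sketch below illustrates lossy scalar quantization with a k-means codebook, in the spirit of the approaches cited above. The layer size, codebook size, and use of scikit-learn are our own illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in weight matrix for a single layer (not from a real trained model).
weights = np.random.randn(128, 128).astype(np.float32)

# Learn a 256-entry codebook so that each weight is stored as an 8-bit code.
kmeans = KMeans(n_clusters=256, n_init=1).fit(weights.reshape(-1, 1))
codes = kmeans.labels_.astype(np.uint8)       # quantization codes (1 byte each)
codebook = kmeans.cluster_centers_.ravel()    # shared codebook (256 floats)

# Dequantization is lossy: each weight becomes its nearest codebook entry.
dequantized = codebook[codes].reshape(weights.shape)
print("mean absolute error:", np.abs(weights - dequantized).mean())
```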

Neural network pruning approaches operate under the premise that many parameters are unnecessary or contribute little to the network's output, and they design techniques to remove those parameters. Various approaches prune at different granularities, e.g., parameter-level (Han2015), filter-level (He2017), and network-level (Lazarevic2001). Similarly, various criteria can be used to decide what to prune. These range from magnitude-based approaches (i.e., prune parameters with low magnitude) (Han2015) to loss-based approaches (i.e., prune parameters that have little effect on a given loss function) (Liu2020) to approaches that automatically learn which parameters to prune (Carreira2018).
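Below is a minimal sketch of magnitude-based, parameter-level pruning in the style of (Han2015): weights whose absolute value falls below a threshold are zeroed out via a mask. The sparsity level and tensor shape are arbitrary illustrative choices.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the lowest-magnitude fraction of weights (parameter-level pruning)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask

weights = np.random.randn(512, 512).astype(np.float32)
pruned = magnitude_prune(weights, sparsity=0.9)  # keep only the top 10% by magnitude
print(f"Non-zero fraction after pruning: {np.count_nonzero(pruned) / pruned.size:.2f}")
```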

Knowledge distillation approaches can reduce the memory and computational footprint of deep neural networks by transferring the function learned by a large network into a smaller one (Hinton2015). This class of methods has been used to improve inference at the edge (Romero2014), to accelerate ensemble training (Hinton2015), and to bootstrap the training of large networks (Zhang2018).
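A common way to realize this transfer, following the general recipe of (Hinton2015), is to train the small (student) network against the temperature-softened output distribution of the large (teacher) network. The sketch below shows one such loss; the temperature and mixing weight are hypothetical hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Mix the usual cross-entropy with a KL term toward the teacher's soft targets."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    kl = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1 - alpha) * ce

# Example usage with stand-in logits and labels.
student_logits = torch.randn(32, 10, requires_grad=True)
teacher_logits = torch.randn(32, 10)
labels = torch.randint(0, 10, (32,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```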

This tradeoff between accuracy and various resources has also been extensively explored when training ensembles of deep neural networks, where not just one but multiple networks are trained to perform the same task (Abdul2021). Approaches such as Snapshot Ensembles and Fast Geometric Ensembles generate ensembles of deep neural networks by training a single neural network model once and saving copies of the model at various points along the training trajectory (Huang2017, Garipov2018). Additionally, approaches like TreeNets and MotherNets capture the structural similarity between different networks in a heterogeneous ensemble and train the shared structure just once (Lee2015, Abdul2020). All these approaches provide lower accuracy than the baseline approach of training every member of the ensemble from scratch, but they require significantly less training time. MotherNets and TreeNets also reduce memory usage and inference time.
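The sketch below gives the flavor of snapshot-style ensembling: a single model is trained once, copies of its parameters are saved at several points along the training trajectory, and the copies are averaged at inference time as an ensemble. The checkpoint spacing is a simplified placeholder, and the cyclic learning-rate schedule used by (Huang2017) to diversify snapshots is omitted for brevity.

```python
import copy
import torch
import torch.nn as nn

# Small stand-in model and synthetic data.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))

snapshots = []
for step in range(300):
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (step + 1) % 100 == 0:            # save a copy along the training trajectory
        snapshots.append(copy.deepcopy(model))

# At inference time, the snapshots act as an ensemble by averaging predictions.
with torch.no_grad():
    ensemble_probs = torch.stack([m(x).softmax(dim=1) for m in snapshots]).mean(dim=0)
```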

In distributed settings, the major contributor to both inference and training cost is the overhead of communication between nodes. Recent work reduces this communication cost by progressively relaxing the constraint that all nodes must have a fresh copy of the network parameters at all times. Local SGD, for instance, trains multiple copies of the network in parallel and averages their parameters after a configurable number of training rounds (Stich2018). Complementary to this are methods that reduce communication cost by compressing the gradients exchanged between devices (Lin2018). Recently, there has also been work on prioritizing what to communicate between machines (Jayarajan2019).
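The sketch below captures the core idea of Local SGD in a single process: several replicas of a model take gradient steps independently, and their parameters are averaged only every few rounds. A real implementation would place replicas on separate devices and average over a collective communication library; the model, batch sizes, and schedule here are illustrative.

```python
import copy
import torch
import torch.nn as nn

def average_parameters(replicas):
    """Average parameters across replicas and broadcast the result back."""
    with torch.no_grad():
        for params in zip(*[r.parameters() for r in replicas]):
            mean = torch.stack([p.data for p in params]).mean(dim=0)
            for p in params:
                p.data.copy_(mean)

base = nn.Linear(10, 2)
replicas = [copy.deepcopy(base) for _ in range(4)]   # one replica per worker
optimizers = [torch.optim.SGD(r.parameters(), lr=0.05) for r in replicas]
loss_fn = nn.CrossEntropyLoss()

local_steps = 8   # parameters are synchronized only every `local_steps` updates
for round_idx in range(10):
    for replica, opt in zip(replicas, optimizers):
        for _ in range(local_steps):                 # independent local updates
            x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
            opt.zero_grad()
            loss_fn(replica(x), y).backward()
            opt.step()
    average_parameters(replicas)                     # periodic synchronization
```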


Optimization Time vs. Training and Inference Time Efficiency

FlexFlow automatically determines a parallelization strategy given the network architecture and the hardware setup. In particular, FlexFlow introduces an additional optimization step that uses simulation and guided search to decide on a near-optimal parallelization strategy (Lu2017, Jia2018). This optimize-then-parallelize framework has been extended by various subsequent systems to include memory constraints on devices as well as to consider devices with heterogeneous compute power and communication links (Kraft2020, Neil2020). Additionally, there are approaches that use an optimization step to tailor a model for inference. MorphNet, for instance, iteratively resizes a network based on resource constraints expressed in terms of either model size or computational requirements (Gordon2018). MnasNet and NetAdapt further extend this approach to take into account particulars of the edge device that the network is to be deployed on (Tan2019).
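As a schematic of the optimize-then-parallelize idea (not FlexFlow's actual algorithm or API), the sketch below enumerates a few candidate parallelization strategies, scores each with a toy simulated cost model, and keeps the cheapest. The strategy space and cost function are entirely hypothetical.

```python
from itertools import product

# Hypothetical strategy space: how to parallelize each of three layers.
LAYERS = ["conv1", "conv2", "fc"]
CHOICES = ["data_parallel", "model_parallel"]

def simulated_cost(strategy):
    """Toy stand-in for a simulation-based cost model: compute plus communication."""
    compute = sum(1.0 if s == "data_parallel" else 1.4 for s in strategy)
    comm = sum(0.8 if s == "data_parallel" else 0.3 for s in strategy)
    return compute + comm

# Guided search is replaced here by exhaustive enumeration over the tiny space.
best = min(product(CHOICES, repeat=len(LAYERS)), key=simulated_cost)
print(dict(zip(LAYERS, best)), simulated_cost(best))
```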


Training Time vs. Memory Efficiency

Many approaches utilize the observation that intermediate results (produced during the forward pass) do not need to be stored but can instead be recomputed when needed. This reduces the overall memory needed to train a neural network at the cost of some extra training time. In particular, there are methods that store only the outputs of equidistant layers (i.e., checkpoints) and can train a network with sublinear memory at the cost of an additional forward pass (Griewank2000, Chen2016); a minimal sketch of this idea appears at the end of this section. More recently, Checkmate generalized this framework and can find an optimal checkpointing strategy given any memory budget (Jain2020). Another set of solutions reduces the memory overhead of deep neural networks by offloading intermediate results to CPU memory (Rhu2016). This again adds some training time overhead, as the results need to be reread from a slower memory subsystem.

We will present opportunities to apply established techniques from database management, such as vectorized processing, end-to-end optimization, and layout design, to further improve these classes of tradeoffs in deep learning. We will also highlight how paradigms from the data management world such as standardized benchmarks, query languages, and declarative interfaces can further extend the usability of deep learning models.
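As a concrete illustration of the recompute-instead-of-store strategy discussed above, the sketch below uses PyTorch's torch.utils.checkpoint utilities to keep activations only at a few segment boundaries of a deep sequential stack; everything in between is recomputed during the backward pass. The network depth, segment count, and input shape are placeholder choices.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack whose intermediate results would normally all be kept for backward.
layers = [nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(40)]
model = nn.Sequential(*layers)

x = torch.randn(64, 256, requires_grad=True)

# Keep activations only at 4 segment boundaries; the rest are recomputed on the
# backward pass, trading extra training time for lower memory usage.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
```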