The 2017 ACM International Symposium on Field-Programmable Gate Arrays (FPGA 2017), held in Monterey, California, featured an important presentation on a chip development that may well become the hardware state of the art for deep learning. The talk was backed by a research paper, Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?, and if the work is pushed forward by further research or better chips from the industry, we might be in for an FPGA era in Artificial Intelligence (AI). I recommend a close reading of this paper.
For this post I’ve chosen to refer to the article where I first found it, on the excellent technology website The Next Platform. The article’s title matches the paper’s, so a separate link line is hardly needed, but the reader will find the link here.
That article is well written and gives a condensed, succinct picture of a complex development within an already complex and difficult subject. Below I reproduce some of the paragraphs I found most insightful.
Continued exponential growth of digital data of images, videos, and speech from sources such as social media and the internet-of-things is driving the need for analytics to make that data understandable and actionable.
Data analytics often rely on machine learning (ML) algorithms. Among ML algorithms, deep convolutional neural networks (DNNs) offer state-of-the-art accuracies for important image classification tasks and are becoming widely adopted.
At the recent International Symposium on Field Programmable Gate Arrays (ISFPGA), Dr. Eriko Nurvitadhi from Intel Accelerator Architecture Lab (AAL), presented research on Can FPGAs beat GPUs in Accelerating Next-Generation Deep Neural Networks. Their research evaluates emerging DNN algorithms on two generations of Intel FPGAs (Intel Arria10 and Intel Stratix 10) against the latest highest performance NVIDIA Titan X Pascal* Graphics Processing Unit (GPU).
NVIDIA GPU hardware such as the Titan X Pascal has demonstrated remarkable performance on deep learning workloads such as Deep Convolutional Neural Networks (DNNs). However, this has come at a cost: large volumes of data and high computational and energy costs. An alternative that achieves similar performance, even with lower precision and less input data, would be a welcome development for many in the AI industry. The development described in this article and paper points in that direction.
(…) While AI and DNN research favors using GPUs, we found that there is a perfect fit between the application domain and Intel’s next generation FPGA architecture. We looked at upcoming FPGA technology advances, the rapid pace of innovation in DNN algorithms, and considered whether future high-performance FPGAs will outperform GPUs for next-generation DNNs. Our research found that FPGA performs very well in DNN research and can be applicable in research areas such as AI, big data or machine learning which requires analyzing large amounts of data. The tested Intel Stratix 10 FPGA outperforms the GPU when using pruned or compact data types versus full 32 bit floating point data (FP32). In addition to performance, FPGAs are powerful because they are adaptable and make it easy to implement changes by reusing an existing chip which lets a team go from an idea to prototype in six months—versus 18 months to build an ASIC.”
Neural networks can be formulated as graphs of neurons interconnected by weighted edges. Each neuron and edge is associated with an activation value and weight, respectively. The graph is structured as layers of neurons. An example is shown in Figure 1.
(…) The computation heavily relies on multiply-accumulate operations. The DNN computation consists of forward and backward passes. The forward pass takes a sample at the input layer, goes through all hidden layers, and produces a prediction at the output layer. For inference, only the forward pass is needed to obtain a prediction for a given sample. For training, the prediction error from the forward pass is then fed back during the backward pass to update the network weights – this is called the back-propagation algorithm. Training iteratively does forward and backward passes to refine network weights until the desired accuracy is achieved.
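To make the quoted description concrete, here is a minimal sketch of the forward and backward passes on a toy network with one hidden layer. Everything in it is my own illustrative assumption rather than anything taken from the article or the paper: the layer sizes, the fixed starting weights, the learning rate, the squared-error loss and the toy data (target = x0 + x1). It only shows that the work reduces to multiply-accumulate operations and that training repeats forward and backward passes.

```python
def mac(weights, inputs):
    """Multiply-accumulate: the core operation of every DNN layer."""
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc += w * x
    return acc

def forward(x, w_hidden, w_out):
    """Forward pass: input layer -> hidden layer (ReLU) -> scalar prediction."""
    hidden = [max(0.0, mac(row, x)) for row in w_hidden]
    return hidden, mac(w_out, hidden)

def train_step(x, target, w_hidden, w_out, lr=0.02):
    """One forward pass followed by one backward pass (back-propagation)."""
    hidden, pred = forward(x, w_hidden, w_out)
    err = pred - target  # gradient of the loss 0.5 * err**2 with respect to pred
    # Error fed back to each hidden neuron; ReLU blocks it where the neuron was inactive.
    grad_hidden = [err * w_out[j] if hidden[j] > 0.0 else 0.0
                   for j in range(len(hidden))]
    for j in range(len(w_out)):
        w_out[j] -= lr * err * hidden[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] -= lr * grad_hidden[j] * x[i]
    return 0.5 * err * err

# Hypothetical starting weights and toy samples (assumptions for illustration only).
w_hidden = [[0.3, 0.1], [0.2, 0.4], [-0.1, 0.5]]
w_out = [0.2, 0.3, 0.1]
samples = [([1.0, 2.0], 3.0), ([2.0, 1.0], 3.0), ([0.5, 0.5], 1.0)]

losses = []
for epoch in range(200):
    losses.append(sum(train_step(x, t, w_hidden, w_out) for x, t in samples))

print("first epoch loss:", losses[0], "last epoch loss:", losses[-1])
```

For inference only the `forward` function would be needed; `train_step` is what adds the backward pass.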
Given these heavy computational and data requirements, we would be led to think that once a hardware architecture had proved itself, it would become the undisputed default choice for anyone selecting deep learning and AI hardware. But the FPGA developers did not concede; they sought the changes needed in their architecture to reach performance on par with GPUs. And, by the paper’s projections, they succeeded:
Hardware: While FPGAs provide superior energy efficiency (Performance/Watt) compared to high-end GPUs, they are not known for offering top peak floating-point performance. FPGA technology is advancing rapidly. The upcoming Intel Stratix 10 FPGA offers more than 5,000 hardened floating-point units (DSPs), over 28MB of on-chip RAMs (M20Ks), integration with high-bandwidth memories (up to 4x250GB/s/stack or 1TB/s), and improved frequency from the new HyperFlex technology. Intel FPGAs offer a comprehensive software ecosystem that ranges from low-level Hardware Description languages to higher level software development environments with OpenCL, C, and C++. Intel will further align the FPGA with Intel’s machine learning ecosystem and traditional frameworks such as Caffe, which is offered today, and with others coming shortly, leveraging the MKL-DNN library. The Intel Stratix 10, based on 14nm Intel technology, has a peak of 9.2 TFLOP/s in FP32 throughput. In comparison, the latest Titan X Pascal GPU offers 11TFLOPs in FP32 throughput.
Indeed, low-precision data types are gaining traction within some communities of DNN developers:
Emerging DNN Algorithms: Deeper networks have improved accuracy, but greatly increase the number of parameters and model sizes. This increases the computational, memory bandwidth, and storage demands. As such, the trends have shifted towards more efficient DNNs. An emerging trend is adoption of compact low precision data types, much less than 32-bits. 16-bit and 8-bit data types are becoming the new norm, as they are supported by DNN software frameworks (e.g., TensorFlow)
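As a rough illustration of what moving from 32-bit floats to an 8-bit data type involves, here is a sketch of symmetric linear quantization. This is one common scheme chosen by me for illustration; the article does not say which scheme any particular framework uses, and the weight values are made up.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to signed 8-bit integers."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest step and clamp to the symmetric int8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the 8-bit integers back to approximate float values."""
    return [qi * scale for qi in q]

weights = [0.82, -0.41, 0.05, -1.27, 0.0, 0.63]  # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q, scale, max_error)
```

Each weight now needs 8 bits instead of 32, at the price of a rounding error of at most half a quantization step.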
This trend is compounded by efficiency gains that researchers have found in exploiting sparsity (a large share of zeros within a large numerical matrix), obtained by pruning neurons and by applying Rectified Linear Units (ReLU) as the activation function:
Moreover, researchers have shown continued accuracy improvements for extremely low precision 2-bit ternary and 1-bit binary DNNs, where values are constrained to (0,+1,-1) or (+1,-1), respectively. Dr. Nurvitadhi co-authored a recent work that shows, for the first time, ternary DNN can achieve state-of-the-art (i.e., ResNet) accuracy for the well-known ImageNet dataset. Another emerging trend introduces sparsity (the presence of zeros) in DNN neurons and weights by techniques such as pruning, ReLU, and ternarization, which can lead to DNNs with ~50% to ~90% zeros. Since it is unnecessary to compute on such zero values, performance improvements can be achieved if the hardware that executes such sparse DNNs can skip zero computations efficiently.
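The zero-skipping idea in the passage above can be sketched in a few lines. The pruning threshold and the numbers are my own assumptions, and a software loop like this only counts the operations that could be skipped; it says nothing about how real hardware schedules them.

```python
def relu(values):
    """ReLU activation: negative values become zero, which creates sparsity."""
    return [v if v > 0.0 else 0.0 for v in values]

def prune(weights, threshold):
    """Magnitude pruning: zero every weight whose magnitude is below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def sparse_dot(weights, activations):
    """Dot product that skips any term with a zero operand.

    Returns the result and the number of multiply-accumulates actually executed.
    """
    acc, macs = 0.0, 0
    for w, a in zip(weights, activations):
        if w == 0.0 or a == 0.0:
            continue  # nothing to compute for this pair
        acc += w * a
        macs += 1
    return acc, macs

# Hypothetical values, chosen only to show the effect.
weights = prune([0.9, -0.02, 0.4, 0.01, -0.7, 0.03, 0.05, -0.6], threshold=0.1)
activations = relu([1.5, 2.0, -0.3, 0.8, 1.1, -2.0, 0.4, 0.2])

result, macs = sparse_dot(weights, activations)
dense = sum(w * a for w, a in zip(weights, activations))
print(result, dense, macs, "of", len(weights), "MACs executed")
```

In this toy case only 3 of the 8 multiply-accumulates are executed, and the result matches the dense computation.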
The emerging low precision and sparse DNN algorithms offer orders of magnitude algorithmic efficiency improvement over the traditional dense FP32 DNNs, but they introduce irregular parallelism and custom data types which are difficult for GPUs to handle. In contrast, FPGAs are designed for extreme customizability and shine when running irregular parallelism and custom data types. Such trends make future FPGAs a viable platform for running DNN, AI and ML applications. “FPGA-specific Machine Learning algorithms have more head room,” states Huang. Figure 2 illustrates FPGA’s extreme customizability (2A), enabling efficient implementations of emerging DNNs (2B).
The rest of the article provides further technical detail about these developments. It is a worthwhile read for the more technical engineer, hopefully a reader of this blog. I will instead turn to the paper for some deeper remarks and a concluding note.
Study hardware and methodology
| Type | Intel Arria 10 1150 FPGA | Intel Stratix 10 2800 FPGA | Titan X Pascal GPU |
|---|---|---|---|
| Peak FP32 (TFLOP/s) | 1.36 | 9.2 | 11 |
| On-chip RAMs | 6.6 MB (Mass) | 28.6 MB (M20Ks) | 13.5 MB (RF, SM, L2) |
| Memory BW | Assume same as Titan X | Assume same as Titan X | 480 GB/s |
GPU: Used known library (cuBLAS) or framework (Torch with cuDNN)
FPGA: Estimated using Quartus Early Beta release and PowerPlay
To begin with, here is a short paragraph from the paper that nicely illustrates the advantage of the ternary DNN over the binary DNN:
Support for Ternarized DNNs Lastly, the support for Ternarized DNNs (TNNs) in our architecture is as follows. In TNNs, weights are constrained to 0, +1, or -1, but neurons are still using N-bit precision. In this case, we represent ternary weights using 2 bits, with 1 bit indicating whether the value is 0 and another bit indicating whether the value is +1 or -1 (as in BNNs). The PE uses the 1-bit zero indicator in the same way as the metadata bits used to exploit sparsity in Section 3.2.2. As such, whenever there is an operation against a zero weight, the operation will be skipped and not be scheduled onto the PE’s compute unit(s). If the weight is either +1 or -1, instead of performing multiplication against N-bit precision neuron values, we simplify the computation by negating the sign of the neuron value (i.e., a sign bit flip if neuron is floating-point, or negation if it is fixed point). As such, PE for ternary computation does not require a multiplication unit (…)
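The multiplication-free processing element described above can be mimicked in software. The sketch below is my own reading of the quoted paragraph, not the paper's implementation: the function names, the bit layout of the tuple and the integer neuron values are all assumptions made for illustration.

```python
def encode_ternary(w):
    """Encode a weight in {-1, 0, +1} as two bits: (is_zero, is_negative)."""
    if w not in (-1, 0, 1):
        raise ValueError("ternary weights must be -1, 0, or +1")
    return (1 if w == 0 else 0, 1 if w < 0 else 0)

def ternary_dot(encoded_weights, neurons):
    """Accumulate neuron values using only skips, negations and additions."""
    acc, ops = 0, 0
    for (is_zero, is_negative), n in zip(encoded_weights, neurons):
        if is_zero:
            continue  # zero weight: the operation is never scheduled
        acc += -n if is_negative else n  # negation replaces the multiplication
        ops += 1
    return acc, ops

weights = [1, 0, -1, 0, 0, 1, -1, 0]   # hypothetical ternary weights
neurons = [3, 7, 2, 5, 1, 4, 6, 8]     # hypothetical fixed-point neuron values

acc, ops = ternary_dot([encode_ternary(w) for w in weights], neurons)
reference = sum(w * n for w, n in zip(weights, neurons))
print(acc, reference, ops)
```

The result equals the ordinary dot product, yet no multiplication appears in `ternary_dot`, and half of the eight weights are skipped outright.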
Next comes a comparison of the methodology used for classic DNNs with the compact data types of the FPGA approach, which uses ternary operations and takes advantage of sparse matrix operations:
and…the FPGA speedup empirically demonstrated:
Finally, the paper closes with some important discussion of the main trends in FPGA hardware development. Interesting lines of further research, indeed:
Mathematical Transforms (e.g., Winograd). The first trend is in optimizations using mathematical transforms. In particular, Winograd transformation has been shown to be amenable to small DNN filters (e.g., 3×3) that are common in state-of-the-art DNNs. Fast Fourier Transforms (FFTs) have also been shown to be amenable for larger filters (5×5 and above), which are still used in some DNNs. FPGAs have been known to be an efficient platform for FFTs, and one could expect that they would be well-suited for Winograd transformations as well. These transforms are often computable in a streaming data fashion and involve an arbitrary set of mathematical operators. And, there are many possible transformation parameters that lead to different compositions of mathematical operators. Such computation properties (arbitrary composition of operations on streaming data) are likely to be amenable to FPGAs.
Compression. There are various compression techniques that have been proposed for DNNs, such as weight sharing, hashing, etc. These techniques require fine-grained data accesses, with indexing and indirection on lookup tables, which a FPGA fabric is particularly good at.
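For readers who have not met the Winograd transform, here is a sketch of its smallest one-dimensional case, usually written F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. The paper talks about 3×3 filters, which nest this 1D form in two dimensions; I show only the 1D building block, and the sample data are made up.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]

def winograd_f23(d, g):
    """Same two outputs using 4 multiplications (Winograd F(2,3))."""
    # Filter transform: can be precomputed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform (additions only) followed by 4 element-wise multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: additions and subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]

d = [1.0, 2.0, 3.0, 4.0]   # hypothetical input tile
g = [0.5, -1.0, 2.0]       # hypothetical filter taps

print(direct_f23(d, g), winograd_f23(d, g))
```

The extra additions are the "arbitrary composition of operations on streaming data" the quoted passage refers to: cheap glue logic around fewer multipliers.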
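Weight sharing is easy to picture with a small lookup table. In the sketch below the 4-entry codebook is fixed by hand and entirely hypothetical; in practice such tables are typically learned from the trained weights, which I do not attempt here.

```python
def share_weights(weights, codebook):
    """Replace each weight by the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda k: abs(codebook[k] - w))
            for w in weights]

def shared_dot(indices, codebook, activations):
    """Dot product where every weight is fetched through a lookup-table indirection."""
    return sum(codebook[i] * a for i, a in zip(indices, activations))

codebook = [-0.5, 0.0, 0.5, 1.0]                  # 4 entries, so 2-bit indices suffice
weights = [0.48, -0.52, 0.03, 0.97, 0.55, -0.45]  # hypothetical original weights
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

indices = share_weights(weights, codebook)
print(indices, shared_dot(indices, codebook, activations))
```

Only the 2-bit indices and the tiny table need to be stored; the per-weight indexing is the fine-grained, indirect access pattern the quoted passage mentions.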
FPGA vs. GPU Studies. Finally, there are existing studies that compare FPGAs against GPUs. The work in  compares BLAS matrix operations among CPU, FPGA, and GPUs. The work in  compare Neural Networks implemented on CPU, FPGA, GPU, and ASIC. However, these studies target older generation FPGAs and GPUs, while we target the latest Stratix 10 FPGA and Titan X Pascal GPU. Moreover, these prior studies do not focus on all emerging DNNs that are studied in this paper.
Can FPGAs beat GPUs in performance for next-generation DNNs? Our evaluation of a selection of emerging DNN algorithms on two generations of FPGAs (Arria 10 and Stratix 10) and the latest Titan X GPU shows that current trends in DNN algorithms may favor FPGAs, and that FPGAs may even offer superior performance.
Our results show that projected Stratix 10 performance is 10%, 50%, and 5.4x better in performance (TOP/sec) than Titan X Pascal GPU on GEMM operations for pruned, Int6, and binarized DNNs, respectively. We also presented a case study on Ternary ResNet, which relies on sparse GEMM on 2-bit weights, and achieved accuracy within ~1% of the full-precision ResNet. On TernaryResNet, the Stratix 10 FPGA is projected to deliver 60% better performance over Titan X Pascal GPU, while being 2.3x better in performance/watt. Our results indicate that FPGAs may become the platform of choice for accelerating DNNs.
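A small back-of-the-envelope check on the Ternary ResNet numbers: if the 60% performance gain and the 2.3x performance/watt gain are both measured against the same GPU baseline on the same workload (my assumption; the quoted text does not spell this out), then the implied power draw of the FPGA relative to the GPU follows by simple division.

```python
def implied_power_ratio(perf_ratio, perf_per_watt_ratio):
    """Relative power = relative performance / relative performance-per-watt."""
    return perf_ratio / perf_per_watt_ratio

# Figures quoted from the paper's conclusion, expressed as ratios to the GPU.
gemm_speedups = {"pruned": 1.10, "Int6": 1.50, "binarized": 5.4}
ternary_resnet_perf = 1.60           # "60% better performance"
ternary_resnet_perf_per_watt = 2.3   # "2.3x better in performance/watt"

power_ratio = implied_power_ratio(ternary_resnet_perf, ternary_resnet_perf_per_watt)
print(round(power_ratio, 2))
```

Under that assumption the projected FPGA would draw roughly 70% of the GPU's power while doing 60% more work. This is my own inference, not a figure stated in the paper.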
A glimpse of a possible future for the acceleration of deep learning and AI hardware was today revealed here in The Information Age.