Most compute technologists acknowledge that the world’s data is doubling every two years. By 2020, the datasets used in advanced applications are expected to exceed 21 zettabytes. These applications include derivatives of artificial intelligence (AI) such as machine learning, robotics, autonomous driving, and analytics, as well as financial markets.
These applications all rely on Big Data, which requires real-time computation of extremely small but heavily layered software algorithms that are ill suited, in both performance and efficiency, to today’s multi-core processing technologies and architectures.
The Problem: “Where did all the performance go?”
Figure 1 – Multi-core CPU Performance is not keeping up
Until the early 2000s, computer performance was tied to the clock frequency of the processor, with Moore’s Law allowing frequency to double every 18 months. Post 2005, Moore’s Law instead allowed the number of processor cores to double every 18 months. Conventional thinking held that advances in compute performance would then come from adding additional cores to a traditional processor. The resulting outcome, however, was one of diminishing returns (see Figure 1): the more cores added to boost the performance of a traditional software algorithm, the smaller the incremental gain (with some notable exceptions). So core counts stagnated. Rather than adding more cores, each generation’s doubling of transistor count was instead used to make the “few” cores marginally faster. The exponential increases in transistor resources are wasted, used only to give us small linear gains in performance (Figure 2).
Figure 2 – The processing requirements of Big Data & Machine Learning require that we start using the transistor budgets more efficiently
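The diminishing returns from adding cores are commonly quantified with Amdahl’s law, which bounds the achievable speedup by the serial fraction of the code. A minimal sketch (illustrative numbers only, not measurements of any particular processor):

```python
# Amdahl's law: the speedup from n cores when a fraction p of the
# work is parallelizable. The serial fraction (1 - p) caps the gain,
# which is why piling on cores yields smaller and smaller increments.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Even with 90% of the code parallelizable, 64 cores deliver
# well under a 10x speedup.
for n in (2, 4, 16, 64):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

Doubling the core count from 16 to 64 here buys well under a 2× improvement, matching the stagnation the figure illustrates.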
The amount of data that needs to be processed is increasing at an exponential rate. The current generation of the “cloud” is made up of oceans of racks filled with “few core” servers. The result is a plethora of networked servers that is creating massive data-center sprawl, and with it significant increases in power usage and carbon footprint throughout the world.
Critically, an inspection of today’s Big Data and Machine Learning workloads shows that they are vastly different from the workloads the processor was designed and optimized, over decades, to handle. Big Data/Machine Learning code size is measured in kLOC (thousands of lines of code) versus the MLOC (millions of lines of code) of traditional software (think of your favorite office suite or even your favorite operating system). For example, a simple LOC grep on the popular BigDataBench 3.2 from the Chinese Academy of Sciences1, covering a dozen different benchmarks for the Spark applications framework, shows a cumulative total of under 1,000 lines of Scala code. The Yahoo! Streaming Benchmark: fewer than 300 lines of Scala code. Google’s TensorFlow MNIST: the entire tutorial, multiple copies included, is a shade over 1,000 lines of Python code.
The key observations:
• The workloads we need to process in Big Data/ML are uniquely different: each is individually small, and performance comes from replicating this code across many, many servers.
• The amount of data that needs to be processed is increasing exponentially every year, is streaming in nature, and carries real-time requirements (think autonomous cars or timely mobile advertisements).
• Thanks to the R&D of the silicon vendors, we still enjoy exponential increases in transistor counts, yet we are getting only linear increases in compute performance, generation after generation of processor.
What is needed is a new processing architecture suited to today’s workloads, one that offers a scalable, massively parallel, sea-of-cores approach. Let’s use the transistor budgets to provide more computation by adding a massive number of cores rather than trying to speed up a few cores.
CORNAMI, an AI high-performance computing company in Silicon Valley, is addressing these issues in processing and taking compute performance to extraordinary levels, while greatly reducing power usage, latency, and footprint.
CORNAMI has developed and patented a new computing architecture using concurrency technology that uniquely changes software performance and reduces power usage, latency, and platform footprint.
The result is a massively parallel architecture with independent decision-making capabilities at each processing core, interspersed with high-speed memory, and all interconnected by a biologically inspired network to produce a scalable sea-of-cores. It’s based on a unique fabric developed by CORNAMI, called TruStream Compute Fabric (TSCF), which is extensible across multiple chips, boards, and racks, with each core being independently programmable.
By using the TruStream Programming Model (TSPM), multi-core processor resources are abstracted into a common homogeneous core pool. TruStream is implemented in both software and hardware and runs across our TruStream Compute Fabric. Programmers can easily implement concurrency through CORNAMI’s TruStream control structures, which are embedded in higher-level standard languages.
The Heart of a Machine Learning Processing Engine
Recent announcements from Microsoft2 on their Convolution Neural Network Accelerator, from Google on their Tensor Processing Unit (TPU)3 for neural network acceleration, and from NVIDIA on their Volta4 GPU for Deep Learning all reveal a similar approach to accelerating machine learning algorithms.
The algorithms to be accelerated are all variations of a three-dimensional filter (Figure 3). The heavy lifter, the algorithm that consumes the largest proportion of CPU/GPU cycles or silicon area, is the 3D convolutional filter – this is the heart and soul of Convolution Neural Nets.
Figure 3 – Convolutional Neural Network (CNN) showing 3D Convolution + 3D ReLu + 3D Pooling.
Image5 reprinted under the creative commons attribution license from: Gao, F.; Huang, T.; Wang, J.; Sun, J.; Hussain, A.; Yang,
E. Dual-Branch Deep Convolution Neural Network for Polarimetric SAR Image Classification. Appl. Sci. 2017, 7, 447.
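To make the compute pattern concrete, the multiply-accumulate loop nest at the core of such a convolutional filter can be sketched directly. This naive NumPy version is an illustration of the structure only, not an optimized kernel and not any vendor’s implementation:

```python
import numpy as np

def conv3d(volume, kernel):
    """Direct (naive) 3D convolution: the multiply-accumulate loop nest
    that dominates CNN compute. volume: (D, H, W); kernel: (kd, kh, kw).
    Returns the 'valid' (no-padding) output volume."""
    kd, kh, kw = kernel.shape
    D, H, W = volume.shape
    out = np.zeros((D - kd + 1, H - kh + 1, W - kw + 1))
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # one output voxel = one dot product over a kernel window
                out[z, y, x] = np.sum(volume[z:z+kd, y:y+kh, x:x+kw] * kernel)
    return out
```

Every output voxel is an independent dot product, which is exactly the kind of regular, data-parallel work the accelerators below are built around.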
To accelerate these algorithms, a two-dimensional silicon structure known as a systolic array6 is used. Figure 4 shows two simple 3×3 matrices that will be multiplied together, and Figure 5 shows the systolic array of 3×3 multiply-accumulate units that will perform the multiplications. For example, the output matrix element C11 contains the result A11*B11 + A12*B21 + A13*B31 after the input data A and B has finished streaming in. Larger dimensional arrays allow a massive amount of parallelism; more intricate interconnections among array elements increase the bandwidth; more sophisticated functions per element allow both convolution and other functions to be performed; and more elaborate control circuitry allows results to be streamed out while data streams in and computations are performed – these are the techniques used in production parts. Strip away the complexities and the core idea is simple. Note that the input matrices are streamed, in a specific order, into the left and top of the array, with the results streaming out of the array. The attributes of a systolic array – the ability to stream large amounts of data into and out of the structure continuously, the simultaneous operation of all the elements, and the fact that intermediate values are kept internal to the array and never need to be “saved” to memory – are what provide the performance required for CNNs.
3 https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
4 https://devblogs.nvidia.com/parallelforall/inside-volta/
6 “Systolic arrays for (VLSI)”, H.T. Kung and Charles E. Leiserson, 1978
A11 A12 A13     B11 B12 B13     C11 C12 C13
A21 A22 A23  ×  B21 B22 B23  =  C21 C22 C23
A31 A32 A33     B31 B32 B33     C31 C32 C33

e.g. C31 = A31×B11 + A32×B21 + A33×B31
Figure 4 – 3×3 Matrix Multiply
[Figure content: the rows of A stream into the array from the left and the columns of B from the top, each successive row and column staggered by one cycle.]
Figure 5 – A systolic array of 3×3 elements that performs streaming matrix multiplication and accumulation
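The streaming behavior described above can be simulated in software. The sketch below is a behavioral NumPy model of an output-stationary n×n systolic array (an illustration of the technique, not a description of any vendor’s silicon): A’s rows enter from the left and B’s columns from the top, each delayed one cycle per row/column, while every cell multiply-accumulates and forwards its inputs right and down.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of an output-stationary n x n systolic
    array computing C = A x B. Row i of A enters from the left delayed
    by i cycles; column j of B enters from the top delayed by j cycles."""
    n = A.shape[0]
    C = np.zeros((n, n))          # accumulator held inside each cell
    a_reg = np.zeros((n, n))      # A value latched in each cell
    b_reg = np.zeros((n, n))      # B value latched in each cell
    for t in range(3 * n - 2):    # cycles needed to drain all inputs
        new_a = np.zeros((n, n))
        new_b = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                if j == 0:        # left edge: inject A[i][t - i] (staggered)
                    k = t - i
                    new_a[i, j] = A[i, k] if 0 <= k < n else 0.0
                else:             # interior: take A from the cell to the left
                    new_a[i, j] = a_reg[i, j - 1]
                if i == 0:        # top edge: inject B[t - j][j] (staggered)
                    k = t - j
                    new_b[i, j] = B[k, j] if 0 <= k < n else 0.0
                else:             # interior: take B from the cell above
                    new_b[i, j] = b_reg[i - 1, j]
        a_reg, b_reg = new_a, new_b
        C += a_reg * b_reg        # all cells multiply-accumulate in parallel
    return C
```

After 3n−2 cycles, cell (i, j) holds C[i][j]; note that every intermediate value stays inside the array, never touching memory, which is precisely the attribute that delivers the CNN performance described above.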
We can now derive a couple of key insights:
1. Each element of the systolic array is a form of cellular automaton that reacts to the data around it using simple rules.
2. The systolic arrays in the announced high-performance Machine Learning silicon parts treat the array as a fixed-function silicon accelerator. The dimensions, element functionality, and interconnections were fixed when the ASIC mask set was created.
3. The rapid-fire pace of announcements of new silicon accelerators for Machine Learning shows that the algorithms are still changing at a high rate. New algorithms are being published almost weekly with newer, more efficient approaches to processing. Committing a machine learning algorithm to silicon dooms it to rapid obsolescence with the next week’s announcement.
What if we allow the programmer to define the dimensionality, functionality, and interconnectivity of any systolic array via software? We keep the high performance that Machine Learning requires, which is a side effect of the systolic array, yet allow the software programmer to keep pace with new algorithmic developments by making changes entirely in software.
The additional benefit is that this software approach applies to a large class of problems, well beyond just Machine Learning, in which operations are performed on elements – cells, pixels, voxels – and their neighbors in a two-or-higher-dimensional array. Examples of these sorts of problems are:
• Three-dimensional (3D) modeling
• Finite element analysis
• Error correction coding
• Modeling physical reality7
The Simplest Systolic Array – The Game of Life
Let us walk through a complete example of this approach using the TruStream Programming Model (TSPM) for software execution on a sea-of-cores TruStream Compute Fabric (TSCF). To illustrate the techniques involved in applying TruStreams to these problems, we’ve chosen the Game-of-Life cellular automaton8, which consists of a two-dimensional grid of cells, each interacting with only its nearest neighbors. This should be familiar to almost all software programmers and is simple enough to illustrate the mechanics. When the TSPM program for the Game of Life runs on the TSCF, each Game-of-Life cell runs on its own unique TSCF core.
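For reference, the automaton itself is tiny. Below is a conventional single-machine NumPy sketch of one Game-of-Life generation on a wrap-around grid; this is the plain algorithm for comparison, not the TSPM version, where each cell would instead run its rule on its own core:

```python
import numpy as np

def life_step(grid):
    """One generation of Conway's Game of Life on a toroidal grid.
    Every cell applies the same simple rule to its eight neighbors:
    the cellular-automaton behavior that maps one cell per core."""
    # count the eight neighbors of every cell using wrap-around shifts
    neighbors = sum(np.roll(np.roll(grid, di, axis=0), dj, axis=1)
                    for di in (-1, 0, 1) for dj in (-1, 0, 1)
                    if (di, dj) != (0, 0))
    # birth on exactly 3 live neighbors; survival on 2 or 3
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)
```

A quick sanity check is the classic “blinker”: a vertical bar of three live cells becomes a horizontal bar after one step, and vice versa.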
The TruStream Programming Model
The problem of assembling systolic arrays is very much akin to the problem factory architects have been dealing with, well, since the days of Henry Ford. Factory architects see the problem as one of organizing a collection of entities – machines, robots, human workers – that interact through streams of widgets flowing from one entity to another. And when these architects draw diagrams of their factories, they draw block diagrams, not flow charts. There is, nevertheless, a role for flow charts in architecting factories: they are perfectly suited to describing the sequential behavior of individual machines, robots, and human workers. To summarize:
Determining the sequence of tasks a worker is to perform and laying out a factory are two fundamentally different activities.
It is these observations that inspire our view of computing:
A computer is a factory for assembling data values.
and inspire our view of programming:
Some portions of a program are most easily and naturally expressed as flow charts.
Some portions of a program are most easily and naturally expressed as block diagrams.
Some portions of a program are most easily and naturally expressed in the TSPM thread domain.
Some portions of a program are most easily and naturally expressed in the TSPM stream domain.
That brings us to the TruStream Programming Model, which is based upon five C++ classes:
• threadModule: Encapsulates thread-domain code, i.e., sequential code with TruStream gets and puts. NO parallelism is expressed here.
• inputStream<T>: A threadModule input.
• outputStream<T>: A threadModule output.
• streamModule: Encapsulates stream-domain code. Defines a TruStream topology (block diagram). ALL parallelism is expressed here.
• A connection class that connects one or more threadModule outputs to one or more threadModule inputs; it appears only inside a streamModule.
Figure 6 – The C++ Classes that make up the TSPM for C++
To gain an intuitive understanding of these classes, refer to Figure 7 and Figure 8, which provide graphical representations of the five types of TruStream objects. Figure 7 also illustrates two key features of the TruStream Programming Model:
TruStream topologies may be cyclic (or acyclic).
Hierarchical topologies are created by nesting streamModules.
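To make the thread-domain/stream-domain split concrete, here is a loose analogue in plain Python threads and queues. This is NOT the CORNAMI API; the names and wiring are hypothetical, meant only to show sequential code that communicates solely through gets and puts, with all the parallelism expressed in the topology that connects the stages:

```python
import threading
import queue

def doubler(inp, out):
    # "thread-domain" code: an ordinary sequential loop whose only
    # interaction with the outside world is gets and puts
    while True:
        x = inp.get()
        if x is None:          # None marks end-of-stream
            out.put(None)
            return
        out.put(2 * x)

def collector(inp, results):
    # a sink stage that accumulates whatever streams in
    while True:
        x = inp.get()
        if x is None:
            return
        results.append(x)

# "stream-domain" wiring: a two-stage topology, source -> doubler -> collector.
# All parallelism lives here; neither stage above mentions threads at all.
q1, q2, results = queue.Queue(), queue.Queue(), []
stages = [threading.Thread(target=doubler, args=(q1, q2)),
          threading.Thread(target=collector, args=(q2, results))]
for s in stages:
    s.start()
for v in [1, 2, 3, None]:      # stream three values, then end-of-stream
    q1.put(v)
for s in stages:
    s.join()
print(results)                 # [2, 4, 6]
```

Cyclic topologies and nesting, as in Figure 7 and Figure 8, would simply add back-edges and composite stages to this wiring, again without touching the sequential code inside each stage.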