
A New Computing Architecture Is Needed for Artificial Intelligence Applications

Introduction

Most compute technologists acknowledge that the world’s data is doubling every two years. There is also an expectation that by 2020, the datasets used in advanced applications, such as derivatives of artificial intelligence (AI) like machine learning, robotics, autonomous driving, and analytics, as well as in financial markets, will exceed 21 zettabytes.

These types of applications are all reliant on Big Data, which requires real-time computing of extremely small but heavily layered software algorithms that are ill-suited to today’s multi-core processing technologies and architectures, in terms of both performance and efficiency.


The Problem: “Where did all the performance go?”

Figure 1 – Multi-core CPU Performance is not keeping up

Until the early 2000s, computer performance was tied to the clock frequency of the processor, with Moore’s Law allowing frequency to double roughly every 18 months. Post-2005, Moore’s Law instead allowed the number of processor cores to double every 18 months. Conventional thinking was that advances in compute performance would then come from adding additional cores to a traditional processor. However, the result was diminishing returns (see Figure 1): the more cores added to boost performance of a traditional software algorithm, the smaller the incremental gain (with some notable exceptions). So core counts stagnated. Rather than adding more cores, the doubling of the number of transistors each generation was instead used to make the “few” cores marginally faster. The exponential increases in transistor resources are wasted, used only to deliver small linear gains in performance (Figure 2).


Figure 2 – The processing requirements of Big Data & Machine Learning require that we start using the transistor budgets more efficiently

The amount of data that needs to be processed is increasing at an exponential rate. The current generation of the “cloud” is made up of oceans of racks filled with “few core” servers. The result is a plethora of networked servers that creates massive data-center sprawl, and with it significant increases in power usage and carbon footprint throughout the world.

Critically, an inspection of the workloads of Big Data and Machine Learning today shows that they are vastly different from the workloads the processor was designed and optimized over decades to handle. Big Data/Machine Learning code size is measured in kLOC (thousands of lines of code) versus the MLOC (millions of lines of code) of traditional software (think of your favorite office suite or even your favorite operating system). For example, a simple LOC grep on the popular BigDataBench 3.2 from the Chinese Academy of Sciences1, for the SPARK applications framework covering a dozen different benchmarks, shows a cumulative total of under 1,000 lines of Scala code. The Yahoo! Streaming Benchmark: less than 300 lines of Scala code. Google’s TensorFlow MNIST: the entirety of the tutorial, with multiple copies, is a shade over 1,000 lines of Python code.

The key observations:

1. The workloads we need to process in Big Data/ML are uniquely different: individually small, with the performance coming from replicating this code across many, many servers.

2. The amount of data that needs to be processed is increasing exponentially every year, is streaming in nature, and carries real-time requirements (think autonomous cars or timely mobile advertisements).

3. Thanks to the R&D of the silicon vendors, we still have exponential increases in transistor counts.

4. Yet we are getting only linear increases in compute performance, generation after generation of processor.

The premise:

A new processing architecture suited to today’s workloads that offers a scalable, massively parallel, sea-of-cores approach. Let’s use the transistor budgets to provide more computation by adding a massive number of cores rather than trying to speed up a few cores.

1 http://prof.ict.ac.cn/BigDataBench/

CORNAMI, an AI high-performance computing company in Silicon Valley, is addressing these issues in processing and taking compute performance to extraordinary levels, while greatly reducing power usage, latency, and footprint.

CORNAMI has developed and patented a new computing architecture using concurrency technology that uniquely changes software performance and reduces power usage, latency, and platform footprint.

The result is a massively parallel architecture with independent decision-making capabilities at each processing core, interspersed with high-speed memory, and all interconnected by a biologically inspired network to produce a scalable sea-of-cores. It’s based on a unique fabric developed by CORNAMI, called TruStream Compute Fabric (TSCF), which is extensible across multiple chips, boards, and racks, with each core being independently programmable.

By using the TruStream Programming Model (TSPM), multi-core processor resources are abstracted into a common homogenous core pool. TruStream is implemented in both software and hardware and runs across the TruStream Compute Fabric. Programmers can easily implement concurrency through CORNAMI’s TruStream control structures that are embedded in higher-level standard languages.


The Heart of a Machine Learning Processing Engine

Recent announcements from Microsoft2 on their Convolution Neural Network Accelerator, from Google on their Tensor Processing Unit (TPU)3 for neural network acceleration and from NVIDIA on their Volta4 for Deep Learning GPU all reveal a similar approach to accelerating machine learning algorithms.

The algorithms to be accelerated are all variations of a three-dimensional filter (Figure 3). The heavy lifter, the algorithm that consumes the largest proportion of CPU/GPU cycles or silicon area, is a 3D convolutional filter – this is the heart and soul of Convolutional Neural Nets.

Figure 3 – Convolutional Neural Network (CNN) showing 3D Convolution + 3D ReLU + 3D Pooling. Image5 reprinted under the Creative Commons Attribution license from: Gao, F.; Huang, T.; Wang, J.; Sun, J.; Hussain, A.; Yang, E. Dual-Branch Deep Convolution Neural Network for Polarimetric SAR Image Classification. Appl. Sci. 2017, 7, 447.
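To make the computational weight of that filter concrete, the following is a minimal sketch, in plain C++, of a naive single-channel 3D convolution with valid padding. It is illustrative only, not any vendor’s implementation: real CNN layers add channels, strides, padding, and the ReLU and pooling stages shown in Figure 3, and the tensor type and function name here are assumptions made for this example.

```cpp
// Minimal sketch of a naive single-channel 3D convolution (valid padding).
// Illustrative only: real CNN layers add channels, strides, and padding,
// but the multiply-accumulate structure is the same.
#include <vector>

using Volume = std::vector<std::vector<std::vector<float>>>;

Volume conv3d(const Volume& input, const Volume& kernel) {
    const int D  = (int)input.size();
    const int H  = (int)input[0].size();
    const int W  = (int)input[0][0].size();
    const int kD = (int)kernel.size();
    const int kH = (int)kernel[0].size();
    const int kW = (int)kernel[0][0].size();
    const int oD = D - kD + 1, oH = H - kH + 1, oW = W - kW + 1;

    Volume out(oD, std::vector<std::vector<float>>(oH, std::vector<float>(oW, 0.0f)));

    // Six nested loops: every output voxel is a sum of kD*kH*kW multiply-accumulates.
    for (int z = 0; z < oD; ++z)
        for (int y = 0; y < oH; ++y)
            for (int x = 0; x < oW; ++x)
                for (int dz = 0; dz < kD; ++dz)
                    for (int dy = 0; dy < kH; ++dy)
                        for (int dx = 0; dx < kW; ++dx)
                            out[z][y][x] += input[z + dz][y + dy][x + dx]
                                          * kernel[dz][dy][dx];
    return out;
}
```

Even at this simplified level the cost is clear: a 128×128×128 input with a 3×3×3 kernel already requires roughly 54 million multiply-accumulates for a single channel, which is why all three vendors above reach for dedicated silicon.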

To accelerate these algorithms a two-dimensional silicon structure, a silicon accelerator known as a systolic array6, is used. Figure 4 shows two simple 3×3 matrices which will be multiplied together, and Figure 5 the systolic array of 3×3 Multiply-Accumulate Units that will perform the multiplications. For example, the output matrix element C11 contains the result A11*B11 + A12*B21 + A13*B31 after the input data A and B has finished streaming in. Larger dimensional arrays allow a massive amount of parallelism to occur, more intricate interconnections among array elements increase the bandwidth, more sophisticated functions per element allow both convolution and other functions to be performed, and more elaborate control circuitry allows the results to be streamed out as data streams in and computations are performed – these are the techniques used in production parts. Strip away the complexities and the core idea is simple. Note that the input matrices are streamed in a specific order into the left and top of the array, with the results streaming out of the array. The attributes of a systolic array, namely the ability to stream large amounts of data into and out of the structure continuously, the simultaneous operation of all the elements, and the fact that intermediate values are kept internal to the systolic array and do not need to be “saved” to memory, are what provide the performance required for CNNs.

2 https://www.microsoft.com/en-us/research/publication/accelerating-deep-convolutional-neural-networks-using-specialized-hardware/
3 https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
4 https://devblogs.nvidia.com/parallelforall/inside-volta/
5 http://www.mdpi.com/2076-3417/7/5/447
6 “Systolic Arrays (for VLSI)”, H. T. Kung and Charles E. Leiserson, 1978

Figure 4 – 3×3 Matrix Multiply: [A] × [B] = [C], where, for example,
C11 = A11×B11 + A12×B21 + A13×B31
C21 = A21×B11 + A22×B21 + A23×B31
C31 = A31×B11 + A32×B21 + A33×B31

Figure 5 – A systolic array of 3×3 elements that performs streaming matrix multiplication and accumulation
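To make the dataflow concrete, below is a minimal software sketch (not Cornami’s implementation, and not any vendor’s silicon design) that simulates the 3×3 output-stationary systolic array of Figure 5 cycle by cycle: A operands enter skewed from the left, B operands from the top, and every processing element multiply-accumulates in lockstep while forwarding its operands onward. The names and array sizes are illustrative assumptions.

```cpp
// Minimal sketch: cycle-by-cycle simulation of a 3x3 output-stationary
// systolic array computing C = A x B. Each register aReg[i][j] / bReg[i][j]
// models the operand held by processing element (i,j) during one cycle.
#include <array>
#include <cstdio>

constexpr int N = 3;
using Matrix = std::array<std::array<double, N>, N>;

Matrix systolicMultiply(const Matrix& A, const Matrix& B) {
    Matrix C{};                 // each PE (i,j) accumulates its own C[i][j]
    Matrix aReg{}, bReg{};      // operands currently held inside the array

    // 3N-2 cycles are enough for all skewed operands to pass through the array.
    for (int cycle = 0; cycle < 3 * N - 2; ++cycle) {
        // Propagate operands one hop per cycle: A moves right, B moves down.
        for (int i = 0; i < N; ++i)
            for (int j = N - 1; j > 0; --j) aReg[i][j] = aReg[i][j - 1];
        for (int j = 0; j < N; ++j)
            for (int i = N - 1; i > 0; --i) bReg[i][j] = bReg[i - 1][j];

        // Inject the next skewed operands at the left and top edges.
        for (int i = 0; i < N; ++i) {
            int k = cycle - i;                     // row i is delayed by i cycles
            aReg[i][0] = (k >= 0 && k < N) ? A[i][k] : 0.0;
        }
        for (int j = 0; j < N; ++j) {
            int k = cycle - j;                     // column j is delayed by j cycles
            bReg[0][j] = (k >= 0 && k < N) ? B[k][j] : 0.0;
        }

        // Every PE multiply-accumulates the operands passing through it.
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                C[i][j] += aReg[i][j] * bReg[i][j];
    }
    return C;
}

int main() {
    Matrix A = {{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}};
    Matrix B = {{{9, 8, 7}, {6, 5, 4}, {3, 2, 1}}};
    Matrix C = systolicMultiply(A, B);
    for (const auto& row : C) {
        for (double v : row) std::printf("%6.1f ", v);
        std::printf("\n");
    }
}
```

In silicon all nine elements operate simultaneously on every cycle; the nested loops in this sketch merely serialize, for clarity, what the hardware does in parallel, and no intermediate value ever needs to be written back to memory.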

We can now derive a few key insights:

1. Each element of the systolic array is a form of cellular automaton which reacts to the data around it using simple rules.

2. The systolic arrays in the announced high-performance Machine Learning silicon parts treat the array as a fixed-function silicon accelerator. The dimensions, element functionality, and interconnection were fixed when the ASIC mask set was created.

3. The rapid-fire pace of announcements of new silicon accelerators for Machine Learning shows that the algorithms are still changing at a high rate. New algorithms are being published almost weekly with newer, more efficient approaches to processing. Committing a machine learning algorithm to silicon dooms it to rapid obsolescence with the next week’s announcement.

The Premise:

What if we allow the programmer to define the dimensionality, functionality, and interconnectivity of any systolic array via software? Keep the high performance that Machine Learning requires, which is a side effect of the systolic arrays, yet allow the software programmer to keep pace with new algorithmic developments by allowing changes entirely in software.

The additional benefit is that this software approach allows applicability to a large class of problems, well beyond just Machine Learning, in which operations are performed on elements – cells, pixels, voxels – and their neighbors in a two-or-higher-dimensional array. Examples of these sorts of problems are:

• Machine learning
• Image processing
• Compression/Decompression
• Encoding/Decoding
• Three-dimensional (3D) modeling
• Fluid dynamics
• Finite element analysis
• Cryptography
• Error correction coding
• Modeling physical reality7

7 https://en.wikipedia.org/wiki/Digital_physics

The Simplest Systolic Array – The Game of Life

Let us walk through a complete example of this approach using the TruStream Programming Model (TSPM) for software execution on a sea-of-cores TruStream Compute Fabric (TSCF). To illustrate the techniques involved in applying TruStreams to these problems, we’ve chosen the Game-of-Life cellular automaton8, which consists of a two-dimensional grid of cells, each interacting with only its nearest neighbors. This should be familiar to almost all software programmers and is simple enough to illustrate the mechanics. When the TSPM program for the Game of Life runs on the TSCF, each Game-of-Life cell runs on its own unique TSCF core.
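As a reference point for what each cell must compute, here is a minimal sketch of the standard Game-of-Life update rule in plain C++. This is the textbook rule, not Cornami’s TSPM code; in the TSPM version described above, each cell’s rule runs on its own core, and the function name here is just an illustrative choice.

```cpp
// Minimal sketch of the per-cell Game-of-Life rule (the textbook rule,
// independent of any particular execution fabric).
// A cell is alive (1) or dead (0); its next state depends only on its
// current state and the number of live cells among its eight neighbors.
int nextCellState(int current, int liveNeighbors) {
    if (current == 1) {
        // A live cell survives with exactly two or three live neighbors.
        return (liveNeighbors == 2 || liveNeighbors == 3) ? 1 : 0;
    }
    // A dead cell becomes alive with exactly three live neighbors.
    return (liveNeighbors == 3) ? 1 : 0;
}
```

On the TSCF, each cell evaluates this rule concurrently on its own core, exchanging states with its eight neighbors over TruStreams rather than through a shared grid in memory.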

The TruStream Programming Model

The problem of assembling systolic arrays is very much akin to the problem factory architects have been dealing with, well, since the days of Henry Ford. Factory architects see the problem as one of organizing a collection of entities – machines, robots, human workers – that interact through streams of widgets flowing from one entity to another. And when these architects draw diagrams of their factories, they draw block diagrams, not flow charts. There is, nevertheless, a role for flow charts in architecting factories: they are perfectly suited to describing the sequential behavior of individual machines, robots, and human workers. To summarize:

Determining the sequence of tasks a worker is to perform and laying out a factory are two fundamentally different activities.

It is these observations that inspire our view of computing:

A computer is a factory for assembling data values.

and inspire our view of programming:

Some portions of a program are most easily and naturally expressed as flow charts.

Some portions of a program are most easily and naturally expressed as block diagrams.

Or equivalently:

Some portions of a program are most easily and naturally expressed in the TSPM thread domain.

Some portions of a program are most easily and naturally expressed in the TSPM stream domain.

8 https://en.wikipedia.org/wiki/Conway’s_Game_of_Life


That brings us to the TruStream Programming Model, which is based upon five C++ classes:

threadModule – Encapsulates thread-domain code, i.e., sequential code with TruStream gets and puts. NO parallelism is expressed here.

inputStream<T> – A threadModule input.

outputStream<T> – A threadModule output.

streamModule – Encapsulates stream-domain code. Defines a TruStream topology (block diagram). ALL parallelism is expressed here.

stream<T> – A TruStream. Connects one or more threadModule outputs to one or more threadModule inputs. Appears only inside a streamModule.

Figure 6 – The C++ classes that make up the TSPM for C++
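Cornami’s actual TSPM headers are not reproduced in this article, so the following is only an analogy in standard C++ that mimics the thread-domain / stream-domain split using std::thread and a blocking queue. The Stream class, producer, and consumer names are assumptions made for illustration and are not the TSPM API.

```cpp
// Analogy only (NOT Cornami's TSPM): the thread-domain / stream-domain split
// mimicked with std::thread and a blocking queue.
// - "thread domain" = the sequential body each worker runs (gets and puts)
// - "stream domain" = the wiring that says which output feeds which input
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// A blocking queue standing in for a stream connecting two workers.
template <typename T>
class Stream {
public:
    void put(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    T get() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
private:
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
};

// Thread domain: purely sequential code, expressed with gets and puts.
void producer(Stream<int>& out) {
    for (int i = 1; i <= 5; ++i) out.put(i * i);            // put values downstream
}
void consumer(Stream<int>& in) {
    for (int i = 0; i < 5; ++i) std::cout << in.get() << '\n';  // get and print them
}

int main() {
    // Stream domain: the "block diagram", one stream wiring producer to consumer.
    Stream<int> s;
    std::thread p(producer, std::ref(s));
    std::thread c(consumer, std::ref(s));
    p.join();
    c.join();
}
```

The point of the analogy is the separation of concerns the article describes: the producer and consumer bodies are pure flow-chart code, while main() is the block diagram; in the TSPM the streamModule plays that wiring role across the sea-of-cores fabric.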

To gain an intuitive understanding of these classes, refer to Figure 7 and Figure 8, which provide graphical representations of the five types of TruStream objects. Figure 7 also illustrates two key features of the TruStream Programming Model:

TruStream topologies may be cyclic (or acyclic).
Hierarchical topologies are created by nesting streamModules.

Source: http://cornami.com/wp-content/uploads/2017/06/EETimes_NewArchitecture_June2017.pdf
