Stealth startup Cornami on Thursday revealed some details of its novel approach to chip design to run neural networks. CTO Paul Masters says the chip will finally realize the best aspects of a technology first seen in the 1970s.
The tech world is awash in startup companies developing special chips for artificial intelligence and machine learning. Some of the most intriguing ones showed up this week at the Linley Group Fall Processor Conference, hosted by venerable semiconductor analysis firm The Linley Group, in Santa Clara, California.
ZDNet went to the show to get the lowdown on some of these stealth operations.
Thursday morning featured a presentation by a startup company called Cornami, based here in Santa Clara.
Co-founder and CTO Paul Masters described a novel way to arrange the elements of a chip to do both machine learning “training,” where the neural network is developed, and “inference,” where the neural network runs on a constant basis to provide answers.
Cornami has been operating in stealth mode and this was the first time Masters pulled open the curtain on some of the details of how the company’s chips work.
Cornami aims to supply its chips to numerous markets, including in “edge computing,” where automobiles and consumer electronics have an especial need for chips that have very responsive performance and are energy-efficient in how they run neural networks.
The chip, said Masters, goes back to technology of the 1970s and 1980s, called a “systolic” array. A systolic array has a plethora of computing elements, such as a multiplier-accumulator, to perform the matrix multiplications that are the fundamental compute unit of neural networks. Wires connect those elements to one another, and to memory, in a grid. Systolic arrays are so named for the systolic functioning of the heart: like blood flow, data is “pumped” through those computational elements.
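To make the “pumping” concrete, here is a minimal Python sketch of one classic textbook layout, an output-stationary systolic array, computing a matrix product. It is an illustration only and not Cornami's design, which was not disclosed at this level of detail. The function name systolic_matmul and the register names are hypothetical. Each grid cell stands in for one multiplier-accumulator: values from A enter at the left edge and move right, values from B enter at the top edge and move down, and each cell adds the product of whatever passes through it on each clock tick.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing A x B.

    Illustrative sketch only: cell (i, j) is one multiplier-accumulator
    that ends up holding element (i, j) of the product.
    """
    n, depth, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # running sum held in each cell
    a_reg = [[0] * m for _ in range(n)]  # value travelling rightward
    b_reg = [[0] * m for _ in range(n)]  # value travelling downward

    # Inputs are skewed: row i of A and column j of B start i and j
    # ticks late, so matching operands meet in the right cell.
    for t in range(n + m + depth - 2):
        # Visit cells from the far corner back, so each cell still sees
        # the value its neighbour held on the previous tick.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:
                    k = t - i
                    a_in = A[i][k] if 0 <= k < depth else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    k = t - j
                    b_in = B[k][j] if 0 <= k < depth else 0
                else:
                    b_in = b_reg[i - 1][j]
                a_reg[i][j], b_reg[i][j] = a_in, b_in
                acc[i][j] += a_in * b_in  # the multiply-accumulate step
    return acc


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(systolic_matmul(A, B))
```

Note that once a value has entered the grid, it is handed from cell to cell and never goes back to memory, which is the property the talk goes on to emphasize.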
Systolic arrays never took off when they first emerged, but they are shaping up as the dominant way to structure an AI chip, based on presentations this week. “You’ve seen it, it’s cool, it’s from the ’70s,” said Masters of systolic arrays.
“Google is using them, and Microsoft, and a zillion startups,” he observed of the popularity of systolic arrays.
But Masters discussed how Cornami has a unique approach to systolic arrays. “The curse of systolic array is that they’re square,” quipped Masters. He was referring to that symmetric arrangement of multiplier-accumulators. Because of that rigid arrangement, moving the data in and out of those compute elements takes up an enormous amount of the chip’s effort, more effort, in fact, than the compute itself inside each compute element.
“Where does the energy go in legacy silicon?” is the big question, said Masters. “Data gets dumped into DDR [DRAM memory], and it has to go to a core for computation, so the data goes from DDR to the Level 3 cache, the Level 2 cache, and the Level 1 cache, and then into the register, and then into the compute. Then if I run out of cores, I have to do the reverse, I have to go back out and dump all that temporary data back into the register, the L1 cache, the L2, the L3, and over and over.”
Just to “touch” the L1 cache, explained Masters, takes four times the energy of the actual computation. “And heaven help you if you touch DRAM,” he said, because going off the chip drives up the power required even more.
“The most energy-inefficient thing in legacy machines is moving data,” said Masters. The solution is to have thousands of cores. By keeping thousands of cores busy, one can avoid going back to the memory subsystem and instead simply route the inputs and outputs of compute from one element to the next. “If you have enough cores, 8,000, or 16,000, or 32,000 cores, we can keep the entirety of the neural network on die,” he said.
And so to avoid that cost of going into and out of memory, the Cornami chips arrange their circuits such that the compute elements can be switched to a variety of geometric arrangements that effectively organize the computing activity on chip in a way that changes with the demands of the neural network at the moment.
“Cornami built an architecture where systolic arrays can be built of any size, any shape, on demand.” As the slide up above shows, the systolic array can be dynamically re-arranged to strange new geometries that are not squares. Those strange shapes of arrays make it efficient to move inputs and outputs between the compute elements. And as a result, the Cornami chip can minimize memory and cache references, and thereby “dramatically improve power, latency and performance.”
Masters boasted that with such flexibility, a single Cornami chip can process an entire neural network, and will be able to replace the various combinations of CPUs, GPUs, FPGAs and ASICs typically used to run neural networks. It is a “data center on a chip,” he said, with big implications for putting AI in “edge computing” such as automobiles.
Masters showed off some stats for performance: running the “SegNet” neural network for image recognition, the Cornami chip is able to process 877 frames per second in the neural network using only 30 watts, compared to an Nvidia “Titan V” GPU, which processes only 8.6 frames per second at 250 watts.
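For a rough sense of what those figures imply, the short Python sketch below works out throughput per watt, which is the same as frames per joule, for each chip. The four numbers are the ones Masters presented and are vendor claims, not independent measurements. The helper name frames_per_joule is made up for this illustration.

```python
def frames_per_joule(frames_per_second, watts):
    """Frames per second divided by watts (joules per second)."""
    return frames_per_second / watts


# Figures as quoted in the presentation (vendor-supplied, unverified).
cornami = frames_per_joule(877, 30)
titan_v = frames_per_joule(8.6, 250)

speedup = 877 / 8.6                  # raw throughput ratio
efficiency_gain = cornami / titan_v  # ratio of frames per joule

if __name__ == "__main__":
    print(f"Cornami: {cornami:.2f} frames per joule")
    print(f"Titan V: {titan_v:.4f} frames per joule")
    print(f"Throughput ratio: {speedup:.0f}x")
    print(f"Energy-efficiency ratio: {efficiency_gain:.0f}x")
```

Taken at face value, the claim amounts to roughly 100 times the throughput and about 850 times the frames per joule of the GPU on this one benchmark.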
Cornami received Series B venture funding of $3 million back in September of 2016 from Impact Venture Capital. The company has received subsequent funding but declined to disclose how much.