Beyond Moore’s Law with Parallel Processing & Heterogeneous SoCs
March 01, 2021
2021 Embedded Processor Report: With the dependable performance-per-watt gains of transistor scaling drawing to a close, how will future generations of processors access the compute necessary to efficiently execute demanding workloads? The answer may come via parallel processing on heterogeneous SoCs.
“We’ve been working on 7 nm for a long time, and during that time we not only saw the end of Moore’s law, but we also saw the end of Amdahl’s law and Dennard scaling,” says Manuel Uhm, Director of Silicon Marketing at Xilinx. “What that means is, if all we did was take an FPGA and just shrink those transistors to 7 nm from our previous node, which was 16 nm, and just call it a day, many customers trying to move over the exact same design might quite possibly end up with a design that quite frankly does not have any increase in performance and may, in fact, increase power consumption.
“And clearly that’s going totally the wrong way.”
To be clear, it’s not impossible to shrink silicon transistors below 7 nm; 5 nm devices are already in production. It’s that the underlying metal isn’t running any faster, and current leakage is on the rise.
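A back-of-the-envelope sketch of the scaling arithmetic, using assumed values rather than any vendor’s figures, shows why a shrink alone no longer delivers a power win once supply voltage stops dropping:

```python
# Rough, simplified model (not from the article): dynamic switching power per
# transistor is roughly C * V^2 * f. Under classic Dennard scaling, C and V
# both shrink by 1/k while clock frequency f rises by k, so power density on
# the die stays flat. Once supply voltage stops scaling, the same shrink
# pushes power density up. Leakage, which makes things worse, is ignored here.

def per_transistor_power(c, v, f):
    """Dynamic switching power, proportional to C * V^2 * f."""
    return c * v ** 2 * f

k = 1.4                  # assumed linear shrink factor per node (hypothetical)
density_gain = k ** 2    # transistors per unit area grow by roughly k^2

baseline = per_transistor_power(1.0, 1.0, 1.0)

# Ideal Dennard scaling: C and V both drop by 1/k, frequency rises by k.
dennard = per_transistor_power(1 / k, 1 / k, k) * density_gain        # ~1.0x

# Post-Dennard: V is stuck near its leakage floor, frequency still rises by k.
voltage_limited = per_transistor_power(1 / k, 1.0, k) * density_gain  # ~2.0x

print(f"baseline power density:         {baseline:.2f}")
print(f"ideal Dennard shrink:           {dennard:.2f}")
print(f"shrink without voltage scaling: {voltage_limited:.2f}")
```

The numbers are arbitrary, but the shape of the result is the point: without voltage scaling, a straight shrink trades performance against power rather than delivering both.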
Meanwhile, in the other direction, traditional multicore devices have hit scaling limitations of their own. Of course, those parallel processors have historically been homogeneous, “and the reality is there is no single processor architecture that can do every task optimally,” Uhm contends (Figure 1). “Not an FPGA, not a CPU, not a GPU.”
[Figure 1 | “No single processor architecture … can do every task optimally. Not an FPGA, not a CPU, not a GPU.”]
This isn’t to say parallelism can’t be advantageous in tackling the complex processing tasks presented by modern applications. Indeed, beyond Moore’s law and Dennard scaling, parallel computing may be our best option in high-performance computing (HPC) and other demanding use cases.
Yes, we still need parallel processing. But of the heterogeneous variety.
Heterogeneous Processing: Not Just for the Data Center
As mentioned, the bleeding edge of heterogeneous parallel processing technology is a response to performance walls in high-end applications. But these architectures are also becoming more commonplace in embedded computing environments.
Dan Mandell, Senior Analyst at VDC Research, points out that while “it is true that many heterogeneous processing architectures have been focused on high-end applications, particularly for the datacenter and HPC … miniaturization of FPGA SoCs and other heterogeneous accelerated silicon is top of mind for companies like Microsemi and Xilinx to bring more of these devices into intelligent edge infrastructure like edge/industrial servers and IoT gateways.”
According to Mandell, a key driver of general-purpose heterogeneous computing platforms in the embedded market “is a lot of hesitancy among OEMs and others today about committing to a hardware architecture.” The hesitation, he says, is a product of rapid evolutions in specialized accelerated silicon, as well as uncertainty in the frameworks and workloads that will be produced by the edge software and AI ecosystems in the coming years.
He expects all of these circumstances to “have a great influence in future semiconductor sourcing,” as well as how chip suppliers approach their processor roadmaps.
“The price and power envelope of most of these FPGA SoCs today will force suppliers to initially focus on relatively high-end, high-resource embedded and edge applications,” Mandell posits (Figure 2). “However, there is an active effort to make FPGA SoCs ‘size agnostic’ to eventually support even battery-powered connectivity devices.”
[Figure 2 | The Xilinx Versal VC1902 is a 7 nm device containing Arm Cortex-A72 and Cortex-R5F CPU cores, 400 AI engines, DSP blocks, and significant programmable logic, all integrated using a programmable network on chip.]
So as heterogeneous parallel processing becomes more commonplace, should embedded engineers prepare for a paradigm shift in system design? Deepu Talla, Vice President and General Manager of Embedded & Edge Computing at Nvidia, doesn’t think so.
“If you think about it, embedded processors have always used accelerators,” Talla says. “Even 20 years ago, there was an Arm CPU, there was a DSP, and then there was video encode/decode done in specific hardware, right? They’re fixed-function in some sense, but they’re all processing things in parallel.
“The reason you needed to do that was cost, power, size,” he continues. “The efficiency of the parallel processor is orders of magnitude more than just the CPU.”
Nvidia’s Xavier SoC, the device at the heart of the company’s Jetson Xavier embedded platform, and its next-generation Orin architecture, due in late 2021 or 2022, both integrate GPUs, Arm CPUs, deep learning accelerators, vision accelerators, encoders/decoders, and other specialized processing blocks (Figure 3).
[Figure 3 | The Nvidia Xavier SoC equips an Arm-based Carmel CPU, Volta GPU, deep learning and vision accelerators, and other fixed-function compute blocks that can process workloads in parallel.]
However, one change embedded developers can expect as advanced heterogeneous SoCs become more prevalent is the use of network-on-chip (NoC) interconnects, which over the last decade have evolved beyond traditional on-chip buses like the AMBA interface. A NoC provides “control over how you connect the CPU, GPU, your video encoder, deep learning accelerator, the display processor, the camera processor, the security processor, all those things,” Talla says.
NoCs accelerate and optimize the flow of data from block to block across the SoC, helping workloads execute as efficiently as possible. NXP, for example, has leveraged both NoCs and traditional bus architectures across its versatile i.MX line of SoCs. Recently, the company announced the i.MX9 (Figure 4).
[Figure 4 | The NXP i.MX9 family will incorporate real-time and applications processors, dedicated EdgeLock cryptographic processors, and neural processing units (NPUs), among other compute blocks.]
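As a rough illustration of why the on-chip interconnect matters, the toy model below (hypothetical blocks, bandwidths, and transfer sizes, not a description of any vendor’s NoC) compares several block-to-block streams serializing over one shared bus against the same streams taking independent routes across a NoC:

```python
# Toy model with made-up numbers: time to move several independent streams
# between SoC blocks over (a) a single shared bus, where transfers serialize,
# and (b) a NoC where non-conflicting routes run concurrently.

# (source block, destination block, megabytes to move) -- all assumed
streams = [
    ("camera ISP",    "deep learning accelerator", 64),
    ("GPU",           "display processor",         32),
    ("video decoder", "CPU cluster",               16),
]

BUS_BW_MB_PER_S  = 1600.0   # assumed shared-bus bandwidth
LINK_BW_MB_PER_S = 1600.0   # assumed per-route NoC bandwidth

# Shared bus: every transfer contends for the same fabric, so the total time
# is the sum of the individual transfer times.
bus_time_s = sum(mb / BUS_BW_MB_PER_S for _, _, mb in streams)

# NoC: each stream gets its own route, so in this idealized case the total
# time is set by the single slowest transfer.
noc_time_s = max(mb / LINK_BW_MB_PER_S for _, _, mb in streams)

print(f"shared bus: {bus_time_s * 1e3:.0f} ms   NoC: {noc_time_s * 1e3:.0f} ms")
```

Real interconnects add routing, arbitration, and quality-of-service policies that this sketch ignores; it only shows why giving each heterogeneous block its own path to data keeps more of the chip busy at once.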
“Heterogeneous compute is something that we’ve actually been implementing for many years. I believe now is where we are really starting to hit that sweet spot of how we’re using it,” says Dr. Gowrishankar Chindalore, Head of Business & Technology Strategy for Edge Processing at NXP Semiconductors, Inc. “The same is happening with machine learning, because we’re using a CPU, GPU, DSPs, and neural processing unit (NPU) today.
“But part of the optimization, it’s not just the compute elements. It’s everything around the system that needs to happen,” he continues. “So where we’re focusing on improving efficiency, in addition to the heterogeneous compute, is looking at wastage through the whole flow in the chip: the vision pipeline, the video pipeline, the graphics pipeline.
“Because the more we can do that, the more efficiency we get in performance and, clearly, the less energy that’s used to do the same function,” he adds.
(Editor's Note: Read "Three Ways to Achieve Tenfold Embedded Memory Performance for Heterogeneous Multicore")
Heading Towards a Heterogeneous World
Citing VDC Research’s 2020 IoT, Embedded & Mobile Processors technology report, Mandell expects the global market for embedded SoCs to “continue outgrowing the merchant markets for discrete semiconductors such as MPUs, MCUs, GPUs, etc. for the next several years” as OEMs look to consolidate computing resources and multi-chip implementations. Over the long term, the demand for workload acceleration and processor optimization will only “drive a further uptick,” he says.
In the meantime, the way we measure performance and power consumption will have to change. As Mike Demler, Senior Analyst at The Linley Group, asserts in his firm’s Guide to Processors for Deep Learning, even new AI-centric benchmarks like TOPS/W are “misleading, because the real AI workloads never achieve close to 100 percent utilization.”
We will have to measure things like power efficiency with “a real workload, such as BERT NLP models, rather than a theoretical, architecture-based specification,” he says.
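Demler’s point can be made concrete with a little arithmetic. The sketch below uses entirely hypothetical numbers (they are not from the Linley Group report or any datasheet) to compare a notional accelerator’s headline TOPS/W against the efficiency actually achieved on a real model at realistic utilization:

```python
# Hypothetical example: why a datasheet TOPS/W figure overstates efficiency
# on real models. All numbers below are assumptions for illustration only.

peak_tops = 32.0     # assumed datasheet peak throughput, tera-ops per second
tdp_watts = 15.0     # assumed power envelope, watts
datasheet_tops_per_watt = peak_tops / tdp_watts            # ~2.1 TOPS/W on paper

# Assumed measurements on a real workload (e.g., a BERT-style NLP model):
measured_inferences_per_s = 120.0
tera_ops_per_inference    = 0.09    # ~90 GOPs per inference, assumed
measured_watts            = 12.0

achieved_tops      = measured_inferences_per_s * tera_ops_per_inference  # 10.8
utilization        = achieved_tops / peak_tops                           # ~34%
real_tops_per_watt = achieved_tops / measured_watts                      # ~0.9

print(f"datasheet: {datasheet_tops_per_watt:.2f} TOPS/W")
print(f"measured:  {real_tops_per_watt:.2f} TOPS/W at {utilization:.0%} utilization")
```

On paper this notional part looks like a 2 TOPS/W device; measured on the workload it lands closer to 1, which is exactly the gap Demler is warning about.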
But does it even make sense to measure the processor complex in isolation anymore? Did it ever really matter? As it always has, the focus will be on what the processor delivers in the context of the system around it.
“Before, with every process node, it’s like, ‘Oh great. I get double the performance at half the power consumption!’” Uhm says. “Those days are gone. Those days are absolutely gone for everybody. At 7 nm, those transistors start getting leaky now. And you just run into other kinds of problems that are, in many cases, we believe, insurmountable.
“And so, having come to that realization, we’re looking now at system-level problems,” he continues. “We’re putting all these things together and understanding all those trade-offs and making sure that we’re able to encompass as much of the processing as possible in a way that allows the performance and power budgets to be met. And again, those aren’t easy things anymore. We realized that we’re going to be able to offer increased performance or decreased power consumption, and in some cases it’s either/or. It’s not always a given that you’re going to get both.
“Again, no processor is optimal for everything. You can’t always increase performance and lower power consumption,” Uhm continues. “But focusing on this new architecture, a heterogeneous processor, is essentially what allows them to do that.”