## A 1.4 GHz 695 Giga RISC-V Inst/s 496-core Manycore Processor with Mesh On-Chip Network and an All-Digital Synthesized PLL in 16nm CMOS

Austin Rovinski<sup>1</sup>, Chun Zhao<sup>2</sup>, Khalid Al-Hawaj<sup>3</sup>, Paul Gao<sup>2</sup>, Shaolin Xie<sup>2</sup>, Christopher Torng<sup>3</sup>, Scott Davidson<sup>2</sup>, Aporva Amarnath<sup>1</sup>, Luis Vega<sup>2</sup>, Bandhav Veluri<sup>2</sup>, Anuj Rao<sup>4</sup>, Tutu Ajayi<sup>1</sup>, Julian Puscar<sup>4</sup>, Steve Dai<sup>3</sup>, Ritchie Zhao<sup>3</sup>, Dustin Richmond<sup>2</sup>, Zhiru Zhang<sup>3</sup>, Ian Galton<sup>4</sup>, Christopher Batten<sup>3</sup>, Michael B Taylor<sup>2</sup>, Ronald G Dreslinski<sup>1</sup>

<sup>1</sup>U. Michigan, Ann Arbor, MI; <sup>2</sup>U. Washington, Seattle, WA; <sup>3</sup>Cornell U., Ithaca, NY; <sup>4</sup>UC - San Diego, San Diego, CA;

**Abstract** - This paper presents a 16nm 496-core RISC-V network-on-chip (NoC). The mesh achieves 1.4GHz at 0.98V, yielding a peak of 695 Giga RISC-V instructions/s (GRVIS) and a record 812,350 CoreMark benchmark score. The main feature is the NoC architecture, which uses only 1881µm<sup>2</sup> per router node, enables highly scalable and dense compute, and provides up to 361 Tb/s of aggregate bandwidth.

## Introduction

Complex, data-parallel workloads continue to push towards edge devices, such as mobile and IoT platforms. In particular, streaming-based workloads like real-time computer vision are steadily increasing in demand. Mobile devices need high energy efficiency to run these computationally intensive workloads. At the same time, the hardware must remain flexible enough to run state-of-the-art algorithms as well as workloads that emerge post-fabrication. Prior high-efficiency manycore architectures<sup>[1-3]</sup> that target streaming workloads have yielded high area and energy efficiencies (Table 3). However, much of the die area in these architectures was dedicated to the NoC, including cache-coherence protocol controllers, which limits compute density and efficiency. We demonstrate a new NoC architecture that enables fast inter-node communication with a significantly reduced die area (3.7x-118x) compared to prior work. The processor is composed of a 496-core array of 5-stage, in-order RISC-V RV32IM cores in a mesh configuration (Fig. 1). It achieves a peak of 695 GRVIS and a record 812,350 CoreMark benchmark score.

## **Manycore Architecture**

To achieve a high compute density, the network architecture (Fig. 1) differs significantly from a traditional coherent shared-memory model. Instead of caches, the manycore processor has a partitioned global physical address space across all network nodes. Each core's memory occupies an address range which is globally accessible by any core over the network. The network enforces remote stores as part of the Remote Store Programming (RSP) model, which both obviates logic and prevents pipeline stalls associated with long-latency remote loads. The network has a single virtual channel with dimension-ordered routing, which greatly simplifies the network logic. Combined with RSP, this guarantees deadlock-free, in-order delivery. The routers are single-stage, which allows for minimal memory to hold in-transit messages and single-cycle latency per hop. Rate limiting and memory fences are implemented via source-controlled credit counters. Credits are returned over a separate 9-bit NoC with the same architecture as Fig. 1. To enter or exit the NoC, messages are sent to an address below the bottom of the mesh, from which they are forwarded to a mesh-attached host processor running a full operating system, such as Linux.
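The routing and flow-control behaviour described above can be illustrated with a minimal Python sketch. It is not the authors' RTL: the `(x, y)` coordinate convention, the choice to resolve the X dimension before Y, and the names `xy_route` and `CreditCounter` are assumptions made for illustration only.

```python
def xy_route(src, dst):
    """Return the tiles visited from src to dst under dimension-ordered routing.

    Assumption: the X offset is resolved completely before the Y offset.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


class CreditCounter:
    """Source-side counter: one credit is held per remote store in flight."""

    def __init__(self, max_credits):
        self.max_credits = max_credits
        self.available = max_credits

    def try_send(self):
        """Spend a credit if one is free; otherwise the sender must wait."""
        if self.available == 0:
            return False
        self.available -= 1
        return True

    def credit_returned(self):
        """Called when an acknowledgement arrives over the credit network."""
        self.available += 1

    def fence_done(self):
        """A fence completes once every outstanding store is acknowledged."""
        return self.available == self.max_credits


path = xy_route((0, 0), (3, 2))
hops = len(path) - 1  # with single-cycle routers, hops approximates cycles in flight
```

Because every packet follows the same fixed dimension order on a single channel, two stores between the same pair of tiles take the same path and cannot overtake each other, which is the intuition behind the in-order delivery guarantee.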

Fig. 2 shows the layout of a single tile, which contains an RV32IM core and the routing logic for that node. The core contains 2x 4KB SRAMs for I- and D-MEMs, and a 32-entry, 32b register file implemented using two 1r1w latch-based memories. By cell area, the core occupies 17863µm<sup>2</sup> (90.5%) and the router occupies 1881µm<sup>2</sup> (9.5%). The router supports 80b transfers per cycle, which packages data, address, and commands into a single flit. This technique provides a faster and simplified model compared to traditional approaches (Fig. 5, Table 2). The router and core run on the same clock domain up to 1.4GHz, allowing each tile to both transfer 750 Gb/s and process 1.4 GRVIS. Several gaps were created between rows of tiles to allow for ESD cells and In-Cell Overlays (ICOVL) as required for fabrication (Fig. 6). The total die area of the manycore is 15.25mm<sup>2</sup> as fabricated with ESD+ICOVL (or 12.03mm<sup>2</sup> without). This yields an area efficiency of 45.57 GRVIS/mm<sup>2</sup> (57.77 GRVIS/mm<sup>2</sup>).
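The area and throughput figures in this paragraph can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers quoted in the text; the variable names are illustrative, and the per-link bandwidth assumes one 80b flit is transferred per link per cycle.

```python
CORE_AREA_UM2 = 17863
ROUTER_AREA_UM2 = 1881
NUM_CORES = 496
CLOCK_GHZ = 1.4
FLIT_BITS = 80
PEAK_GRVIS = 695
DIE_AREA_MM2 = 15.25          # as fabricated, with ESD + ICOVL gaps
DIE_AREA_NO_GAPS_MM2 = 12.03  # without ESD + ICOVL gaps

tile_cell_area = CORE_AREA_UM2 + ROUTER_AREA_UM2
core_share_pct = 100 * CORE_AREA_UM2 / tile_cell_area
router_share_pct = 100 * ROUTER_AREA_UM2 / tile_cell_area

# One instruction per cycle per core gives the ideal peak throughput.
ideal_grvis = NUM_CORES * CLOCK_GHZ

# Assumption: one 80b flit per link per cycle.
link_gbps = FLIT_BITS * CLOCK_GHZ

area_eff = PEAK_GRVIS / DIE_AREA_MM2
area_eff_no_gaps = PEAK_GRVIS / DIE_AREA_NO_GAPS_MM2

print(f"core {core_share_pct:.1f}%, router {router_share_pct:.1f}%")
print(f"ideal peak {ideal_grvis:.1f} GRVIS, link {link_gbps:.0f} Gb/s")
print(f"{area_eff:.2f} GRVIS/mm^2 ({area_eff_no_gaps:.2f} without gaps)")
```

The ideal product of 496 cores and 1.4 GHz comes to about 694.4 GRVIS, which is consistent with the reported 695 GRVIS peak if the quoted clock frequency is rounded.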

The processor clock is supplied by a fully synthesized and automatically placed-and-routed clock generator. It operates from an isolated 0.8V supply and occupies 5898µm<sup>2</sup>. The output frequency is tunable from 10MHz to 3.3GHz in steps of  $\leq 2\%$ , with a (simulated) period jitter of < 2.5ps. The core is a 1<sup>st</sup>-order FDC-based PLL (Fig. 5). The 16 ring DCOs together cover 1.3-3.3 GHz. Each DCO inverter delay element is loaded with a bank of NAND gate frequency control elements (FCEs)[4], 37 of which are controlled by the DCO drift compensator to adjust for temperature and supply variations. The DCO control logic partitions its input into integer and fractional parts. The former drives 8 FCEs with an update rate of  $f_{ref} = 26$  MHz. The latter is oversampled by a 2<sup>nd</sup>-order  $\Delta\Sigma$  modulator which drives 8 FCEs through a dynamic element matching encoder.
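As a rough illustration of the DCO control path, the Python sketch below splits a control word into integer and fractional parts and feeds the fractional part to a generic 2<sup>nd</sup>-order error-feedback ΔΣ modulator. The paper does not specify the modulator topology, word widths, or output levels, so the structure, the 4-bit fraction, and the function names here are all assumptions rather than the fabricated design.

```python
def split_control_word(word, frac_bits):
    """Split a fixed-point control word into (integer part, fractional part)."""
    integer_part = word >> frac_bits
    frac_part = word & ((1 << frac_bits) - 1)
    return integer_part, frac_part


def delta_sigma_2nd_order(frac, frac_bits, n_samples):
    """Generic 2nd-order error-feedback delta-sigma modulator (integer arithmetic).

    The output v satisfies v = x + (1 - z^-1)^2 * q, where q is the
    quantization error, so the noise is pushed to high frequencies and the
    long-run average of the output equals frac / 2**frac_bits.
    """
    scale = 1 << frac_bits
    q1 = q2 = 0                      # previous two quantization errors
    out = []
    for _ in range(n_samples):
        y = frac - 2 * q1 + q2       # input plus shaped error feedback
        v = y // scale               # floor quantizer (multi-level output)
        q = v * scale - y            # quantization error, in (-scale, 0]
        out.append(v)
        q2, q1 = q1, q
    return out


integer_part, frac_part = split_control_word(0b1011_0101, frac_bits=4)
stream = delta_sigma_2nd_order(frac_part, frac_bits=4, n_samples=1600)
average = sum(stream) / len(stream)  # approaches 5/16 for this input
```

In this sketch the modulator output takes values in {-1, 0, 1, 2}; how such levels would map onto the 8 fractional FCEs through the dynamic element matching encoder is left out, since the text does not describe it.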

## **Experimental Results**

We run the industry-standard CoreMark benchmark. The benchmark was slightly modified by combining two loops in order to reduce the binary size by 80B to fit within a tile's I-MEM. Fig. 3 identifies the operating configurations where CoreMark reports a correct result for all tiles. The processor achieves a max throughput of 695 GRVIS at 1.4GHz and 0.98V – *the highest single-chip RISC-V throughput to date* – and a max energy efficiency of 314.89 GRVIS/W at 500MHz and 0.60V. It achieves a record CoreMark score of 812,350, *outperforming the next best score by more than 2x*.
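Two derived quantities help put these results in context. The Python sketch below computes the CoreMark score per MHz per core and the power implied by the peak-efficiency point. It assumes the record score was measured at 1.4GHz across all 496 cores, and that throughput at 500MHz scales as one instruction per cycle per core; neither derived number is reported in the text.

```python
NUM_CORES = 496
COREMARK_SCORE = 812_350
PEAK_CLOCK_MHZ = 1400          # assumption: score measured at 1.4 GHz
EFF_CLOCK_GHZ = 0.5
EFF_GRVIS_PER_W = 314.89

# Whole-chip and per-core CoreMark/MHz.
coremark_per_mhz = COREMARK_SCORE / PEAK_CLOCK_MHZ
coremark_per_mhz_per_core = coremark_per_mhz / NUM_CORES

# Assumption: one instruction per cycle per core at the efficiency point.
eff_point_grvis = NUM_CORES * EFF_CLOCK_GHZ
implied_power_w = eff_point_grvis / EFF_GRVIS_PER_W

print(f"{coremark_per_mhz_per_core:.2f} CoreMark/MHz per core")
print(f"~{implied_power_w:.2f} W implied at 500 MHz, 0.60 V")
```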

Our NoC router outperforms all compared works for normalized area (3.7x-118x), minimum latency, and overhead (Table 2). Kilocore[3] modestly outperforms this work in network bandwidth; however, it uses a circuit-switched network which is statically routed prior to runtime. The performance exceeds TILE64[1] and Piton[2] by at least 27.6x for normalized area efficiency, 7.0x for energy efficiency, and 4.8x for throughput. Kilocore[3] performs similarly for area efficiency, although it uses a 16-bit datapath and a small memory size (7x less I+D-Mem than this work). We still achieve a 2.1x-46.8x higher energy efficiency than the compared works. ESSCIRC '14[5] reports the state-of-the-art in GRVIS throughput, which we outperform by 267x.

This research employed a BaseJump ASIC Motherboard; the bringup effort was partly funded by the DARPA/SRC JUMP ADA center.




<sup>f</sup> Network bisection bandwidth = (minimum number of links cut to bisect the network) × (link bandwidth)
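The bisection bandwidth definition above can be sketched for a 2D mesh as follows. The sketch considers only straight cuts that split the mesh into two equal halves, and the 4×8 mesh in the example is hypothetical, since the array dimensions are not stated in the text. The link bandwidth assumes one 80b flit per cycle at 1.4GHz.

```python
def mesh_bisection_links(cols, rows):
    """Fewest links severed by a straight cut that halves a cols x rows mesh.

    Assumption: only straight cuts between two columns or two rows are
    considered, so the dimension being split must be even.
    """
    candidates = []
    if cols % 2 == 0:
        candidates.append(rows)   # vertical cut severs one link per row
    if rows % 2 == 0:
        candidates.append(cols)   # horizontal cut severs one link per column
    if not candidates:
        raise ValueError("no straight cut splits this mesh into equal halves")
    return min(candidates)


def bisection_bandwidth_gbps(cols, rows, link_gbps):
    """Bisection bandwidth = (links cut) * (bandwidth of each link)."""
    return mesh_bisection_links(cols, rows) * link_gbps


LINK_GBPS = 80 * 1.4  # assumed: one 80b flit per cycle at 1.4 GHz

# Hypothetical 4-column by 8-row mesh: the cheapest cut severs 4 links.
example = bisection_bandwidth_gbps(4, 8, LINK_GBPS)
```

The result counts one direction per severed link; a convention that counts both directions of each link would double it.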