pith. machine review for the scientific record.

arxiv: 2604.17862 · v1 · submitted 2026-04-20 · 💻 cs.LG · cs.AR

Recognition: unknown

M100: An Orchestrated Dataflow Architecture Powering General AI Computing

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:46 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AR
keywords: dataflow architecture · AI inference · autonomous driving · large language models · compiler-architecture co-design · tensor-based scheduling · general AI computing · M100

The pith

M100 uses compiler-orchestrated dataflow to enable efficient general AI inference for autonomous driving and large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents M100 as a new architecture aimed at providing versatile and efficient computing for AI inference tasks. It claims that a dataflow design, combined with tight compiler and hardware integration, allows explicit management of data movement instead of relying on caches, leading to improved performance and reduced complexity. This matters for applications in autonomous vehicles where both driving perception and language-based interactions are needed, as well as for running large models. Benchmarks indicate it surpasses standard GPU setups in utilization for driving tasks. If successful, it points to dataflow methods as a way to balance generality and efficiency in AI hardware.

Core claim

M100 is a dataflow parallel architecture that uses compiler-architecture co-design to orchestrate computation and, crucially, data movement across time and space. Tensor computations are driven by compiler- and runtime-managed data streams flowing between computing elements and on/off-chip memories, largely eliminating caching. With the tensor chosen as the fundamental scheduling unit, M100 demonstrates capability on UniAD for autonomous driving and LLaMA for LLMs, outperforming GPGPU architectures in AD applications through higher utilization.
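
Read concretely, the claim is that data movement is decided at compile time rather than discovered by a cache hierarchy at run time. A minimal sketch of that contrast follows; the names (`Step`, `schedule_matmul`), the tile size, and the schedule format are hypothetical, since the paper does not publish M100's intermediate representation.

```python
# A toy "compiler" pass in the spirit of the core claim: partition
# C = A @ B into mini-tensors and emit an explicit schedule in which
# every transfer is named ahead of time, so the hardware executes
# streams instead of reacting through a cache. All names are invented
# for illustration; this is not M100's IR.
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    op: str       # "load", "matmul", or "store"
    operand: str  # which mini-tensor (tile) this step touches
    route: str    # explicit data movement, e.g. "HBM->SRAM"

def schedule_matmul(m: int, n: int, k: int, tile: int = 128) -> List[Step]:
    """Emit a fully static load/compute/store schedule for C = A @ B."""
    steps: List[Step] = []
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                steps.append(Step("load", f"A[{i}:{i+tile},{p}:{p+tile}]", "HBM->SRAM"))
                steps.append(Step("load", f"B[{p}:{p+tile},{j}:{j+tile}]", "HBM->SRAM"))
                steps.append(Step("matmul", f"C[{i}:{i+tile},{j}:{j+tile}]", "SRAM"))
            steps.append(Step("store", f"C[{i}:{i+tile},{j}:{j+tile}]", "SRAM->HBM"))
    return steps

print(len(schedule_matmul(256, 256, 256)))  # 28 steps for a 2x2x2 tiling
```

The question such a schedule leaves open, and the one the referee report below presses on, is what happens when shapes or control flow are not known at compile time.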

What carries the argument

Orchestrated dataflow architecture with compiler-architecture co-design managing data streams at tensor granularity to eliminate most caching.

Load-bearing premise

The compiler-architecture co-design can orchestrate data movement and scheduling for diverse and changing AI workloads without excessive overhead or sacrificing broad applicability.

What would settle it

A benchmark comparison on a newly released AI model for autonomous driving or LLM inference where M100 fails to show higher utilization or requires significantly more compilation time than a GPGPU baseline.

Figures

Figures reproduced from arXiv: 2604.17862 by Changkui Mao, Changsong Wu, Chao Lu, Chao Suo, Cheng Qian, Chun Yang, Danyang Zhu, Hengchang Xiong, Hongzhan Lu, Hongzhen Liu, Jiafu Liu, Jie Chen, Jie Dai, Junfeng Tang, Kai Liu, Kun Li, Lipeng Ge, Meng Sun, Min Luo, Peng Chen, Peng Wang, Shaodong Yang, Shibin Tang, Shibo Chen, Weikang Zhang, Xiaobo Du, Xiao Ling, Xin Wu, Yang Liu, Yan Xie, Yihua Jin, Yi Jiang, Yin Huang, Yuli Zhang, Zhen Yuan, Zhiyuan Man, Zhongxiao Yao.

Figure 1. Computing blocks in M100, each composed of three computing … view at source ↗
Figure 2. Architecture of the M100 NPU memory system without multi-level … view at source ↗
Figure 4. The high-level block diagram of the M100 SoC. view at source ↗
Figure 5. The high-level architecture of the M100 NPU. view at source ↗
Figure 7. Structure of a TPB cluster. The cluster-level hierarchy exists for two reasons: first, four TPBs share common resources (the instruction buffer, ICB and DRB nodes, and a RISC-V CPU), allowing more silicon area to be allocated to tensor processing and thus improving compute density; second, … view at source ↗
Figure 9. Architecture of the HBSM. The 2 MB HBSM SRAM is uniformly shared across all TPB functional units. view at source ↗
Figure 10. An example of a 3-level TWU. A TPB functional unit typically has two or more input TWUs and one output TWU; each TWU is configured by the TPB instruction for that functional unit, which specifies the number of nested loop levels and the Initial, Step, and Final values for each level. view at source ↗
Figure 11. Architecture of the TCU. The Tensor Computing Unit accelerates tensor contraction operations with a dense array of compute elements; to sustain high throughput under limited memory bandwidth, data reuse is essential. view at source ↗
Figure 12. The architecture of the CVU. view at source ↗
Figure 13. Overview of the M100 AI compiler toolchain. The space-time scheduler maps a neural network subgraph onto the M100 NPU hardware; if necessary, large tensors are partitioned into mini-tensors that are passed … view at source ↗
Figure 14. Space-time scheduler subgraph mapping and tensor streaming on … view at source ↗
Figure 15. UniAD Framework. view at source ↗
Figure 16. Detailed execution trace of M100 TPB instructions collected by the … In the trace, sustained bars indicate activity and gaps denote idle or waiting periods; throughout most of the sampling window, the DMAs in the CCB, along with the TCU, CVU, CSU, and GSDU in one of the TPBs, remain continuously active with substantial overlap in task execution, indicating high hardware utilization and the architecture's strong parallel execution capabilities. view at source ↗
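
Of the mechanisms above, Figure 10's Tensor Walking Unit (TWU) is the most self-contained: per the caption, a TWU is programmed with a loop-level count and Initial, Step, and Final values per level, and then emits an address sequence. The sketch below models such a generator; treating Final as an exclusive bound and the emitted address as the sum of the per-level counters are assumptions the excerpt does not confirm.

```python
# A minimal model of a TWU-style address generator, assuming the
# (initial, step, final) loop configuration described in Figure 10.
# The walk order and the address formula are illustrative guesses,
# not the M100 ISA.
from itertools import product
from typing import Iterator, Sequence, Tuple

Loop = Tuple[int, int, int]  # (initial, step, final); final assumed exclusive

def twu_addresses(levels: Sequence[Loop]) -> Iterator[int]:
    """Walk the configured nested loops, outermost level first, and
    yield one address per step as the sum of the per-level counters."""
    counters = [range(init, final, step) for (init, step, final) in levels]
    for values in product(*counters):
        yield sum(values)

# Example: two 4x4 tiles out of an 8-column row-major matrix.
addrs = list(twu_addresses([
    (0, 4, 8),   # level 0: 2 tiles along the row, 4 columns apart
    (0, 8, 32),  # level 1: 4 rows per tile, row pitch 8
    (0, 1, 4),   # level 2: 4 columns per tile, unit stride
]))
assert addrs[:5] == [0, 1, 2, 3, 8]  # first tile: row 0, then row 1
assert addrs[16:18] == [4, 5]        # second tile starts at column 4
```

The appeal of such a unit is that the compiler can express a tiled tensor walk as a few triples rather than a stream of per-element addresses, which is what lets tensors, not cache lines, be the unit of movement.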
original abstract

As deep learning-based AI technologies gain momentum, the demand for general-purpose AI computing architectures continues to grow. While GPGPU-based architectures offer versatility for diverse AI workloads, they often fall short in efficiency and cost-effectiveness. Various Domain-Specific Architectures (DSAs) excel at particular AI tasks but struggle to extend across broader applications or adapt to the rapidly evolving AI landscape. M100 is Li Auto's response: a performant, cost-effective architecture for AI inference in Autonomous Driving (AD), Large Language Models (LLMs), and intelligent human interactions, domains crucial to today's most competitive automobile platforms. M100 employs a dataflow parallel architecture, where compiler-architecture co-design orchestrates not only computation but, more critically, data movement across time and space. Leveraging dataflow computing efficiency, our hardware-software co-design improves system performance while reducing hardware complexity and cost. M100 largely eliminates caching: tensor computations are driven by compiler- and runtime-managed data streams flowing between computing elements and on/off-chip memories, yielding greater efficiency and scalability than cache-based systems. Another key principle was selecting the right operational granularity for scheduling, issuing, and execution across compiler, firmware, and hardware. Recognizing commonalities in AI workloads, we chose the tensor as the fundamental data element. M100 demonstrates general AI computing capability across diverse inference applications, including UniAD (for AD) and LLaMA (for LLMs). Benchmarks show M100 outperforms GPGPU architectures in AD applications with higher utilization, representing a promising direction for future general AI computing.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces M100, a dataflow parallel architecture developed by Li Auto for general-purpose AI inference targeting autonomous driving (UniAD), large language models (LLaMA), and intelligent human interactions. It describes a compiler-architecture co-design that orchestrates data movement across time and space using compiler- and runtime-managed streams instead of caches, selects the tensor as the fundamental scheduling granularity, and claims this yields higher performance, utilization, and scalability than GPGPU architectures while reducing hardware complexity.

Significance. If the performance claims and generality hold, M100 would represent a meaningful contribution to AI accelerator design by demonstrating how dataflow orchestration and tensor-level co-design can bridge the gap between versatile but inefficient GPGPUs and narrow DSAs. The approach of replacing caches with managed streams and using tensor granularity addresses well-known inefficiencies in data movement for AI workloads and could inform future hardware-software co-design efforts in automotive and LLM domains.

major comments (2)
  1. [Abstract] The central claim that 'Benchmarks show M100 outperforms GPGPU architectures in AD applications with higher utilization' is presented without quantitative results, baselines, utilization percentages, methodology, error bars, or comparison details, rendering the performance advantage impossible to evaluate or reproduce.
  2. [Abstract] The architecture description asserts that compiler- and runtime-managed data streams enable general AI computing across AD and LLMs, but provides no analysis of how static tensor scheduling and stream orchestration handle dynamic, data-dependent control flow in LLM inference (autoregressive generation, variable attention patterns) without prohibitive overhead or fallback mechanisms, which directly undermines the generality claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

point-by-point responses
  1. Referee: [Abstract] The central claim that 'Benchmarks show M100 outperforms GPGPU architectures in AD applications with higher utilization' is presented without quantitative results, baselines, utilization percentages, methodology, error bars, or comparison details, rendering the performance advantage impossible to evaluate or reproduce.

    Authors: We agree that the abstract claim would be stronger with supporting quantitative context. The full manuscript provides detailed benchmarks in the evaluation sections, including comparisons to GPGPU baselines for UniAD workloads, utilization rates, and methodology. To improve evaluability, we will revise the abstract to include concise quantitative highlights (e.g., relative performance gains and utilization improvements) drawn from those results. revision: yes

  2. Referee: [Abstract] The architecture description asserts that compiler- and runtime-managed data streams enable general AI computing across AD and LLMs, but provides no analysis of how static tensor scheduling and stream orchestration handle dynamic, data-dependent control flow in LLM inference (autoregressive generation, variable attention patterns) without prohibitive overhead or fallback mechanisms, which directly undermines the generality claim.

    Authors: The referee correctly notes the absence of explicit analysis on dynamic control flow. While the runtime-managed streams are intended to provide flexibility beyond purely static scheduling, the manuscript does not detail overheads for autoregressive generation or variable attention. We will add a dedicated subsection in the architecture or evaluation section analyzing these mechanisms, including how stream reconfiguration supports data-dependent patterns in LLaMA inference and associated costs. revision: yes
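
To make the disputed mechanism concrete: in autoregressive decoding, the key-value region that attention must stream grows with every generated token, so a one-shot static plan cannot cover a whole generation. The sketch below shows per-step descriptor re-issue, one plausible shape for "runtime-managed" streams; the `StreamDescriptor` fields, the KV layout, and the sizes are hypothetical, not M100's documented runtime.

```python
# Illustrates the referee's concern under assumed mechanics: only the
# stream length changes across decode steps, so a runtime that re-issues
# one small descriptor per step could keep orchestration cheap even
# though the plan is no longer fully static. Hypothetical throughout.
from dataclasses import dataclass

@dataclass
class StreamDescriptor:
    base: int    # byte offset of the KV block in device memory
    length: int  # bytes to stream at this decode step
    stride: int  # bytes between consecutive KV rows

def kv_descriptor(step: int, head_dim: int = 128, dtype_bytes: int = 2) -> StreamDescriptor:
    """Re-issued once per decode step; length grows with the sequence."""
    row = head_dim * dtype_bytes
    return StreamDescriptor(base=0, length=(step + 1) * row, stride=row)

for t in range(3):
    print(f"step {t}: stream {kv_descriptor(t).length} bytes")  # 256, 512, 768
```

Whether a re-issue this lightweight is what M100's runtime actually does, and what it costs per step, is precisely what the promised subsection would need to show.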

Circularity Check

0 steps flagged

No circularity: architecture paper lacks derivations or fitted predictions

full rationale

The paper is an architectural description of M100's dataflow design, compiler co-design, tensor granularity, and stream-based memory management. It presents no equations, no fitted parameters, no predictions derived from inputs, and no self-citation chains that reduce claims to prior work by the same authors. Performance assertions rest on unspecified benchmarks for UniAD and LLaMA rather than any self-referential construction. The central claims are therefore independent of the circularity patterns this audit screens for.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract relies on domain assumptions about AI workload commonalities and dataflow efficiency but introduces no free parameters, invented entities, or ad-hoc axioms beyond standard computer architecture premises.

axioms (1)
  • domain assumption: AI workloads share commonalities best captured by tensor operations as the fundamental scheduling unit
    Explicitly stated when selecting tensor granularity for scheduling, issuing, and execution.

pith-pipeline@v0.9.0 · 5702 in / 1206 out tokens · 45144 ms · 2026-05-10T05:46:47.727281+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

38 extracted references · 8 canonical work pages · 5 internal anchors

  1. [1]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter et al., “π0: A vision-language-action flow model for general robot control,” arXiv preprint arXiv:2410.24164, 2024

  2. [2]

    RT-1: Robotics Transformer for Real-World Control at Scale

    A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu et al., “RT-1: Robotics transformer for real-world control at scale,” arXiv preprint arXiv:2212.06817, 2022

  3. [3]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid et al., “RT-2: Vision-language-action models transfer web knowledge to robotic control,” in Conference on Robot Learning. PMLR, 2023, pp. 2165–2183

  4. [4]

    Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    H. Wu, Y. Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong, “Unleashing large-scale video generative pre-training for visual robot manipulation,” arXiv preprint arXiv:2312.13139, 2023

  5. [5]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    C.-L. Cheang, G. Chen, Y. Jing, T. Kong, H. Li, Y. Li, Y. Liu, H. Wu, J. Xu, Y. Yang et al., “GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation,” arXiv preprint arXiv:2410.06158, 2024

  6. [6]

    NVIDIA Jetson AGX Orin Series: A Giant Leap Forward for Robotics and Edge AI Applications (Technical Brief)

    L. S. Karumbunathan, “NVIDIA Jetson AGX Orin Series: A Giant Leap Forward for Robotics and Edge AI Applications, Technical Brief,” https://www.nvidia.com/content/dam/en-zz/Solutions/gtcf21/jetson-orin/nvidia-jetson-agx-orin-technical-brief.pdf, Jul. 2022, accessed: 2025-08-21

  7. [7]

    NVIDIA Jetson Thor

    NVIDIA Corporation, “NVIDIA Jetson Thor,” 2025, accessed: 2025-08-21. [Online]. Available: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-thor/

  8. [8]

    Computer and Redundancy Solution for the Full Self-Driving Computer

    P. Bannon, G. Venkataramanan, D. D. Sarma, and E. Talpes, “Computer and redundancy solution for the full self-driving computer,” in 2019 IEEE Hot Chips 31 Symposium (HCS), 2019, pp. 1–22

  9. [9]

    Samsung to Make Tesla's HW 4.0 Self-Driving Auto Chip

    J.-S. Hwang, “Samsung to make Tesla’s HW 4.0 self-driving auto chip,” https://www.kedglobal.com/semiconductors/newsView/ked202109230009, 2023, accessed: 2025-08-21

  10. [10]

    Elon Musk Reveals the First Details About Hardware 5 Autopilot Computer and Sensors

    C. Agatie, “Elon Musk reveals the first details about Hardware 5 Autopilot computer and sensors,” https://www.autoevolution.com/news/elon-musk-reveals-the-first-details-about-hardware-5-autopilot-computer-and-sensors-235405.html, 2024, accessed: 2025-08-21

  11. [11]

    Data Flow Supercomputers

    J. B. Dennis, “Data flow supercomputers,” Computer, vol. 13, no. 11, pp. 48–56, 1980

  12. [12]

    Advances in the Dataflow Computational Model

    W. A. Najjar, E. A. Lee, and G. R. Gao, “Advances in the dataflow computational model,” Parallel Computing, vol. 25, no. 13–14, pp. 1907–1929, 1999

  13. [13]

    Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads

    D. Abts, J. Ross, J. Sparling, M. Wong-VanHaren, M. Baker, T. Hawkins, A. Bell, J. Thompson, T. Kahsai, G. Kimmell, J. Hwang, R. Leslie-Hurd, M. Bye, E. Creswick, M. Boyd, M. Venigalla, E. Laforge, J. Purdy, P. Kamath, D. Maheshwari, M. Beidler, G. Rosseel, O. Ahmad, G. Gagarin, R. Czekalski, A. Rane, S. Parmar, J. Werner, J. Sproch, A. Macias, and B. Ku...

  14. [14]

    A Software-Defined Tensor Streaming Multiprocessor for Large-Scale Machine Learning

    D. Abts, G. Kimmell, A. Ling, J. Kim, M. Boyd, A. Bitar, S. Parmar, I. Ahmed, R. DiCecco, D. Han, J. Thompson, M. Bye, J. Hwang, J. Fowers, P. Lillian, A. Murthy, E. Mehtabuddin, C. Tekur, T. Sohmers, K. Kang, S. Maresh, and J. Ross, “A software-defined tensor streaming multiprocessor for large-scale machine learning,” in Proceedings of the 49th Annual In...

  15. [15]

    Plasticine: A Reconfigurable Architecture for Parallel Patterns

    R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao, S. Hadjis, A. Pedram, C. Kozyrakis, and K. Olukotun, “Plasticine: A reconfigurable architecture for parallel patterns,” ACM SIGARCH Computer Architecture News, vol. 45, no. 2, pp. 389–402, 2017

  16. [16]

    SambaNova SN10 RDU: Accelerating Software 2.0 with Dataflow

    R. Prabhakar and S. Jairath, “SambaNova SN10 RDU: Accelerating Software 2.0 with dataflow,” in 2021 IEEE Hot Chips 33 Symposium (HCS). IEEE, 2021, pp. 1–37

  17. [17]

    SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

    R. Prabhakar, R. Sivaramakrishnan, D. Gandhi, Y. Du, M. Wang, X. Song, K. Zhang, T. Gao, A. Wang, X. Li et al., “SambaNova SN40L: Scaling the AI memory wall with dataflow and composition of experts,” in 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2024, pp. 1353–1366

  18. [18]

    Wafer-Scale AI: GPU Impossible Performance

    S. Lie, “Wafer-Scale AI: GPU Impossible Performance,” in 2024 IEEE Hot Chips 36 Symposium (HCS). Los Alamitos, CA, USA: IEEE Computer Society, Aug. 2024, pp. 1–71. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/HCS61935.2024.10664673

  19. [19]

    Cerebras Architecture Deep Dive: First Look Inside the Hardware/Software Co-Design for Deep Learning

    ——, “Cerebras architecture deep dive: First look inside the hardware/software co-design for deep learning,” IEEE Micro, vol. 43, no. 3, pp. 18–30, 2023

  20. [20]

    Tenstorrent Scales AI Performance: Architecture Leads in Data-Center Power Efficiency

    L. Gwennap, “Tenstorrent scales AI performance: Architecture leads in data-center power efficiency,” Microprocessor Report, Tech. Rep., Apr. 2020

  21. [21]

    Blackhole & TT-Metalium: The Standalone AI Computer and Its Programming Model

    J. Vasiljevic and D. Capalija, “Blackhole & TT-Metalium: The standalone AI computer and its programming model,” in 2024 IEEE Hot Chips 36 Symposium (HCS). IEEE Computer Society, Los Alamitos, CA, USA, 2024, pp. 1–30

  22. [22]

    The Microarchitecture of Dojo, Tesla's Exa-Scale Computer

    E. Talpes, D. D. Sarma, D. Williams, S. Arora, T. Kunjan, B. Floering, A. Jalote, C. Hsiong, C. Poorna, V. Samant et al., “The microarchitecture of Dojo, Tesla’s exa-scale computer,” IEEE Micro, vol. 43, no. 3, pp. 31–39, 2023

  23. [23]

    AMD XDNA NPU in Ryzen AI Processors

    A. Rico, S. Pareek, J. Cabezas, D. Clarke, B. Ozgul, F. Barat, Y. Fu, S. Münz, D. Stuart, P. Schlangen et al., “AMD XDNA™ NPU in Ryzen™ AI processors,” IEEE Micro, 2024

  24. [24]

    Evaluation of Xilinx Versal Architecture for Next-Gen Edge Computing in Space

    N. Perryman, C. Wilson, and A. George, “Evaluation of Xilinx Versal architecture for next-gen edge computing in space,” in 2023 IEEE Aerospace Conference. IEEE, 2023, pp. 1–11

  25. [25]

    NeuronFlow: A Hybrid Neuromorphic–Dataflow Processor Architecture for AI Workloads

    O. Moreira, A. Yousefzadeh, F. Chersi, A. Kapoor, R.-J. Zwartenkot, P. Qiao, G. Cinserin, M. A. Khoei, M. Lindwer, and J. Tapson, “NeuronFlow: A hybrid neuromorphic–dataflow processor architecture for AI workloads,” in 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS). IEEE, 2020, pp. 1–5

  26. [26]

    In-Datacenter Performance Analysis of a Tensor Processing Unit

    N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 1–12

  27. [27]

    TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings

    N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles et al., “TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–14

  28. [28]

    MTIA: First Generation Silicon Targeting Meta's Recommendation Systems

    A. Firoozshahian, J. Coburn, R. Levenstein, R. Nattoji, A. Kamath, O. Wu, G. Grewal, H. Aepala, B. Jakka, B. Dreyer et al., “MTIA: First generation silicon targeting Meta’s recommendation systems,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–13

  29. [29]

    Meta's Second Generation AI Chip: Model-Chip Co-Design and Productionization Experiences

    J. Coburn, C. Tang, S. A. Asal, N. Agrawal, R. Chinta, H. Dixit, B. Dodds, S. Dwarakapuram, A. Firoozshahian, C. Gao et al., “Meta’s second generation AI chip: Model-chip co-design and productionization experiences,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 1689–1702

  30. [30]

    PACT XPP: A Self-Reconfigurable Data Processing Architecture

    V. Baumgarte, G. Ehlers, F. May, A. Nückel, M. Vorbach, and M. Weinhardt, “PACT XPP—a self-reconfigurable data processing architecture,” The Journal of Supercomputing, vol. 26, no. 2, pp. 167–184, 2003

  31. [31]

    Dynamically Specialized Datapaths for Energy Efficient Computing

    V. Govindaraju, C.-H. Ho, and K. Sankaralingam, “Dynamically specialized datapaths for energy efficient computing,” in 2011 IEEE 17th International Symposium on High Performance Computer Architecture. IEEE, 2011, pp. 503–514

  32. [32]

    Morphosys: an integrated reconfigurable system for data- parallel and computation-intensive applications,

    H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho, “Morphosys: an integrated reconfigurable system for data- parallel and computation-intensive applications,”IEEE transactions on computers, vol. 49, no. 5, pp. 465–481, 2000

  33. [33]

    The gpu computing era,

    J. Nickolls and W. J. Dally, “The gpu computing era,”IEEE micro, vol. 30, no. 2, pp. 56–69, 2010

  34. [34]

    Scarpazza

    Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, “Dissecting the nvidia volta gpu architecture via microbenchmarking,”arXiv preprint arXiv:1804.06826, 2018

  35. [35]

    An implementation of the codelet model,

    J. Suettlerlein, S. Zuckerman, and G. R. Gao, “An implementation of the codelet model,” inEuropean Conference on Parallel Processing. Springer, 2013, pp. 633–644

  36. [36]

    Earth: an efficient architecture for running threads,

    K. B. Theobald, “Earth: an efficient architecture for running threads,” thesis, 1999

  37. [37]

    Planning-oriented autonomous driving,

    Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wanget al., “Planning-oriented autonomous driving,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 17 853–17 862

  38. [38]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosaleet al., “Llama 2: Open foundation and fine-tuned chat models,”arXiv preprint arXiv:2307.09288, 2023. 12