pith. machine review for the scientific record.

arxiv: 2604.17862 · v1 · submitted 2026-04-20 · 💻 cs.LG · cs.AR

Recognition: unknown

M100: An Orchestrated Dataflow Architecture Powering General AI Computing

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:46 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AR
keywords: dataflow architecture · AI inference · autonomous driving · large language models · compiler-architecture co-design · tensor-based scheduling · general AI computing · M100

The pith

M100 uses compiler-orchestrated dataflow to enable efficient general AI inference for autonomous driving and large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents M100 as a new architecture aimed at providing versatile and efficient computing for AI inference tasks. It claims that a dataflow design, combined with tight compiler and hardware integration, allows explicit management of data movement instead of relying on caches, leading to improved performance and reduced complexity. This matters for applications in autonomous vehicles where both driving perception and language-based interactions are needed, as well as for running large models. Benchmarks indicate it surpasses standard GPU setups in utilization for driving tasks. If successful, it points to dataflow methods as a way to balance generality and efficiency in AI hardware.

Core claim

M100 is a dataflow parallel architecture that uses compiler-architecture co-design to orchestrate computation and, crucially, data movement across time and space. Tensor computations are driven by compiler- and runtime-managed data streams flowing between computing elements and on/off-chip memories, largely eliminating caching. With the tensor chosen as the fundamental scheduling unit, M100 demonstrates capability on UniAD for autonomous driving and LLaMA for LLMs, outperforming GPGPU architectures in AD applications through higher utilization.
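
Read concretely, the claim is that data movement is decided at compile time rather than discovered by a cache hierarchy at run time. A minimal sketch of that contrast follows; the names (`Step`, `schedule_matmul`), the tile size, and the schedule format are hypothetical, since the paper does not publish M100's intermediate representation.

```python
# A toy "compiler" pass in the spirit of the core claim: partition
# C = A @ B into mini-tensors and emit an explicit schedule in which
# every transfer is named ahead of time, so the hardware executes
# streams instead of reacting through a cache. All names are invented
# for illustration; this is not M100's IR.
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    op: str       # "load", "matmul", or "store"
    operand: str  # which mini-tensor (tile) this step touches
    route: str    # explicit data movement, e.g. "HBM->SRAM"

def schedule_matmul(m: int, n: int, k: int, tile: int = 128) -> List[Step]:
    """Emit a fully static load/compute/store schedule for C = A @ B."""
    steps: List[Step] = []
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                steps.append(Step("load", f"A[{i}:{i+tile},{p}:{p+tile}]", "HBM->SRAM"))
                steps.append(Step("load", f"B[{p}:{p+tile},{j}:{j+tile}]", "HBM->SRAM"))
                steps.append(Step("matmul", f"C[{i}:{i+tile},{j}:{j+tile}]", "SRAM"))
            steps.append(Step("store", f"C[{i}:{i+tile},{j}:{j+tile}]", "SRAM->HBM"))
    return steps

print(len(schedule_matmul(256, 256, 256)))  # 28 steps for a 2x2x2 tiling
```

The question such a schedule leaves open, and the one the referee report below presses on, is what happens when shapes or control flow are not known at compile time.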

What carries the argument

Orchestrated dataflow architecture with compiler-architecture co-design managing data streams at tensor granularity to eliminate most caching.

Load-bearing premise

The compiler-architecture co-design can orchestrate data movement and scheduling for diverse and changing AI workloads without excessive overhead or sacrificing broad applicability.

What would settle it

A benchmark comparison on a newly released AI model for autonomous driving or LLM inference where M100 fails to show higher utilization or requires significantly more compilation time than a GPGPU baseline.

Figures

Figures reproduced from arXiv: 2604.17862 by Changkui Mao, Changsong Wu, Chao Lu, Chao Suo, Cheng Qian, Chun Yang, Danyang Zhu, Hengchang Xiong, Hongzhan Lu, Hongzhen Liu, Jiafu Liu, Jie Chen, Jie Dai, Junfeng Tang, Kai Liu, Kun Li, Lipeng Ge, Meng Sun, Min Luo, Peng Chen, Peng Wang, Shaodong Yang, Shibin Tang, Shibo Chen, Weikang Zhang, Xiaobo Du, Xiao Ling, Xin Wu, Yang Liu, Yan Xie, Yihua Jin, Yi Jiang, Yin Huang, Yuli Zhang, Zhen Yuan, Zhiyuan Man, Zhongxiao Yao.

Figure 1. Computing blocks in M100, each composed of three computing … view at source ↗
Figure 2. Architecture of the M100 NPU memory system without multi-level … view at source ↗
Figure 4. The high-level block diagram of the M100 SoC. view at source ↗
Figure 5. The high-level architecture of the M100 NPU. view at source ↗
Figure 7. Structure of a TPB cluster. The cluster-level hierarchy exists for two reasons: first, four TPBs share common resources (the instruction buffer, ICB and DRB nodes, and a RISC-V CPU), allowing more silicon area to be allocated to tensor processing and thus improving compute density; second, … view at source ↗
Figure 9. Architecture of the HBSM. The 2 MB HBSM SRAM is uniformly shared across all TPB functional units. view at source ↗
Figure 10. An example of a 3-level TWU. A TPB functional unit typically has two or more input TWUs and one output TWU; each TWU is configured by the TPB instruction for that functional unit, which specifies the number of nested loop levels and the Initial, Step, and Final values for each level. view at source ↗
Figure 11. Architecture of the TCU. The Tensor Computing Unit accelerates tensor contraction operations with a dense array of compute elements; to sustain high throughput under limited memory bandwidth, data reuse is essential. view at source ↗
Figure 12. The architecture of the CVU. view at source ↗
Figure 13. Overview of the M100 AI compiler toolchain. The space-time scheduler maps a neural network subgraph onto the M100 NPU hardware; if necessary, large tensors are partitioned into mini-tensors that are passed … view at source ↗
Figure 14. Space-time scheduler subgraph mapping and tensor streaming on … view at source ↗
Figure 15. UniAD Framework. view at source ↗
Figure 16. Detailed execution trace of M100 TPB instructions collected by the … In the trace, sustained bars indicate activity and gaps denote idle or waiting periods; throughout most of the sampling window, the DMAs in the CCB, along with the TCU, CVU, CSU, and GSDU in one of the TPBs, remain continuously active with substantial overlap in task execution, indicating high hardware utilization and the architecture's strong parallel execution capabilities. view at source ↗
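
Of the mechanisms above, Figure 10's Tensor Walking Unit (TWU) is the most self-contained: per the caption, a TWU is programmed with a loop-level count and Initial, Step, and Final values per level, and then emits an address sequence. The sketch below models such a generator; treating Final as an exclusive bound and the emitted address as the sum of the per-level counters are assumptions the excerpt does not confirm.

```python
# A minimal model of a TWU-style address generator, assuming the
# (initial, step, final) loop configuration described in Figure 10.
# The walk order and the address formula are illustrative guesses,
# not the M100 ISA.
from itertools import product
from typing import Iterator, Sequence, Tuple

Loop = Tuple[int, int, int]  # (initial, step, final); final assumed exclusive

def twu_addresses(levels: Sequence[Loop]) -> Iterator[int]:
    """Walk the configured nested loops, outermost level first, and
    yield one address per step as the sum of the per-level counters."""
    counters = [range(init, final, step) for (init, step, final) in levels]
    for values in product(*counters):
        yield sum(values)

# Example: two 4x4 tiles out of an 8-column row-major matrix.
addrs = list(twu_addresses([
    (0, 4, 8),   # level 0: 2 tiles along the row, 4 columns apart
    (0, 8, 32),  # level 1: 4 rows per tile, row pitch 8
    (0, 1, 4),   # level 2: 4 columns per tile, unit stride
]))
assert addrs[:5] == [0, 1, 2, 3, 8]  # first tile: row 0, then row 1
assert addrs[16:18] == [4, 5]        # second tile starts at column 4
```

The appeal of such a unit is that the compiler can express a tiled tensor walk as a few triples rather than a stream of per-element addresses, which is what lets tensors, not cache lines, be the unit of movement.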
original abstract

As deep learning-based AI technologies gain momentum, the demand for general-purpose AI computing architectures continues to grow. While GPGPU-based architectures offer versatility for diverse AI workloads, they often fall short in efficiency and cost-effectiveness. Various Domain-Specific Architectures (DSAs) excel at particular AI tasks but struggle to extend across broader applications or adapt to the rapidly evolving AI landscape. M100 is Li Auto's response: a performant, cost-effective architecture for AI inference in Autonomous Driving (AD), Large Language Models (LLMs), and intelligent human interactions, domains crucial to today's most competitive automobile platforms. M100 employs a dataflow parallel architecture, where compiler-architecture co-design orchestrates not only computation but, more critically, data movement across time and space. Leveraging dataflow computing efficiency, our hardware-software co-design improves system performance while reducing hardware complexity and cost. M100 largely eliminates caching: tensor computations are driven by compiler- and runtime-managed data streams flowing between computing elements and on/off-chip memories, yielding greater efficiency and scalability than cache-based systems. Another key principle was selecting the right operational granularity for scheduling, issuing, and execution across compiler, firmware, and hardware. Recognizing commonalities in AI workloads, we chose the tensor as the fundamental data element. M100 demonstrates general AI computing capability across diverse inference applications, including UniAD (for AD) and LLaMA (for LLMs). Benchmarks show M100 outperforms GPGPU architectures in AD applications with higher utilization, representing a promising direction for future general AI computing.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces M100, a dataflow parallel architecture developed by Li Auto for general-purpose AI inference targeting autonomous driving (UniAD), large language models (LLaMA), and intelligent human interactions. It describes a compiler-architecture co-design that orchestrates data movement across time and space using compiler- and runtime-managed streams instead of caches, selects the tensor as the fundamental scheduling granularity, and claims this yields higher performance, utilization, and scalability than GPGPU architectures while reducing hardware complexity.

Significance. If the performance claims and generality hold, M100 would represent a meaningful contribution to AI accelerator design by demonstrating how dataflow orchestration and tensor-level co-design can bridge the gap between versatile but inefficient GPGPUs and narrow DSAs. The approach of replacing caches with managed streams and using tensor granularity addresses well-known inefficiencies in data movement for AI workloads and could inform future hardware-software co-design efforts in automotive and LLM domains.

major comments (2)
  1. [Abstract] The central claim that 'Benchmarks show M100 outperforms GPGPU architectures in AD applications with higher utilization' is presented without quantitative results, baselines, utilization percentages, methodology, error bars, or comparison details, rendering the performance advantage impossible to evaluate or reproduce.
  2. [Abstract] The architecture description asserts that compiler- and runtime-managed data streams enable general AI computing across AD and LLMs, but provides no analysis of how static tensor scheduling and stream orchestration handle dynamic, data-dependent control flow in LLM inference (autoregressive generation, variable attention patterns) without prohibitive overhead or fallback mechanisms, which directly undermines the generality claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

point-by-point responses
  1. Referee: [Abstract] The central claim that 'Benchmarks show M100 outperforms GPGPU architectures in AD applications with higher utilization' is presented without quantitative results, baselines, utilization percentages, methodology, error bars, or comparison details, rendering the performance advantage impossible to evaluate or reproduce.

    Authors: We agree that the abstract claim would be stronger with supporting quantitative context. The full manuscript provides detailed benchmarks in the evaluation sections, including comparisons to GPGPU baselines for UniAD workloads, utilization rates, and methodology. To improve evaluability, we will revise the abstract to include concise quantitative highlights (e.g., relative performance gains and utilization improvements) drawn from those results. revision: yes

  2. Referee: [Abstract] The architecture description asserts that compiler- and runtime-managed data streams enable general AI computing across AD and LLMs, but provides no analysis of how static tensor scheduling and stream orchestration handle dynamic, data-dependent control flow in LLM inference (autoregressive generation, variable attention patterns) without prohibitive overhead or fallback mechanisms, which directly undermines the generality claim.

    Authors: The referee correctly notes the absence of explicit analysis on dynamic control flow. While the runtime-managed streams are intended to provide flexibility beyond purely static scheduling, the manuscript does not detail overheads for autoregressive generation or variable attention. We will add a dedicated subsection in the architecture or evaluation section analyzing these mechanisms, including how stream reconfiguration supports data-dependent patterns in LLaMA inference and associated costs. revision: yes
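
To make the disputed mechanism concrete: in autoregressive decoding, the key-value region that attention must stream grows with every generated token, so a one-shot static plan cannot cover a whole generation. The sketch below shows per-step descriptor re-issue, one plausible shape for "runtime-managed" streams; the `StreamDescriptor` fields, the KV layout, and the sizes are hypothetical, not M100's documented runtime.

```python
# Illustrates the referee's concern under assumed mechanics: only the
# stream length changes across decode steps, so a runtime that re-issues
# one small descriptor per step could keep orchestration cheap even
# though the plan is no longer fully static. Hypothetical throughout.
from dataclasses import dataclass

@dataclass
class StreamDescriptor:
    base: int    # byte offset of the KV block in device memory
    length: int  # bytes to stream at this decode step
    stride: int  # bytes between consecutive KV rows

def kv_descriptor(step: int, head_dim: int = 128, dtype_bytes: int = 2) -> StreamDescriptor:
    """Re-issued once per decode step; length grows with the sequence."""
    row = head_dim * dtype_bytes
    return StreamDescriptor(base=0, length=(step + 1) * row, stride=row)

for t in range(3):
    print(f"step {t}: stream {kv_descriptor(t).length} bytes")  # 256, 512, 768
```

Whether a re-issue this lightweight is what M100's runtime actually does, and what it costs per step, is precisely what the promised subsection would need to show.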

Circularity Check

0 steps flagged

No circularity: architecture paper lacks derivations or fitted predictions

full rationale

The paper is an architectural description of M100's dataflow design, compiler co-design, tensor granularity, and stream-based memory management. It presents no equations, no fitted parameters, no predictions derived from inputs, and no self-citation chains that reduce claims to prior work by the same authors. Performance assertions rest on unspecified benchmarks for UniAD and LLaMA rather than any self-referential construction. The central claims are therefore independent of the circularity patterns this audit screens for.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract relies on domain assumptions about AI workload commonalities and dataflow efficiency but introduces no free parameters, invented entities, or ad-hoc axioms beyond standard computer architecture premises.

axioms (1)
  • domain assumption: AI workloads share commonalities best captured by tensor operations as the fundamental scheduling unit
    Explicitly stated when selecting tensor granularity for scheduling, issuing, and execution.

pith-pipeline@v0.9.0 · 5702 in / 1206 out tokens · 45144 ms · 2026-05-10T05:46:47.727281+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

38 extracted references · 8 canonical work pages · 5 internal anchors

  1. [1]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter et al., “π0: A vision-language-action flow model for general robot control,” arXiv preprint arXiv:2410.24164, 2024

  2. [2]

    RT-1: Robotics Transformer for Real-World Control at Scale

    A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu et al., “RT-1: Robotics transformer for real-world control at scale,” arXiv preprint arXiv:2212.06817, 2022

  3. [3]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid et al., “RT-2: Vision-language-action models transfer web knowledge to robotic control,” in Conference on Robot Learning. PMLR, 2023, pp. 2165–2183

  4. [4]

    Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    H. Wu, Y. Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong, “Unleashing large-scale video generative pre-training for visual robot manipulation,” arXiv preprint arXiv:2312.13139, 2023

  5. [5]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    C.-L. Cheang, G. Chen, Y. Jing, T. Kong, H. Li, Y. Li, Y. Liu, H. Wu, J. Xu, Y. Yang et al., “GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation,” arXiv preprint arXiv:2410.06158, 2024

  6. [6]

    NVIDIA Jetson AGX Orin Series: A Giant Leap Forward for Robotics and Edge AI Applications (Technical Brief)

    L. S. Karumbunathan, “NVIDIA Jetson AGX Orin Series: A Giant Leap Forward for Robotics and Edge AI Applications, Technical Brief,” https://www.nvidia.com/content/dam/en-zz/Solutions/gtcf21/jetson-orin/nvidia-jetson-agx-orin-technical-brief.pdf, Jul. 2022, accessed: 2025-08-21

  7. [7]

    NVIDIA Jetson Thor

    NVIDIA Corporation, “NVIDIA Jetson Thor,” 2025, accessed: 2025-08-21. [Online]. Available: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-thor/

  8. [8]

    Computer and Redundancy Solution for the Full Self-Driving Computer

    P. Bannon, G. Venkataramanan, D. D. Sarma, and E. Talpes, “Computer and redundancy solution for the full self-driving computer,” in 2019 IEEE Hot Chips 31 Symposium (HCS), 2019, pp. 1–22

  9. [9]

    Samsung to Make Tesla's HW 4.0 Self-Driving Auto Chip

    J.-S. Hwang, “Samsung to make Tesla’s HW 4.0 self-driving auto chip,” https://www.kedglobal.com/semiconductors/newsView/ked202109230009, 2023, accessed: 2025-08-21

  10. [10]

    Elon Musk Reveals the First Details About Hardware 5 Autopilot Computer and Sensors

    C. Agatie, “Elon Musk reveals the first details about Hardware 5 Autopilot computer and sensors,” https://www.autoevolution.com/news/elon-musk-reveals-the-first-details-about-hardware-5-autopilot-computer-and-sensors-235405.html, 2024, accessed: 2025-08-21

  11. [11]

    Data Flow Supercomputers

    J. B. Dennis, “Data flow supercomputers,” Computer, vol. 13, no. 11, pp. 48–56, 1980

  12. [12]

    Advances in the Dataflow Computational Model

    W. A. Najjar, E. A. Lee, and G. R. Gao, “Advances in the dataflow computational model,” Parallel Computing, vol. 25, no. 13–14, pp. 1907–1929, 1999

  13. [13]

    Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads

    D. Abts, J. Ross, J. Sparling, M. Wong-VanHaren, M. Baker, T. Hawkins, A. Bell, J. Thompson, T. Kahsai, G. Kimmell, J. Hwang, R. Leslie-Hurd, M. Bye, E. Creswick, M. Boyd, M. Venigalla, E. Laforge, J. Purdy, P. Kamath, D. Maheshwari, M. Beidler, G. Rosseel, O. Ahmad, G. Gagarin, R. Czekalski, A. Rane, S. Parmar, J. Werner, J. Sproch, A. Macias, and B. Ku...

  14. [14]

    A Software-Defined Tensor Streaming Multiprocessor for Large-Scale Machine Learning

    D. Abts, G. Kimmell, A. Ling, J. Kim, M. Boyd, A. Bitar, S. Parmar, I. Ahmed, R. DiCecco, D. Han, J. Thompson, M. Bye, J. Hwang, J. Fowers, P. Lillian, A. Murthy, E. Mehtabuddin, C. Tekur, T. Sohmers, K. Kang, S. Maresh, and J. Ross, “A software-defined tensor streaming multiprocessor for large-scale machine learning,” in Proceedings of the 49th Annual In...

  15. [15]

    Plasticine: A Reconfigurable Architecture for Parallel Patterns

    R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao, S. Hadjis, A. Pedram, C. Kozyrakis, and K. Olukotun, “Plasticine: A reconfigurable architecture for parallel patterns,” ACM SIGARCH Computer Architecture News, vol. 45, no. 2, pp. 389–402, 2017

  16. [16]

    SambaNova SN10 RDU: Accelerating Software 2.0 with Dataflow

    R. Prabhakar and S. Jairath, “SambaNova SN10 RDU: Accelerating Software 2.0 with dataflow,” in 2021 IEEE Hot Chips 33 Symposium (HCS). IEEE, 2021, pp. 1–37

  17. [17]

    SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

    R. Prabhakar, R. Sivaramakrishnan, D. Gandhi, Y. Du, M. Wang, X. Song, K. Zhang, T. Gao, A. Wang, X. Li et al., “SambaNova SN40L: Scaling the AI memory wall with dataflow and composition of experts,” in 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2024, pp. 1353–1366

  18. [18]

    Wafer-Scale AI: GPU Impossible Performance

    S. Lie, “Wafer-Scale AI: GPU Impossible Performance,” in 2024 IEEE Hot Chips 36 Symposium (HCS). Los Alamitos, CA, USA: IEEE Computer Society, Aug. 2024, pp. 1–71. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/HCS61935.2024.10664673

  19. [19]

    Cerebras Architecture Deep Dive: First Look Inside the Hardware/Software Co-Design for Deep Learning

    ——, “Cerebras architecture deep dive: First look inside the hardware/software co-design for deep learning,” IEEE Micro, vol. 43, no. 3, pp. 18–30, 2023

  20. [20]

    Tenstorrent Scales AI Performance: Architecture Leads in Data-Center Power Efficiency

    L. Gwennap, “Tenstorrent scales AI performance: Architecture leads in data-center power efficiency,” Microprocessor Report, Tech. Rep., Apr. 2020

  21. [21]

    Blackhole & TT-Metalium: The Standalone AI Computer and Its Programming Model

    J. Vasiljevic and D. Capalija, “Blackhole & TT-Metalium: The standalone AI computer and its programming model,” in 2024 IEEE Hot Chips 36 Symposium (HCS). IEEE Computer Society, Los Alamitos, CA, USA, 2024, pp. 1–30

  22. [22]

    The Microarchitecture of Dojo, Tesla's Exa-Scale Computer

    E. Talpes, D. D. Sarma, D. Williams, S. Arora, T. Kunjan, B. Floering, A. Jalote, C. Hsiong, C. Poorna, V. Samant et al., “The microarchitecture of Dojo, Tesla’s exa-scale computer,” IEEE Micro, vol. 43, no. 3, pp. 31–39, 2023

  23. [23]

    AMD XDNA NPU in Ryzen AI Processors

    A. Rico, S. Pareek, J. Cabezas, D. Clarke, B. Ozgul, F. Barat, Y. Fu, S. Münz, D. Stuart, P. Schlangen et al., “AMD XDNA™ NPU in Ryzen™ AI processors,” IEEE Micro, 2024

  24. [24]

    Evaluation of Xilinx Versal Architecture for Next-Gen Edge Computing in Space

    N. Perryman, C. Wilson, and A. George, “Evaluation of Xilinx Versal architecture for next-gen edge computing in space,” in 2023 IEEE Aerospace Conference. IEEE, 2023, pp. 1–11

  25. [25]

    NeuronFlow: A Hybrid Neuromorphic–Dataflow Processor Architecture for AI Workloads

    O. Moreira, A. Yousefzadeh, F. Chersi, A. Kapoor, R.-J. Zwartenkot, P. Qiao, G. Cinserin, M. A. Khoei, M. Lindwer, and J. Tapson, “NeuronFlow: A hybrid neuromorphic–dataflow processor architecture for AI workloads,” in 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS). IEEE, 2020, pp. 1–5

  26. [26]

    In-Datacenter Performance Analysis of a Tensor Processing Unit

    N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 1–12

  27. [27]

    TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings

    N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles et al., “TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–14

  28. [28]

    MTIA: First Generation Silicon Targeting Meta's Recommendation Systems

    A. Firoozshahian, J. Coburn, R. Levenstein, R. Nattoji, A. Kamath, O. Wu, G. Grewal, H. Aepala, B. Jakka, B. Dreyer et al., “MTIA: First generation silicon targeting Meta’s recommendation systems,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–13

  29. [29]

    Meta's Second Generation AI Chip: Model-Chip Co-Design and Productionization Experiences

    J. Coburn, C. Tang, S. A. Asal, N. Agrawal, R. Chinta, H. Dixit, B. Dodds, S. Dwarakapuram, A. Firoozshahian, C. Gao et al., “Meta’s second generation AI chip: Model-chip co-design and productionization experiences,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 1689–1702

  30. [30]

    PACT XPP: A Self-Reconfigurable Data Processing Architecture

    V. Baumgarte, G. Ehlers, F. May, A. Nückel, M. Vorbach, and M. Weinhardt, “PACT XPP—a self-reconfigurable data processing architecture,” The Journal of Supercomputing, vol. 26, no. 2, pp. 167–184, 2003

  31. [31]

    Dynamically Specialized Datapaths for Energy Efficient Computing

    V. Govindaraju, C.-H. Ho, and K. Sankaralingam, “Dynamically specialized datapaths for energy efficient computing,” in 2011 IEEE 17th International Symposium on High Performance Computer Architecture. IEEE, 2011, pp. 503–514

  32. [32]

    Morphosys: an integrated reconfigurable system for data- parallel and computation-intensive applications,

    H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho, “Morphosys: an integrated reconfigurable system for data- parallel and computation-intensive applications,”IEEE transactions on computers, vol. 49, no. 5, pp. 465–481, 2000

  33. [33]

    The gpu computing era,

    J. Nickolls and W. J. Dally, “The gpu computing era,”IEEE micro, vol. 30, no. 2, pp. 56–69, 2010

  34. [34]

    Scarpazza

    Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, “Dissecting the nvidia volta gpu architecture via microbenchmarking,”arXiv preprint arXiv:1804.06826, 2018

  35. [35]

    An implementation of the codelet model,

    J. Suettlerlein, S. Zuckerman, and G. R. Gao, “An implementation of the codelet model,” inEuropean Conference on Parallel Processing. Springer, 2013, pp. 633–644

  36. [36]

    Earth: an efficient architecture for running threads,

    K. B. Theobald, “Earth: an efficient architecture for running threads,” thesis, 1999

  37. [37]

    Planning-oriented autonomous driving,

    Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wanget al., “Planning-oriented autonomous driving,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 17 853–17 862

  38. [38]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosaleet al., “Llama 2: Open foundation and fine-tuned chat models,”arXiv preprint arXiv:2307.09288, 2023. 12