M100: An Orchestrated Dataflow Architecture Powering General AI Computing
Pith reviewed 2026-05-10 05:46 UTC · model grok-4.3
The pith
M100 uses compiler-orchestrated dataflow to enable efficient general AI inference for autonomous driving and large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
M100 is a dataflow parallel architecture that uses compiler-architecture co-design to orchestrate computation and, crucially, data movement across time and space. Tensor computations are driven by compiler- and runtime-managed data streams flowing between computing elements and on/off-chip memories, largely eliminating caching. With the tensor chosen as the fundamental scheduling unit, M100 demonstrates capability on UniAD for autonomous driving and LLaMA for LLMs, outperforming GPGPU architectures in AD applications through higher utilization.
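The contrast with cache-based execution can be made concrete with a toy model (our sketch; the names and schedule format are assumptions, not M100's actual toolchain or ISA): a "compiler" emits an explicit schedule of tensor-granularity loads, multiply-accumulates, and stores, and the "hardware" simply executes it against a small scratchpad, so no cache decides what stays resident.

```python
# Toy sketch of compiler-orchestrated, tensor-granularity dataflow
# (illustrative only -- not M100's actual toolchain or ISA). The "compiler"
# emits a static schedule of tile-level steps; the "hardware" executes them
# in order. Every data movement is explicit in the program itself.

def compile_matmul_schedule(m, k, n, t):
    """Emit a static schedule of tensor-granularity steps for C = A @ B."""
    sched = []
    for i in range(0, m, t):
        for j in range(0, n, t):
            sched.append(("alloc_acc", (i, j)))
            for p in range(0, k, t):
                sched.append(("load_a", (i, p)))  # stream an A tile on-chip
                sched.append(("load_b", (p, j)))  # stream a B tile on-chip
                sched.append(("mac", ()))         # multiply-accumulate tiles
            sched.append(("store_c", (i, j)))     # stream the result tile out
    return sched

def execute(sched, A, B, C, t):
    """Walk the schedule; the scratchpad holds only what was streamed in."""
    scratch = {}
    for op, args in sched:
        if op == "alloc_acc":
            scratch["acc"] = [[0.0] * t for _ in range(t)]
        elif op == "load_a":
            i, p = args
            scratch["a"] = [row[p:p + t] for row in A[i:i + t]]
        elif op == "load_b":
            p, j = args
            scratch["b"] = [row[j:j + t] for row in B[p:p + t]]
        elif op == "mac":
            a, b, acc = scratch["a"], scratch["b"], scratch["acc"]
            for x in range(t):
                for y in range(t):
                    acc[x][y] += sum(a[x][z] * b[z][y] for z in range(t))
        elif op == "store_c":
            i, j = args
            for x in range(t):
                for y in range(t):
                    C[i + x][j + y] = scratch["acc"][x][y]

A = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
B = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]  # identity
C = [[0.0] * 4 for _ in range(4)]
execute(compile_matmul_schedule(4, 4, 4, t=2), A, B, C, t=2)
assert C == A  # multiplying by the identity returns A
```

The point of the sketch is the division of labor: the schedule (what moves, when) is a compile-time artifact, so the execution side needs neither tags, replacement policy, nor coherence machinery.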
What carries the argument
Orchestrated dataflow architecture with compiler-architecture co-design managing data streams at tensor granularity to eliminate most caching.
Load-bearing premise
The compiler-architecture co-design can orchestrate data movement and scheduling for diverse and changing AI workloads without excessive overhead or sacrificing broad applicability.
What would settle it
A benchmark comparison on a newly released AI model for autonomous driving or LLM inference where M100 fails to show higher utilization or requires significantly more compilation time than a GPGPU baseline.
read the original abstract
As deep learning-based AI technologies gain momentum, the demand for general-purpose AI computing architectures continues to grow. While GPGPU-based architectures offer versatility for diverse AI workloads, they often fall short in efficiency and cost-effectiveness. Various Domain-Specific Architectures (DSAs) excel at particular AI tasks but struggle to extend across broader applications or adapt to the rapidly evolving AI landscape. M100 is Li Auto's response: a performant, cost-effective architecture for AI inference in Autonomous Driving (AD), Large Language Models (LLMs), and intelligent human interactions, domains crucial to today's most competitive automobile platforms. M100 employs a dataflow parallel architecture, where compiler-architecture co-design orchestrates not only computation but, more critically, data movement across time and space. Leveraging dataflow computing efficiency, our hardware-software co-design improves system performance while reducing hardware complexity and cost. M100 largely eliminates caching: tensor computations are driven by compiler- and runtime-managed data streams flowing between computing elements and on/off-chip memories, yielding greater efficiency and scalability than cache-based systems. Another key principle was selecting the right operational granularity for scheduling, issuing, and execution across compiler, firmware, and hardware. Recognizing commonalities in AI workloads, we chose the tensor as the fundamental data element. M100 demonstrates general AI computing capability across diverse inference applications, including UniAD (for AD) and LLaMA (for LLMs). Benchmarks show M100 outperforms GPGPU architectures in AD applications with higher utilization, representing a promising direction for future general AI computing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces M100, a dataflow parallel architecture developed by Li Auto for general-purpose AI inference targeting autonomous driving (UniAD), large language models (LLaMA), and intelligent human interactions. It describes a compiler-architecture co-design that orchestrates data movement across time and space using compiler- and runtime-managed streams instead of caches, selects the tensor as the fundamental scheduling granularity, and claims this yields higher performance, utilization, and scalability than GPGPU architectures while reducing hardware complexity.
Significance. If the performance claims and generality hold, M100 would represent a meaningful contribution to AI accelerator design by demonstrating how dataflow orchestration and tensor-level co-design can bridge the gap between versatile but inefficient GPGPUs and narrow DSAs. The approach of replacing caches with managed streams and using tensor granularity addresses well-known inefficiencies in data movement for AI workloads and could inform future hardware-software co-design efforts in automotive and LLM domains.
major comments (2)
- [Abstract] The central claim that "Benchmarks show M100 outperforms GPGPU architectures in AD applications with higher utilization" is presented without quantitative results, baselines, utilization percentages, methodology, error bars, or comparison details, rendering the performance advantage impossible to evaluate or reproduce.
- [Abstract] The architecture description asserts that compiler- and runtime-managed data streams enable general AI computing across AD and LLMs, but provides no analysis of how static tensor scheduling and stream orchestration handle dynamic, data-dependent control flow in LLM inference (autoregressive generation, variable attention patterns) without prohibitive overhead or fallback mechanisms, which directly undermines the generality claim.
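The second objection can be made concrete with a minimal sketch (hypothetical Python of ours, not M100 code): in autoregressive decoding, both the loop trip count and each step's attention extent depend on generated data, so a fully static schedule cannot fix them at compile time.

```python
# Minimal illustration of the dynamic-control-flow problem (hypothetical
# sketch, not M100 code). The decode loop's trip count depends on generated
# data, and each step's attention reads a sequence one token longer than the
# last -- so tensor shapes and iteration counts are not compile-time constants.

def decode(model_step, prompt, eos, max_len):
    """Greedy autoregressive decode with a data-dependent exit."""
    seq = list(prompt)
    while len(seq) < max_len:
        nxt = model_step(seq)  # this step's tensor shapes depend on len(seq)
        seq.append(nxt)
        if nxt == eos:         # exit condition unknowable at compile time
            break
    return seq

# Toy "model": next token is the running sum mod 10; token 0 acts as EOS.
toy = lambda seq: sum(seq) % 10

assert decode(toy, prompt=[5, 5], eos=0, max_len=16) == [5, 5, 0]  # stops early
assert len(decode(toy, prompt=[3, 4], eos=0, max_len=16)) == 16    # runs to cap
```

Two prompts of the same length take 1 and 14 decode steps respectively, which is exactly the variability the referee asks the manuscript to account for.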
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.
read point-by-point responses
- Referee: [Abstract] The central claim that "Benchmarks show M100 outperforms GPGPU architectures in AD applications with higher utilization" is presented without quantitative results, baselines, utilization percentages, methodology, error bars, or comparison details, rendering the performance advantage impossible to evaluate or reproduce.
  Authors: We agree that the abstract claim would be stronger with supporting quantitative context. The full manuscript provides detailed benchmarks in the evaluation sections, including comparisons to GPGPU baselines for UniAD workloads, utilization rates, and methodology. To improve evaluability, we will revise the abstract to include concise quantitative highlights (e.g., relative performance gains and utilization improvements) drawn from those results. revision: yes
- Referee: [Abstract] The architecture description asserts that compiler- and runtime-managed data streams enable general AI computing across AD and LLMs, but provides no analysis of how static tensor scheduling and stream orchestration handle dynamic, data-dependent control flow in LLM inference (autoregressive generation, variable attention patterns) without prohibitive overhead or fallback mechanisms, which directly undermines the generality claim.
  Authors: The referee correctly notes the absence of explicit analysis of dynamic control flow. While the runtime-managed streams are intended to provide flexibility beyond purely static scheduling, the manuscript does not detail overheads for autoregressive generation or variable attention. We will add a dedicated subsection in the architecture or evaluation section analyzing these mechanisms, including how stream reconfiguration supports data-dependent patterns in LLaMA inference and the associated costs. revision: yes
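The mechanism the authors promise to analyze can be sketched in miniature (a hypothetical construction of ours, not M100's actual runtime): the compiler fixes a stream's access pattern, while the runtime patches its extent just before issue, so one descriptor per decode step absorbs the growing attention span.

```python
# Hypothetical sketch of runtime-managed stream reconfiguration (not M100's
# actual runtime). The compiler fixes the access *pattern* (base, stride);
# the runtime fills in the *extent* (current sequence length) at issue time.

from dataclasses import dataclass

@dataclass
class StreamDescriptor:
    base: int    # start offset into the key cache (compile-time)
    length: int  # number of entries to stream; patched at run time
    stride: int  # fixed access pattern (compile-time)

def issue(desc, kcache):
    """'Hardware' side: consume a descriptor and stream that many entries."""
    return [kcache[desc.base + i * desc.stride] for i in range(desc.length)]

kcache, extents = [], []
for key in [0.1, 0.2, 0.3, 0.4]:  # toy decode loop: one new key per step
    kcache.append(key)
    # Runtime patches only the length; everything else was fixed statically.
    desc = StreamDescriptor(base=0, length=len(kcache), stride=1)
    extents.append(len(issue(desc, kcache)))

assert extents == [1, 2, 3, 4]  # the stream extent grew with the data
```

Whether this patch-at-issue step is cheap enough at LLM decode rates is precisely the overhead question the promised subsection would need to quantify.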
Circularity Check
No circularity: architecture paper lacks derivations or fitted predictions
full rationale
The paper is an architectural description of M100's dataflow design, compiler co-design, tensor granularity, and stream-based memory management. It presents no equations, no fitted parameters, no predictions derived from inputs, and no self-citation chains that reduce claims to prior work by the same authors. Performance assertions rest on unspecified benchmarks for UniAD and LLaMA rather than any self-referential construction. The central claims are therefore independent of the circularity patterns enumerated in the instructions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: AI workloads share commonalities best captured by tensor operations as the fundamental scheduling unit
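The axiom can be illustrated with a small sketch (our construction, not the paper's code): superficially different layers all lower to one tensor-contraction primitive, which is what makes a single scheduling granularity plausible.

```python
# Illustrative sketch of the ledger's axiom (our construction, not the
# paper's): an MLP layer, a dot-product attention score, and a 1-D
# convolution all lower to one tensor-contraction primitive. If diverse
# workloads reduce to this shape, scheduling at tensor granularity is a
# plausible common denominator.

def matmul(a, b):
    """The single primitive the toy 'architecture' would schedule."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def mlp(x, w):
    """Dense layer: already a matmul."""
    return matmul(x, w)

def attention_scores(q, k):
    """Scores = Q @ K^T: a matmul after a transpose."""
    return matmul(q, [list(col) for col in zip(*k)])

def conv1d(signal, kernel):
    """im2col lowering: unfold sliding windows, then matmul."""
    kw = len(kernel)
    windows = [signal[i:i + kw] for i in range(len(signal) - kw + 1)]
    return [row[0] for row in matmul(windows, [[w] for w in kernel])]

assert mlp([[1, 2]], [[1, 0], [0, 1]]) == [[1, 2]]
assert attention_scores([[1, 0]], [[1, 2], [3, 4]]) == [[1, 3]]
assert conv1d([1, 2, 3, 4], [1, 1]) == [3, 5, 7]
```

The counterexamples that stress this axiom are the non-contraction parts of real models (sorting, gather/scatter, control flow), which is where the generality claim gets tested.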
Reference graph
Works this paper leans on
- [1] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter et al., "π0: A vision-language-action flow model for general robot control," arXiv preprint arXiv:2410.24164, 2024.
- [2] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu et al., "RT-1: Robotics transformer for real-world control at scale," arXiv preprint arXiv:2212.06817, 2022.
- [3] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid et al., "RT-2: Vision-language-action models transfer web knowledge to robotic control," in Conference on Robot Learning. PMLR, 2023, pp. 2165–2183.
- [4] H. Wu, Y. Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong, "Unleashing large-scale video generative pre-training for visual robot manipulation," arXiv preprint arXiv:2312.13139, 2023.
- [5] C.-L. Cheang, G. Chen, Y. Jing, T. Kong, H. Li, Y. Li, Y. Liu, H. Wu, J. Xu, Y. Yang et al., "GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation," arXiv preprint arXiv:2410.06158, 2024.
- [6] L. S. Karumbunathan, "NVIDIA Jetson AGX Orin Series: A Giant Leap Forward for Robotics and Edge AI Applications," Technical Brief, https://www.nvidia.com/content/dam/en-zz/Solutions/gtcf21/jetson-orin/nvidia-jetson-agx-orin-technical-brief.pdf, Jul. 2022, accessed 2025-08-21.
- [7] NVIDIA Corporation, "NVIDIA Jetson Thor," 2025, accessed 2025-08-21. [Online]. Available: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-thor/
- [8] P. Bannon, G. Venkataramanan, D. D. Sarma, and E. Talpes, "Computer and redundancy solution for the full self-driving computer," in 2019 IEEE Hot Chips 31 Symposium (HCS), 2019, pp. 1–22.
- [9] J.-S. Hwang, "Samsung to make Tesla's HW 4.0 self-driving auto chip," https://www.kedglobal.com/semiconductors/newsView/ked202109230009, 2023, accessed 2025-08-21.
- [10] C. Agatie, "Elon Musk reveals the first details about Hardware 5 Autopilot computer and sensors," https://www.autoevolution.com/news/elon-musk-reveals-the-first-details-about-hardware-5-autopilot-computer-and-sensors-235405.html, 2024, accessed 2025-08-21.
- [11] J. B. Dennis, "Data flow supercomputers," Computer, vol. 13, no. 11, pp. 48–56, 1980.
- [12] W. A. Najjar, E. A. Lee, and G. R. Gao, "Advances in the dataflow computational model," Parallel Computing, vol. 25, no. 13–14, pp. 1907–1929, 1999.
- [13] D. Abts, J. Ross, J. Sparling, M. Wong-VanHaren, M. Baker, T. Hawkins, A. Bell, J. Thompson, T. Kahsai, G. Kimmell, J. Hwang, R. Leslie-Hurd, M. Bye, E. Creswick, M. Boyd, M. Venigalla, E. Laforge, J. Purdy, P. Kamath, D. Maheshwari, M. Beidler, G. Rosseel, O. Ahmad, G. Gagarin, R. Czekalski, A. Rane, S. Parmar, J. Werner, J. Sproch, A. Macias, and B. Ku..., "Think fast: A tensor streaming processor (TSP) for accelerating deep learning workloads," 2020.
- [14] D. Abts, G. Kimmell, A. Ling, J. Kim, M. Boyd, A. Bitar, S. Parmar, I. Ahmed, R. DiCecco, D. Han, J. Thompson, M. Bye, J. Hwang, J. Fowers, P. Lillian, A. Murthy, E. Mehtabuddin, C. Tekur, T. Sohmers, K. Kang, S. Maresh, and J. Ross, "A software-defined tensor streaming multiprocessor for large-scale machine learning," in Proceedings of the 49th Annual In...
- [15] R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao, S. Hadjis, A. Pedram, C. Kozyrakis, and K. Olukotun, "Plasticine: A reconfigurable architecture for parallel patterns," ACM SIGARCH Computer Architecture News, vol. 45, no. 2, pp. 389–402, 2017.
- [16] R. Prabhakar and S. Jairath, "SambaNova SN10 RDU: Accelerating Software 2.0 with dataflow," in 2021 IEEE Hot Chips 33 Symposium (HCS). IEEE, 2021, pp. 1–37.
- [17] R. Prabhakar, R. Sivaramakrishnan, D. Gandhi, Y. Du, M. Wang, X. Song, K. Zhang, T. Gao, A. Wang, X. Li et al., "SambaNova SN40L: Scaling the AI memory wall with dataflow and composition of experts," in 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2024, pp. 1353–1366.
- [18] S. Lie, "Wafer-Scale AI: GPU Impossible Performance," in 2024 IEEE Hot Chips 36 Symposium (HCS). Los Alamitos, CA, USA: IEEE Computer Society, Aug. 2024, pp. 1–71. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/HCS61935.2024.10664673
- [19] S. Lie, "Cerebras architecture deep dive: First look inside the hardware/software co-design for deep learning," IEEE Micro, vol. 43, no. 3, pp. 18–30, 2023.
- [20] L. Gwennap, "Tenstorrent scales AI performance: Architecture leads in data-center power efficiency," Microprocessor Report, Tech. Rep., Apr. 2020.
- [21] J. Vasiljevic and D. Capalija, "Blackhole & TT-Metalium: The standalone AI computer and its programming model," in 2024 IEEE Hot Chips 36 Symposium (HCS). IEEE Computer Society, Los Alamitos, CA, USA, 2024, pp. 1–30.
- [22] E. Talpes, D. D. Sarma, D. Williams, S. Arora, T. Kunjan, B. Floering, A. Jalote, C. Hsiong, C. Poorna, V. Samant et al., "The microarchitecture of Dojo, Tesla's exa-scale computer," IEEE Micro, vol. 43, no. 3, pp. 31–39, 2023.
- [23] A. Rico, S. Pareek, J. Cabezas, D. Clarke, B. Ozgul, F. Barat, Y. Fu, S. Münz, D. Stuart, P. Schlangen et al., "AMD XDNA™ NPU in Ryzen™ AI processors," IEEE Micro, 2024.
- [24] N. Perryman, C. Wilson, and A. George, "Evaluation of Xilinx Versal architecture for next-gen edge computing in space," in 2023 IEEE Aerospace Conference. IEEE, 2023, pp. 1–11.
- [25] O. Moreira, A. Yousefzadeh, F. Chersi, A. Kapoor, R.-J. Zwartenkot, P. Qiao, G. Cinserin, M. A. Khoei, M. Lindwer, and J. Tapson, "NeuronFlow: A hybrid neuromorphic–dataflow processor architecture for AI workloads," in 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS). IEEE, 2020, pp. 1–5.
- [26] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., "In-datacenter performance analysis of a tensor processing unit," in Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 1–12.
- [27] N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles et al., "TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings," in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–14.
- [28] A. Firoozshahian, J. Coburn, R. Levenstein, R. Nattoji, A. Kamath, O. Wu, G. Grewal, H. Aepala, B. Jakka, B. Dreyer et al., "MTIA: First generation silicon targeting Meta's recommendation systems," in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–13.
- [29] J. Coburn, C. Tang, S. A. Asal, N. Agrawal, R. Chinta, H. Dixit, B. Dodds, S. Dwarakapuram, A. Firoozshahian, C. Gao et al., "Meta's second generation AI chip: Model-chip co-design and productionization experiences," in Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 1689–1702.
- [30] V. Baumgarte, G. Ehlers, F. May, A. Nückel, M. Vorbach, and M. Weinhardt, "PACT XPP—a self-reconfigurable data processing architecture," The Journal of Supercomputing, vol. 26, no. 2, pp. 167–184, 2003.
- [31] V. Govindaraju, C.-H. Ho, and K. Sankaralingam, "Dynamically specialized datapaths for energy efficient computing," in 2011 IEEE 17th International Symposium on High Performance Computer Architecture. IEEE, 2011, pp. 503–514.
- [32] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho, "MorphoSys: An integrated reconfigurable system for data-parallel and computation-intensive applications," IEEE Transactions on Computers, vol. 49, no. 5, pp. 465–481, 2000.
- [33] J. Nickolls and W. J. Dally, "The GPU computing era," IEEE Micro, vol. 30, no. 2, pp. 56–69, 2010.
- [34]
- [35] J. Suettlerlein, S. Zuckerman, and G. R. Gao, "An implementation of the codelet model," in European Conference on Parallel Processing. Springer, 2013, pp. 633–644.
- [36] K. B. Theobald, "EARTH: An efficient architecture for running threads," thesis, 1999.
- [37] Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang et al., "Planning-oriented autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17853–17862.
- [38] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.