AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
Pith reviewed 2026-05-07 14:11 UTC · model grok-4.3
The pith
A memory-centric multi-chiplet architecture replaces GPU dies with HBM-PNM cubes to cut attention latency for million-token contexts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AMMA replaces GPU compute dies with HBM-PNM cubes to roughly double memory bandwidth for decode-phase attention. A logic-die microarchitecture exploits per-cube internal bandwidth under tight power and area limits, while a two-level hybrid parallelism scheme and reordered collective flow cut intra-chip communication costs. Design-space exploration over compute power and die-to-die link bandwidth guides hardware choices. The result is 15.5 times lower attention latency and 6.9 times lower energy use than a standard GPU baseline on long-context workloads.
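The memory-bound claim can be sanity-checked with a roofline-style estimate: if decode attention streams the whole KV cache once per generated token, per-token latency is roughly KV bytes divided by bandwidth, so doubling bandwidth halves it. A minimal sketch, with every model and hardware number assumed for illustration (none are taken from the paper):

```python
# Roofline-style estimate: decode attention is memory-bound, so per-token
# latency ~ bytes of K and V streamed / available bandwidth.
# All parameters are illustrative assumptions, not figures from the paper.

def decode_attention_latency_s(context_tokens, n_kv_heads, head_dim,
                               bytes_per_elem, bandwidth_gbps):
    """Seconds to stream K and V once at the given bandwidth (GB/s)."""
    kv_bytes = 2 * context_tokens * n_kv_heads * head_dim * bytes_per_elem
    return kv_bytes / (bandwidth_gbps * 1e9)

# 1M-token context, 8 KV heads (GQA-style), head dim 128, FP16 (2 bytes).
t_base    = decode_attention_latency_s(1_000_000, 8, 128, 2, 3350)  # ~H100-class HBM
t_doubled = decode_attention_latency_s(1_000_000, 8, 128, 2, 6700)  # ~2x bandwidth
print(f"baseline: {t_base*1e3:.2f} ms, doubled bandwidth: {t_doubled*1e3:.2f} ms")
```

Under this first-order model the bandwidth doubling translates directly into a 2x latency cut; the paper's larger 15.5x figure would have to come from the parts of the GPU baseline this model ignores (utilization losses, communication, and scheduling overheads).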
What carries the argument
HBM-PNM cubes paired with a custom logic-die microarchitecture that routes decode attention directly to internal memory bandwidth, supported by two-level hybrid parallelism and reordered collective flow to minimize die-to-die traffic.
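One plausible reading of the two-level scheme (an assumption for illustration, not necessarily the paper's exact design) is to shard KV heads across chips and the sequence across cubes within a chip, then merge per-cube partials with a flash-decoding-style log-sum-exp reduction, so the inter-cube collective carries only one vector and two scalars per shard:

```python
import numpy as np

# Hypothetical sketch of sequence-sharded decode attention: each "cube"
# attends over its slice of the KV cache, and partials merge exactly via
# log-sum-exp rescaling. Scheme and names are illustrative, not AMMA's.

def partial_attention(q, K, V):
    """One cube: attention over its sequence shard; returns
    (unnormalized output, softmax normalizer, running max) for merging."""
    s = K @ q                       # scores for this shard
    m = s.max()
    w = np.exp(s - m)
    return w @ V, w.sum(), m

def merge(partials):
    """Combine shard partials into the exact global attention output."""
    m_all = max(m for _, _, m in partials)
    num = sum(o * np.exp(m - m_all) for o, _, m in partials)
    den = sum(z * np.exp(m - m_all) for _, z, m in partials)
    return num / den

rng = np.random.default_rng(0)
L, d, cubes = 64, 8, 4
q = rng.normal(size=d)
K, V = rng.normal(size=(L, d)), rng.normal(size=(L, d))

# Reference: monolithic attention over the full sequence.
s = K @ q
p = np.exp(s - s.max())
ref = (p @ V) / p.sum()

# Sharded: each cube handles a strided slice of the rows, then merge.
out = merge([partial_attention(q, K[i::cubes], V[i::cubes]) for i in range(cubes)])
assert np.allclose(out, ref)
```

The merge is exact rather than approximate, which is why such a decomposition can keep latency proportional to per-cube bandwidth without changing the model's output.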
If this is right
- Decode attention for 1M-token contexts becomes feasible at interactive latencies on memory-centric hardware.
- Power and die area currently spent on idle compute units in GPUs can be reallocated to more memory bandwidth.
- Hardware designers gain concrete guidance on per-cube compute power and intra-chip link bandwidth trade-offs.
- Serving systems can shift from GPU-centric to memory-centric designs without losing attention performance.
Where Pith is reading between the lines
- The same memory-centric approach could apply to other memory-bound phases such as KV cache management or retrieval-augmented generation.
- Future chiplet stacks might generalize the two-level parallelism pattern to additional AI primitives beyond attention.
- If the bandwidth-to-performance translation holds, it suggests a broader shift away from compute-rich dies for inference workloads.
Load-bearing premise
The proposed logic-die microarchitecture, hybrid parallelism, and reordered flow can turn the extra per-cube bandwidth into proportional latency and energy reductions without large unmodeled overheads once built in silicon.
What would settle it
Fabricate a prototype of the AMMA chiplet stack, run 1M-token decode attention workloads on it, and compare measured latency and energy directly against an equivalent GPU baseline under identical conditions.
Original abstract
All current LLM serving systems place the GPU at the center, from production-level attention-FFN disaggregation to NVIDIA's Rubin GPU-LPU heterogeneous platform. Even academic PIM/PNM proposals still treat the GPU as the central hub for cross-device communication. Yet the GPU's compute-rich architecture is fundamentally mismatched with the memory-bound nature of decode-phase attention, inflating serving latency while wasting power and die area on idle compute units. The problem is compounded as reasoning and agentic workloads push context lengths toward one million tokens, making attention latency the primary user-facing bottleneck. To address these inefficiencies, we present AMMA, a multi-chiplet, memory-centric architecture for low-latency long-context attention. AMMA replaces GPU compute dies with HBM-PNM cubes, roughly doubling the available memory bandwidth to better serve memory-bound attention workloads. To translate this bandwidth into proportional performance gains, we introduce (i) a logic-die microarchitecture that fully exploits per-cube internal bandwidth for decode attention under a minimal power and area budget, (ii) a two-level hybrid parallelism scheme, and (iii) a reordered collective flow that reduces intra-chip die-to-die communication overhead. We further conduct a design-space exploration over per-cube compute power and intra-chip D2D link bandwidth, providing actionable guidance for hardware designers. Evaluations show that AMMA achieves 15.5X lower attention latency and 6.9X lower energy consumption compared with the NVIDIA H100.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AMMA, a multi-chiplet memory-centric architecture for low-latency 1M-context attention serving in LLMs. It replaces GPU compute dies with HBM-PNM cubes to roughly double per-cube memory bandwidth, introduces a logic-die microarchitecture optimized for decode attention under tight power/area budgets, a two-level hybrid parallelism scheme, and a reordered collective flow to minimize intra-chip D2D communication. A design-space exploration over per-cube compute power and D2D link bandwidth is performed, with evaluations claiming 15.5X lower attention latency and 6.9X lower energy consumption versus the NVIDIA H100.
Significance. If the performance claims hold after validation, the work would be significant for computer architecture and LLM systems research by directly targeting the memory-bound nature of long-context decode attention rather than relying on GPU-centric designs. The explicit design-space exploration over compute power and intra-chip D2D bandwidth provides actionable guidance for hardware designers and is a clear strength. The approach of using HBM-PNM cubes with custom logic dies offers a concrete alternative to current disaggregated serving systems.
major comments (1)
- [Evaluation] Evaluation section: The headline claims of 15.5X latency reduction and 6.9X energy reduction versus H100 rest on the logic-die microarchitecture, two-level hybrid parallelism, and reordered collective flow converting doubled HBM bandwidth into proportional gains. However, the design-space exploration does not appear to include stress-testing against measured multi-chiplet D2D link latencies, control logic costs, or utilization losses at scale; without these, the proportional gains remain projections rather than demonstrated outcomes.
minor comments (1)
- [Abstract] Abstract: The quantitative claims are presented without any reference to workloads, baseline configurations, or simulation methodology; adding a single sentence summarizing these would improve clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the significance of targeting memory-bound decode attention with a memory-centric multi-chiplet design. We address the single major comment below with clarifications on our evaluation methodology and modeling choices.
Point-by-point responses
Referee: [Evaluation] Evaluation section: The headline claims of 15.5X latency reduction and 6.9X energy reduction versus H100 rest on the logic-die microarchitecture, two-level hybrid parallelism, and reordered collective flow converting doubled HBM bandwidth into proportional gains. However, the design-space exploration does not appear to include stress-testing against measured multi-chiplet D2D link latencies, control logic costs, or utilization losses at scale; without these, the proportional gains remain projections rather than demonstrated outcomes.
Authors: We thank the referee for this observation. Our design-space exploration (Section 5) systematically varies intra-chip D2D link bandwidth over a wide range while holding other parameters fixed; this directly exercises the sensitivity of end-to-end latency and energy to communication performance and serves as a proxy for link-latency stress testing. The logic-die microarchitecture is deliberately minimal, with control-logic area and power explicitly budgeted and subtracted from the per-cube envelope, so these overheads are not ignored. Utilization is evaluated at the full 1M-token scale under the two-level hybrid parallelism and reordered collective schedule; the reported speedups already reflect achieved utilization after all communication and synchronization costs. Because AMMA is a forward-looking proposal, we rely on validated cycle-accurate simulation rather than silicon measurements of future D2D links. We will add a short subsection in the revised manuscript that explicitly tabulates the control-logic overhead assumptions and includes an additional sensitivity sweep on utilization under pessimistic D2D latency assumptions.
Revision: partial
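The kind of sensitivity sweep the response promises can be sketched with a toy cost model: per-step time as KV-streaming time plus a collective term that depends on D2D bandwidth and per-hop latency. Every number below is an illustrative assumption, not a value from the paper:

```python
# Toy sensitivity sweep over intra-chip D2D link bandwidth. Model:
# step time = KV streaming time + collective transfer time + hop latency.
# All parameters are illustrative assumptions, not the paper's.

def step_time_ms(kv_gb, hbm_tbps, collective_gb, d2d_gbps,
                 d2d_latency_us, hops):
    mem_ms = kv_gb / hbm_tbps                     # GB / (TB/s) -> ms
    comm_ms = collective_gb / d2d_gbps * 1e3      # GB / (GB/s) -> ms
    comm_ms += hops * d2d_latency_us / 1e3        # per-hop link latency
    return mem_ms + comm_ms

for d2d in (200, 400, 800, 1600):                 # GB/s per link (assumed)
    t = step_time_ms(kv_gb=4.1, hbm_tbps=6.7, collective_gb=0.05,
                     d2d_gbps=d2d, d2d_latency_us=0.5, hops=3)
    print(f"D2D {d2d:4d} GB/s -> step {t:.3f} ms")
```

Sweeping the link bandwidth shows where the collective term stops being negligible relative to memory streaming; a pessimistic-latency variant of the same sweep is what the promised revision would add.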
Circularity Check
No circularity: performance claims arise from design-space exploration of the proposed architecture, not from self-referential definitions or fitted inputs.
Full rationale
The manuscript presents AMMA as a hardware architecture proposal whose latency and energy improvements versus H100 are obtained via design-space exploration over compute power and D2D bandwidth. No equations, fitted parameters, or self-citations appear as load-bearing steps that reduce the headline claims to their own inputs by construction. The central results are simulation outcomes of the described logic-die microarchitecture, hybrid parallelism, and reordered collectives; they are not tautological renamings or predictions forced by prior self-citations. External comparison to a commercial GPU further keeps the evaluation independent of the paper's own definitions.