SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison
Pith reviewed 2026-05-08 18:48 UTC · model grok-4.3
The pith
SPEC CPU2026 increases instruction volume, memory footprint, and instruction-cache pressure compared to SPEC CPU2017, while compact subsets preserve most of its behavior.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We find that, compared to SPEC CPU2017, SPEC CPU2026 increases instruction volume and memory footprint, and shifts pressure toward emerging bottlenecks, most notably higher instruction-cache stress. Using clustering-based representativeness analysis, we identify that compact subsets of 4-5 workloads per group preserve 96.4-99.9% of full-suite behavior, substantially reducing evaluation costs without sacrificing fidelity. SPEC CPU2026 remains a general-purpose suite with complementary characteristics to MLPerf and DCPerf, yet moves closer to real-world CPU behavior than prior generations.
What carries the argument
Clustering-based representativeness analysis on microarchitectural metrics measured across nine platforms, which identifies minimal workload subsets and supports direct cross-suite comparisons.
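The pith does not specify the clustering procedure itself. A minimal sketch of one standard way such subsetting is done — k-means over z-score-normalized per-workload counter vectors, taking the real workload nearest each centroid as the representative — is below; the choice of counters, the farthest-first initialization, and all data are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def pick_representatives(metrics, k, n_iter=50):
    """Cluster workloads by their microarchitectural metric vectors and
    return (representative indices, cluster labels).

    metrics: shape (n_workloads, n_metrics); columns might be IPC,
    L1I MPKI, LLC MPKI, branch MPKI (a hypothetical choice of counters).
    """
    metrics = np.asarray(metrics, dtype=float)
    # z-score normalize each metric so no single counter dominates distances
    z = (metrics - metrics.mean(axis=0)) / (metrics.std(axis=0) + 1e-12)
    # deterministic farthest-first initialization of k centroids
    centroids = [z[0]]
    while len(centroids) < k:
        d = np.min([np.linalg.norm(z - c, axis=1) for c in centroids], axis=0)
        centroids.append(z[d.argmax()])
    centroids = np.array(centroids)
    # Lloyd's k-means iterations
    for _ in range(n_iter):
        dist = np.linalg.norm(z[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centroids[c] = z[labels == c].mean(axis=0)
    # representative = the real workload closest to each final centroid
    dist = np.linalg.norm(z[:, None, :] - centroids[None, :, :], axis=2)
    reps = sorted({int(dist[:, c].argmin()) for c in range(k)})
    return reps, labels
```

The subset size k per group (4-5 in the paper) is exactly the free parameter the ledger lists.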
Load-bearing premise
The nine chosen platforms adequately represent the diversity of modern CPU microarchitectures, and the clustering analysis on microarchitectural metrics accurately identifies subsets that preserve all relevant behaviors without missing critical interactions or outliers.
What would settle it
A new processor platform outside the nine shows substantially different ranking or bottleneck behavior when running the proposed compact subsets versus the full suite, or a previously unmeasured metric like energy per instruction deviates sharply from the reported trends.
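The falsification test above — a new platform ranking designs differently under the compact subsets than under the full suite — can be operationalized as a rank-correlation check. A self-contained sketch using Kendall's tau on invented speedup numbers (the scores and the tau threshold are assumptions, not from the paper):

```python
def kendall_tau(a, b):
    """Kendall rank correlation between two equal-length score lists
    (no tie correction, which suffices for distinct scores)."""
    assert len(a) == len(b) and len(a) >= 2
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical speedups of four design points, measured two ways:
full_suite = [1.00, 1.12, 1.31, 1.18]   # full-suite score
subset     = [1.01, 1.22, 1.28, 1.20]   # 4-5 workload subset score
tau = kendall_tau(full_suite, subset)    # 1.0 = identical ranking
```

A tau well below 1.0 on a platform outside the nine — the single swapped pair here already drops it to 2/3 — would be the kind of ranking divergence described above.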
Original abstract
Specialized accelerators dominate AI workloads, but CPUs remain critical for orchestrating these accelerators and running datacenter services. As a result, CPU performance increasingly shapes end-to-end system efficiency, making it necessary for benchmarks to reflect modern workloads and bottlenecks. However, it remains unclear how emerging CPU benchmark suites reflect these shifts. To address this, we present the first comprehensive characterization of SPEC CPU2026 across nine platforms spanning recent Intel, AMD, Ampere, and Nvidia processors. We find that, compared to SPEC CPU2017, SPEC CPU2026 increases instruction volume and memory footprint, and shifts pressure toward emerging bottlenecks, most notably higher instruction-cache stress. We next examine whether the full suite is necessary for architectural evaluation. Using clustering-based representativeness analysis, we identify that compact subsets of 4-5 workloads per group preserve 96.4-99.9% of full-suite behavior, substantially reducing evaluation costs without sacrificing fidelity. To better position SPEC CPU2026, we compare it against SPEC CPU2017, DCPerf, and MLPerf using cross-suite microarchitectural metrics. SPEC CPU2026 remains a general-purpose suite with complementary characteristics: it is less vector-intensive than MLPerf and has lower frontend pressure than DCPerf, yet moves closer to real-world CPU behavior than prior SPEC CPU generations. Finally, we show that SPEC CPU2026 supports practical architectural studies beyond aggregate scores through case studies on page sizes and allocators, prefetching, compiler optimizations, ISA sensitivity, and many-core scaling. The new round-robin stagger mode generates proxy workloads that approximate DCPerf, reducing the IPC gap to 13.7%. Overall, SPEC CPU2026 sets a new foundation for rigorous and cost-effective CPU evaluation.
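The 13.7% figure in the abstract is a relative IPC gap between the stagger-mode proxy and the DCPerf target. The abstract does not define the gap formula; the conventional definition, with invented IPC values chosen only to land near that headline number, is simply:

```python
def ipc_gap(proxy_ipc, target_ipc):
    """Relative IPC gap of a proxy workload versus its target, in percent."""
    return 100.0 * abs(proxy_ipc - target_ipc) / target_ipc

# Hypothetical example: a DCPerf-like target at 0.80 IPC and a
# round-robin stagger-mode proxy at 0.69 IPC.
gap = ipc_gap(0.69, 0.80)  # 13.75%
```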
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript provides the first comprehensive characterization of SPEC CPU2026 across nine platforms spanning recent Intel, AMD, Ampere, and Nvidia processors. Compared to SPEC CPU2017, it reports increased instruction volume and memory footprint with a shift toward higher instruction-cache stress. Clustering analysis identifies compact subsets of 4-5 workloads per group that preserve 96.4-99.9% of full-suite behavior. Cross-suite comparisons with SPEC CPU2017, DCPerf, and MLPerf position SPEC CPU2026 as a general-purpose suite with complementary characteristics (less vector-intensive than MLPerf, lower frontend pressure than DCPerf). Case studies demonstrate utility for studies on page sizes, allocators, prefetching, compiler optimizations, ISA sensitivity, and many-core scaling, including a round-robin stagger mode whose proxy workloads approximate DCPerf, reducing the IPC gap to 13.7%.
Significance. If the empirical results hold, this work is significant for the computer architecture community by updating benchmark understanding for modern CPU workloads that orchestrate accelerators and run datacenter services. The multi-platform measurements provide directional evidence on workload shifts, and the clustering-derived subsets could substantially lower evaluation costs while maintaining high fidelity. The cross-suite positioning and practical case studies offer actionable guidance for benchmark selection and architectural studies. Strengths include direct execution measurements on standard benchmarks across diverse vendors and the introduction of a new stagger mode for proxy workloads.
major comments (2)
- [§3 and Abstract] Platform selection: The characterization relies on nine platforms (Intel, AMD, Ampere, Nvidia) to support claims of increased instruction volume, memory footprint, and higher i-cache stress versus CPU2017. However, this selection may under-sample axes such as vector-width variation or novel cache-coherence protocols, and platform coverage is load-bearing for the generalizability of the 'shifts pressure toward emerging bottlenecks' claim; the paper should add explicit discussion of platform limitations and sensitivity tests.
- [Abstract and §4] Clustering-based representativeness analysis: The central cost-effectiveness claim rests on subsets of 4-5 workloads preserving 96.4-99.9% of full-suite behavior via clustering on microarchitectural metrics. The abstract presents these fidelity numbers without error bars, statistical details, or a full methodology (e.g., chosen metrics, algorithm parameters, or outlier validation), leaving it unclear whether all relevant interactions are captured; this requires expansion to verify the representativeness result.
minor comments (2)
- [Abstract] Include the exact list of nine platforms and any confidence intervals or a methodology summary for the 96.4-99.9% figures to improve clarity and allow immediate assessment of the fidelity claims.
- [Cross-suite comparison] Specify the precise set of microarchitectural metrics used for positioning against DCPerf and MLPerf to enhance reproducibility.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the work's significance and for the constructive major comments. We address each point below and will revise the manuscript accordingly to improve clarity and rigor.
Point-by-point responses
Referee: [§3 and Abstract] Platform selection: The characterization relies on nine platforms (Intel, AMD, Ampere, Nvidia) to support claims of increased instruction volume, memory footprint, and higher i-cache stress versus CPU2017. However, this selection may under-sample axes such as vector-width variation or novel cache-coherence protocols, and platform coverage is load-bearing for the generalizability of the 'shifts pressure toward emerging bottlenecks' claim; the paper should add explicit discussion of platform limitations and sensitivity tests.
Authors: We agree that an explicit discussion of platform limitations is warranted to better support the generalizability of our claims regarding workload shifts. In the revised version, we will add a new subsection in §3 that details the rationale for selecting the nine platforms (spanning recent Intel, AMD, Ampere, and Nvidia processors) while acknowledging that they do not exhaustively cover all microarchitectural axes, such as every possible vector width variation or novel cache coherence protocols. We will also incorporate sensitivity tests by reporting how the key trends (increased instruction volume, memory footprint, and i-cache stress) hold consistently across the available platforms, thereby providing directional evidence without overstating universality. revision: yes
Referee: [Abstract and §4] Clustering-based representativeness analysis: The central cost-effectiveness claim rests on subsets of 4-5 workloads preserving 96.4-99.9% of full-suite behavior via clustering on microarchitectural metrics. The abstract presents these fidelity numbers without error bars, statistical details, or a full methodology (e.g., chosen metrics, algorithm parameters, or outlier validation), leaving it unclear whether all relevant interactions are captured; this requires expansion to verify the representativeness result.
Authors: We will expand §4 with the requested details to make the clustering analysis fully transparent and verifiable. Specifically, we will include: the complete list of microarchitectural metrics used for clustering, the exact algorithm and parameters (e.g., number of clusters and distance metric), outlier validation steps, and statistical measures such as error bars or variance on the reported 96.4-99.9% fidelity values. The abstract will be lightly revised to note that full methodological details and statistical support appear in §4. This expansion will clarify how the 4-5 workload subsets capture relevant interactions while preserving high fidelity to the full suite. revision: yes
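Neither the pith nor the rebuttal defines how the 96.4-99.9% preservation figures are computed. One common construction — assumed here, not taken from the paper — weights each representative workload by the size of its cluster and compares the weighted estimate of a suite-level metric against the true full-suite mean:

```python
def subset_fidelity(full_metric, reps, weights):
    """Fidelity (in [0, 1]) of a cluster-weighted subset estimate of a
    suite-level metric versus the true full-suite average.

    full_metric: per-workload values of one metric (e.g. IPC) for the suite
    reps:        indices of the representative workloads
    weights:     cluster sizes, i.e. how many workloads each rep stands for
    """
    full_mean = sum(full_metric) / len(full_metric)
    est = sum(w * full_metric[i] for i, w in zip(reps, weights)) / sum(weights)
    return 1.0 - abs(est - full_mean) / full_mean

# Hypothetical 6-workload group: reps 1 and 2 stand for clusters of 4 and 2.
ipc = [1.0, 1.1, 2.0, 2.1, 0.9, 1.0]
fidelity = subset_fidelity(ipc, reps=[1, 2], weights=[4, 2])  # ~0.963
```

Under this (assumed) definition, the statistical detail the referee asks for would amount to reporting fidelity per metric and per platform with variance, rather than only the aggregate 96.4-99.9% range.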
Circularity Check
Empirical characterization with no circular derivations or self-referential reductions
Full rationale
The paper's core claims rest on direct empirical measurements: running SPEC CPU2026 workloads on nine external platforms (Intel, AMD, Ampere, Nvidia) to quantify instruction volume, memory footprint, and i-cache stress relative to SPEC CPU2017, followed by standard clustering on microarchitectural counters to identify compact subsets whose measured behavior is then validated against the full suite (yielding the 96.4-99.9% preservation figures). Cross-suite positioning against DCPerf and MLPerf uses the same external metrics without any fitted parameters renamed as predictions, self-citations invoked for uniqueness theorems, or ansatzes smuggled from prior author work. No derivation step reduces by construction to its own inputs; the representativeness analysis is falsifiable via the reported metric comparisons and remains independent of the paper's conclusions.
Axiom & Free-Parameter Ledger
free parameters (1)
- subset size per group
axioms (1)
- domain assumption: SPEC CPU workloads and the chosen microarchitectural metrics adequately capture real-world CPU bottlenecks in datacenter and AI-orchestration scenarios