Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective

Hong Wang; Ishita Vohra; Ritik Raj; Souvik Kundu; Tushar Krishna

arxiv: 2511.00739 · v3 · submitted 2025-11-01 · 💻 cs.AI · cs.LG· cs.MA

Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective

Ritik Raj , Souvik Kundu , Ishita Vohra , Hong Wang , Tushar Krishna This is my paper

Pith reviewed 2026-05-18 00:52 UTC · model grok-4.3

classification 💻 cs.AI cs.LGcs.MA

keywords agentic AICPU-centric analysisscheduling optimizationmicro-batchinglatency reductionheterogeneous workloadsLLM servingsystem bottlenecks

0 comments

The pith

Agentic AI execution benefits from CPU-centric bottleneck analysis that leads to overlapped micro-batching and mixed scheduling, cutting latencies by factors of 1.7x to 3.9x on hybrid hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to shift focus from GPU-heavy LLM inference to the CPU's central role in agentic AI, where planning, tool calls, reasoning, and adaptation often execute or get orchestrated on the CPU in heterogeneous CPU-GPU setups. It starts with compile-time characterization to pick representative workloads that reflect algorithmic variety, then measures end-to-end latency and throughput at runtime on two distinct hardware systems to pinpoint specific architectural limits. From those measurements the authors derive two scheduling methods: CPU-Aware Overlapped Micro-Batching for uniform workloads and Mixed Agentic Scheduling for mixed request types, both designed to raise concurrent CPU-GPU use and lessen uneven resource splits. A sympathetic reader would care because agentic systems are expanding into autonomous problem-solving that depends on efficient CPU orchestration; fixing overlooked CPU bottlenecks could make real deployments faster and more scalable without new hardware.

Core claim

We first present a compile-time characterization of agentic AI execution and choose representative workloads to capture the algorithmic diversity. We then perform runtime characterization of the representative workloads analyzing the end-to-end latency and throughput on two different hardware systems to isolate respective architectural bottlenecks. Based on the insights on the bottlenecks, we finally present two scheduling optimizations, namely, 1. CPU-Aware Overlapped Micro-Batching (COMB) and 2. Mixed Agentic Scheduling (MAS) on homogeneous and heterogeneous agentic workloads, respectively. These methods optimize for improved CPU-GPU concurrent utilization while reducing skewed resource a

What carries the argument

CPU-Aware Overlapped Micro-Batching (COMB) for homogeneous cases and Mixed Agentic Scheduling (MAS) for heterogeneous cases, which increase concurrent CPU-GPU utilization and balance allocation across request types.

If this is right

COMB delivers up to 1.7x lower P50 latency for standalone homogeneous workloads.
Under homogeneous open-loop load, COMB yields up to 3.9x lower service latency and 1.8x lower total latency.
MAS reduces total latency for minority request types by up to 2.37x at P50 and 2.49x at P90 in heterogeneous open-loop settings.
Both techniques improve performance by raising CPU-GPU overlap and correcting skewed resource allocation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending COMB and MAS to multi-agent workflows with dozens of tools could reveal whether the same overlap and mixing principles scale without additional coordination overhead.
The CPU-centric view might apply to other hybrid systems such as real-time robotic planning, where similar tool-orchestration loads occur.
If new agentic models shift more work to the CPU, the bottleneck patterns identified here could become the dominant constraint rather than GPU compute.
Combining elements of COMB and MAS into a single adaptive scheduler could handle workloads that transition between homogeneous and heterogeneous phases.

Load-bearing premise

The chosen representative workloads adequately represent the full diversity of agentic AI tasks and their CPU demands.

What would settle it

Measure the same latency and throughput metrics on a fresh collection of agentic workloads that differ substantially from the original representative set; if the reported speedups shrink or disappear, the optimizations do not generalize.

Figures

Figures reproduced from arXiv: 2511.00739 by Hong Wang, Ishita Vohra, Ritik Raj, Souvik Kundu, Tushar Krishna.

**Figure 1.** Figure 1: Characterization of agentic AI workloads on the basis of (a) Orchestrator (LLM and Host) (b) Agentic Path (Static and Dynamic) and (c) Repetitiveness (Single-step and Multi-step) of cores, coherence and synchronization or GPU factors - main memory capacity and bandwidth. 3 CPU dynamic energy consumes up to 44% of the total dynamic energy at large batch sizes. To the best of our knowledge, this is the first… view at source ↗

**Figure 2.** Figure 2: (a) Haystack with ENNS retrieval on QA benchmarks (b) Toolformer with WolframAlpha API on Math benchmarks (c) Chemcrow with literature (Arxiv/Pubmed) search tool on Chemistry benchmarks (d) Langchain with web search and LexRank summarization tools on QA benchmarks (e) Mini-SWE-Agent with bash/Python execution tools on coding benchmarks We choose a custom agentic pipeline (web search − > summarization − > … view at source ↗

**Figure 3.** Figure 3: Comparison of multi-processing and multi-threading with sequential baseline (single core) for Langchain workload 4.3 Throughput We begin by analyzing CPU parallelism in Section 4.3.1 and deriving effective strategies. Thereafter, we pair the learnt strategy on CPU side along with well-studied GPU parallelization strategy to parallelize multiple agentic requests. However, we identify two throughput bottlen… view at source ↗

**Figure 4.** Figure 4: (a) vLLM throughput saturation for GPT-OSS-20B model (b) Throughput saturation for various agentic workloads (c) Average time taken by different components in Langchain benchmark showing a critical CPU context switching bottleneck at batch size 128 and remote memory references incur higher latency that stall pipelines and saturate on-socket fabrics.(Mattson et al., 2008) argues that the overhead of cache c… view at source ↗

**Figure 5.** Figure 5: CPU (AMD Threadripper) and GPU (Nvidia B200) dynamic energy consumption for Langchain workload CPU 0-31 CPU 64-95 CPU 96-128 CPU 0-31 CPU 32-64 CPU 32-63 (a) Multi-processing (MP) (b) CGAM Time CPU 0-31 CPU 64-95 (c) CGAMoverlap Methods P50 P90 MP 2x 2x CGAM x 2x CGAM_overlap 1.2x 1.8x x 2x CPU 31-63 CPU 96-128 Batch Tool (CPU) Batch Inference (GPU) x 2x [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Timeline of batched agentic AI inference for (a) Multiprocessing, (b) CGAM, and (c) CGAMoverlap Key Takeaway 3: CPU dynamic energy share becomes significant (44%) at large batch size (128), as CPU parallelism is less energy efficient compared to GPU. 5 OPTIMIZATIONS Based on throughput saturation insights (Section 4.3), we present two scheduling optimizations- 1 CPU and GPU Aware Micro-batching (CGAM- Sec… view at source ↗

**Figure 7.** Figure 7: Comparison of CGAM and CGAMoverlap using Bcap = 64 against baseline (multi-processing or multi-threading) on (a) Langchain workload on FreshQA benchmark, (b) Haystack workload on NQ benchmark and (c) SWE-Agent on APPS benchmark 5.1.3 CGAMoverlap We can also utilize the remaining idle CPUs for more speed-up at the cost of energy. For mixed agentic workloads with comparable CPU and GPU latencies, we present… view at source ↗

**Figure 8.** Figure 8: Comparison of MAWS against multiprocessing baseline on 128 mixed Langchain tasks (half LLM heavy, half CPU heavy) [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 9.** Figure 9: Comparison of MAWS+CGAM against multiprocessing baseline on 256 mixed Langchain tasks Therefore, we need to limit the CPU usage of LLM heavy tasks. Since, they are LLM-heavy, we can use the lighter multi-threading for parallel vLLM API I/O. This frees up a lot of CPU resources making the CPU heavy tasks more effective. Therefore, we can optimize mixed agentic AI inference through adaptive multi-processin… view at source ↗

read the original abstract

Agentic AI serving converts monolithic LLM-based inference to autonomous problem-solvers that can plan, call tools, perform reasoning, and adapt on the fly. Due to diverse task execution need, such serving heavily rely on heterogeneous CPU-GPU systems with majority of the external tools responsible for agentic capability, either run on or are orchestrated by the CPU. Towards having a deeper understanding of its role, this paper aims to characterize and analyze the system bottlenecks introduced by agentic AI workloads from a largely overlooked CPU-centric perspective. We first present a compile-time characterization of agentic AI execution and choose representative workloads to capture the algorithmic diversity. We then perform runtime characterization of the representative workloads analyzing the end-to-end latency and throughput on two different hardware systems to isolate respective architectural bottlenecks. Based on the insights on the bottlenecks, we finally present two scheduling optimizations, namely, 1. CPU-Aware Overlapped Micro-Batching (COMB) and 2. Mixed Agentic Scheduling (MAS) on homogeneous and heterogeneous agentic workloads, respectively. In specific, these methods optimize for improved CPU-GPU concurrent utilization while reducing skewed resource allocation for heterogeneous execution. Experimental evaluations on the two hardware systems demonstrate the efficacy of COMB in yielding up to 1.7x lower P50 latency in standalone homogeneous workload execution and up to 3.9x/1.8x lower service/total latency under homogeneous open-loop load. Additionally, for heterogeneous open-loop load, MAS can reduce the total latency for minority request-type by up to 2.37x/2.49x at P50/P90 percentile.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper adds a CPU-centric characterization of agentic AI and two schedulers (COMB, MAS) that cut latency in the reported tests, but the gains rest on workloads whose representativeness is assumed rather than shown.

read the letter

The main point is that agentic AI serving leans on the CPU for tools and orchestration more than pure LLM inference does, and the authors map out bottlenecks from that angle before proposing COMB for uniform loads and MAS for mixed ones. Their experiments on two hardware setups report concrete wins, including 1.7x lower P50 latency with COMB in standalone runs and up to 3.9x service latency under open-loop homogeneous load, plus MAS cutting minority-request latency by roughly 2.4x at P50/P90 in heterogeneous cases. That is useful data for anyone tuning CPU-GPU overlap in these systems. The compile-time workload selection and runtime breakdown give a clear path from observation to the two policies, and the numbers come from actual measurements rather than simulation alone. The soft spot is the workload step. They pick cases to cover algorithmic diversity, yet the abstract gives no breakdown of tool complexity, reasoning depth, or interaction patterns, so it is unclear whether the CPU bottlenecks and scheduler benefits would appear in a wider set of agents. Details on variance, exact baselines, and scheduler implementation are also thin here, which keeps the claims at moderate strength until the full paper is checked. This is for systems researchers and engineers who build serving stacks for agentic models on heterogeneous hardware. Readers who care about practical scheduling for mixed CPU-GPU traffic would get direct value. The work has enough new policies and measured results to go to a serious referee, though it will likely need more on workload coverage and statistical robustness.

Referee Report

2 major / 2 minor

Summary. The paper characterizes agentic AI execution from a CPU-centric perspective on heterogeneous CPU-GPU systems. It performs compile-time analysis to select representative workloads capturing algorithmic diversity, conducts runtime measurements on two hardware systems to isolate architectural bottlenecks, and proposes CPU-Aware Overlapped Micro-Batching (COMB) for homogeneous workloads and Mixed Agentic Scheduling (MAS) for heterogeneous workloads to improve concurrent utilization and reduce skewed allocation. Experiments report up to 1.7x lower P50 latency with COMB and up to 3.9x/1.8x and 2.37x/2.49x latency reductions with MAS under open-loop loads.

Significance. If the results hold with stronger validation, the work provides timely empirical insights into overlooked CPU bottlenecks in agentic AI serving and practical scheduling techniques for better CPU-GPU utilization. The focus on compile-time/runtime characterization and concrete latency gains on real hardware systems adds practical value for optimizing autonomous agent deployments.

major comments (2)

[§3] §3 (compile-time characterization and representative workloads): The selection of workloads is described as capturing 'algorithmic diversity' but no verification, diversity metrics, or justification of representativeness (e.g., coverage of tool complexity or interaction patterns) is provided. This assumption is load-bearing for the central claim, as non-representative workloads would undermine the identified CPU bottlenecks and the reported efficacy of COMB and MAS.
[§5] §5 (runtime characterization and experimental evaluation): The reported improvements (1.7x P50 latency for COMB; 3.9x/1.8x service/total and 2.37x/2.49x minority latency for MAS) lack details on statistical variance across runs, exact baseline scheduler implementations, workload selection criteria, and hardware-specific configurations. These omissions make it difficult to assess robustness and reproducibility of the bottleneck isolation and optimization claims.

minor comments (2)

[Abstract] The abstract introduces COMB and MAS without expanding the acronyms on first use, which reduces immediate readability.
[§5] Figure captions and axis labels in the runtime characterization plots could more explicitly state the exact metrics (P50 vs. P90) and load conditions for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript characterizing CPU bottlenecks in agentic AI serving. We address each major comment point by point below and have revised the paper to improve justification and reproducibility where the comments identify gaps.

read point-by-point responses

Referee: [§3] §3 (compile-time characterization and representative workloads): The selection of workloads is described as capturing 'algorithmic diversity' but no verification, diversity metrics, or justification of representativeness (e.g., coverage of tool complexity or interaction patterns) is provided. This assumption is load-bearing for the central claim, as non-representative workloads would undermine the identified CPU bottlenecks and the reported efficacy of COMB and MAS.

Authors: We agree that the original manuscript would benefit from explicit justification and metrics for workload representativeness. In the revised version, we have expanded Section 3 with a new subsection that details our selection criteria. Workloads were drawn from widely used agentic frameworks (LangChain, AutoGPT, and ReAct-style agents) to span variations in tool-call density, reasoning depth, and external service interaction patterns. We now report simple compile-time diversity metrics, including variance in CPU instruction counts, memory access patterns, and tool complexity scores across the chosen set. While these workloads do not exhaustively cover every conceivable agentic behavior, they target the primary sources of CPU heterogeneity that drive the bottlenecks analyzed in the paper. This addition directly supports the central claims without altering the experimental results. revision: yes
Referee: [§5] §5 (runtime characterization and experimental evaluation): The reported improvements (1.7x P50 latency for COMB; 3.9x/1.8x service/total and 2.37x/2.49x minority latency for MAS) lack details on statistical variance across runs, exact baseline scheduler implementations, workload selection criteria, and hardware-specific configurations. These omissions make it difficult to assess robustness and reproducibility of the bottleneck isolation and optimization claims.

Authors: We concur that additional experimental details are required for robustness and reproducibility. The revised Section 5 now includes: (1) statistical variance reported as mean ± standard deviation over five independent runs for all latency and throughput figures; (2) the baseline scheduler described as the default FIFO scheduler in our serving stack with no explicit CPU affinity or priority settings; (3) workload selection criteria explicitly linked to the compile-time analysis (high vs. low tool-call density and reasoning chain length); and (4) precise hardware configurations, including CPU model (Intel Xeon Gold 6248R), GPU (NVIDIA A100), memory bandwidth, and OS/kernel settings for each of the two systems. A new appendix provides configuration files and raw measurement data. These changes allow readers to replicate the bottleneck isolation and the reported gains from COMB and MAS. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical characterization and scheduling optimizations rest on direct measurements

full rationale

The paper performs compile-time and runtime characterization of selected agentic AI workloads on heterogeneous CPU-GPU hardware, identifies bottlenecks through latency and throughput measurements, and evaluates two proposed schedulers (COMB for homogeneous and MAS for heterogeneous cases) via explicit experiments on two systems. All reported gains (1.7x P50 latency, 3.9x/1.8x service/total latency, 2.37x/2.49x minority latency) are obtained from these hardware runs rather than from any equations, fitted parameters renamed as predictions, or self-citations that close a definitional loop. Workload selection is presented as a methodological choice whose representativeness is an external assumption, not a self-referential derivation; no load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on standard domain assumptions about heterogeneous hardware and workload diversity without introducing new free parameters or invented entities.

axioms (1)

domain assumption Agentic AI serving heavily relies on heterogeneous CPU-GPU systems with majority of external tools run on or orchestrated by the CPU.
This premise is stated directly in the abstract and underpins the entire CPU-centric perspective.

pith-pipeline@v0.9.0 · 5844 in / 1385 out tokens · 41836 ms · 2026-05-18T00:52:07.761431+00:00 · methodology

discussion (0)

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems
cs.AI 2026-05 unverdicted novelty 7.0

Proposes EpG and OOI metrics showing agentic workflows use 4.33x more energy per successful goal than linear baselines due to orchestration structure.
SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison
cs.AR 2026-05 unverdicted novelty 7.0

SPEC CPU2026 increases instruction volume and memory footprint while shifting pressure to instruction-cache bottlenecks; 4-5 workload subsets per group preserve 96.4-99.9% of full-suite behavior and show complementary...
SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison
cs.AR 2026-05 unverdicted novelty 6.0

SPEC CPU2026 raises instruction volume and memory demands while shifting pressure to instruction caches; 4-5 workload subsets per group preserve 96.4-99.9% of full-suite microarchitectural behavior and better approxim...
KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving
cs.DC 2026-04 unverdicted novelty 6.0

KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.
LARA: Validation-Driven Agentic Supercomputer Workflows for Atomistic Modeling
physics.comp-ph 2026-04 unverdicted novelty 4.0

LARA-HPC introduces a validation-first agentic system with dry-run verification and multi-phase refinement that improves robustness of AI-generated DFT workflows on HPC systems.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 4 Pith papers · 24 internal anchors

[1]

Phi-4 Technical Report

AgentGPT.https://agentgpt.reworkd.ai/. LlamaIndex - Build Knowledge Assistants over your Enter- prise Data.https://www.llamaindex.ai/,. Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Efficient and scalable agentic ai with heteroge- neous systems.arXiv preprint arXiv:2507.19635, 2025

Anthropic. Claude code. https://www.claude.com/ product/claude-code. Asgar, Z., Nguyen, M., and Katti, S. Efficient and scalable agentic ai with heterogeneous systems.arXiv preprint arXiv:2507.19635,

work page arXiv
[3]

Small Language Models are the Future of Agentic AI

URL https://arxiv.org/abs/2506.02153. Berglund, L., Stickland, A. C., Balesni, M., Kaufmann, M., Tong, M., Korbak, T., Kokotajlo, D., and Evans, O. Taken out of context: On measuring situational awareness in llms.arXiv preprint arXiv:2309.00667,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Measuring NUMA effects with the STREAM benchmark

Bergstrom, L. Measuring numa effects with the stream benchmark.arXiv preprint arXiv:1103.3225,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901,

work page 1901
[6]

Documenting large webtext corpora: A case study on the colossal clean crawled corpus,

Deepset-Ai. haystack. https://github.com/ deepset-ai/haystack. Dodge, J., Sap, M., Marasovi ´c, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., and Gardner, M. Documenting large webtext corpora: A case study on the colossal clean crawled corpus.arXiv preprint arXiv:2104.08758,

work page arXiv
[7]

The Faiss library

Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazar´e, P.-E., Lomeli, M., Hosseini, L., and J ´egou, H. The faiss library.arXiv preprint arXiv:2401.08281,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Gunasekar, S., Zhang, Y ., Aneja, J., Mendes, C. C. T., Del Giorno, A., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., et al. Textbooks are all you need.arXiv preprint arXiv:2306.11644,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Measuring Massive Multitask Language Understanding

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring mas- sive multitask language understanding.arXiv preprint arXiv:2009.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2009
[10]

Measuring Coding Challenge Competence With APPS

Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., et al. Measuring coding challenge competence with apps.arXiv preprint arXiv:2105.09938,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. Swe-bench: Can language mod- els resolve real-world github issues?arXiv preprint arXiv:2310.06770,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension.arXiv preprint arXiv:1705.03551,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., San- thanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., et al. Dspy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint arXiv:2310.03714,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

The cost of dynamic reasoning: Demystifying ai agents and test- time scaling from an ai infrastructure perspective.arXiv preprint arXiv:2506.04301,

Kim, J., Shin, B., Chung, J., and Rhu, M. The cost of dynamic reasoning: Demystifying ai agents and test- time scaling from an ai infrastructure perspective.arXiv preprint arXiv:2506.04301,

work page arXiv
[16]

Internet-augmented dialogue generation

Komeili, M., Shuster, K., and Weston, J. Internet-augmented dialogue generation.arXiv preprint arXiv:2107.07566,

work page arXiv
[17]

Mawps: A math word problem reposi- tory

Koncel-Kedziorski, R., Roy, S., Amini, A., Kushman, N., and Hajishirzi, H. Mawps: A math word problem reposi- tory. InProceedings of the 2016 conference of the north american chapter of the association for computational lin- guistics: human language technologies, pp. 1152–1157,

work page 2016
[18]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Lin, S., Hilton, J., and Evans, O. Truthfulqa: Measuring how models mimic human falsehoods.arXiv preprint arXiv:2109.07958,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

AgentBench: Evaluating LLMs as Agents

Liu, X., Yu, H., Zhang, H., Xu, Y ., Lei, X., Lai, H., Gu, Y ., Ding, H., Men, K., Yang, K., et al. Agentbench: Evalu- ating llms as agents.arXiv preprint arXiv:2308.03688,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

G., Van der Wijngaart, R., and Frumkin, M

Mattson, T. G., Van der Wijngaart, R., and Frumkin, M. Pro- gramming the intel 80-core network-on-a-chip terascale processor. InSC’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1–11. IEEE,

work page 2008
[21]

On faithfulness and factuality in abstractive summarization

Maynez, J., Narayan, S., Bohnet, B., and McDonald, R. On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661,

work page arXiv 2005
[22]

Miao, C.-C

Miao, S.-Y ., Liang, C.-C., and Su, K.-Y . A diverse corpus for evaluating and developing english math word problem solvers.arXiv preprint arXiv:2106.15772,

work page arXiv
[23]

WebGPT: Browser-assisted question-answering with human feedback

microsoft. GitHub - microsoft/semantic-kernel: Integrate cutting-edge LLM technology quickly and easily into your apps. https://github.com/microsoft/ semantic-kernel. Nakajima, Y . Babyagi, 2023.https://github.com/ yoheinakajima/babyagi. Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V ., Saunders, W., et a...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization

Narayan, S., Cohen, S. B., and Lapata, M. Don’t give me the details, just the summary! topic-aware convolutional neu- ral networks for extreme summarization.arXiv preprint arXiv:1808.08745,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Balrog: Bench- marking agentic llm and vlm reasoning on games.arXiv preprint arXiv:2411.13543, 2024

Paglieri, D., Cupiał, B., Coward, S., Piterbarg, U., Wolczyk, M., Khan, A., Pignatelli, E., Kuci´nski, Ł., Pinto, L., Fer- gus, R., et al. Balrog: Benchmarking agentic llm and vlm reasoning on games.arXiv preprint arXiv:2411.13543,

work page arXiv
[26]

From single core to multi-core: preparing for a new exponential

Parkhurst, J., Darringer, J., and Grundmann, B. From single core to multi-core: preparing for a new exponential. In Proceedings of the 2006 IEEE/ACM international confer- ence on Computer-aided design, pp. 67–72,

work page 2006
[27]

Are NLP Models really able to Solve Simple Math Word Problems?

Patel, A., Bhattamishra, S., and Goyal, N. Are nlp models really able to solve simple math word problems?arXiv preprint arXiv:2103.07191,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Large language models can self-improve at web agent tasks

Patel, A., Hofmarcher, M., Leoveanu-Condrei, C., Dinu, M.-C., Callison-Burch, C., and Hochreiter, S. Large language models can self-improve at web agent tasks. arXiv preprint arXiv:2405.20309,

work page arXiv
[29]

Quinn, D., Nouri, M., Patel, N., Salihu, J., Salemi, A., Lee, S., Zamani, H., and Alian, M

Hugging Face model card; accessed 2025-10-05. Quinn, D., Nouri, M., Patel, N., Salihu, J., Salemi, A., Lee, S., Zamani, H., and Alian, M. Accelerating retrieval- augmented generation. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 15–32,

work page 2025
[30]

Know What You Don't Know: Unanswerable Questions for SQuAD

Rajpurkar, P., Jia, R., and Liang, P. Know what you don’t know: Unanswerable questions for squad.arXiv preprint arXiv:1806.03822,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Jordi Torres, and Josep Ll

Recasens, P. G., Agullo, F., Zhu, Y ., Wang, C., Lee, E. K., Tardieu, O., Torres, J., and Berral, J. L. Mind the mem- ory gap: Unveiling gpu bottlenecks in large-batch llm inference.arXiv preprint arXiv:2503.08311,

work page arXiv
[32]

Agentic AI: A Conceptual Taxonomy, Applications and Challenges

Sapkota, R., Roumeliotis, K. I., and Karkee, M. Ai agents vs. agentic ai: A conceptual taxonomy, applications and challenges.arXiv preprint arXiv:2505.10468,

work page arXiv
[33]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Shridhar, M., Yuan, X., C ˆot´e, M.-A., Bisk, Y ., Trischler, A., and Hausknecht, M. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768,

work page internal anchor Pith review Pith/arXiv arXiv 2010
[34]

Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning

Singh, J., Magazine, R., Pandya, Y ., and Nambi, A. Agentic reasoning and tool integration for llms via reinforcement learning.arXiv preprint arXiv:2505.01441,

work page internal anchor Pith review arXiv
[35]

J., Ting, D

Thirunavukarasu, A. J., Ting, D. S. J., Elangovan, K., Gutier- rez, L., Tan, T. F., and Ting, D. S. W. Large language models in medicine.Nature medicine, 29(8):1930–1940,

work page 1930
[36]

Vu, T., Iyyer, M., Wang, X., Constant, N., Wei, J., Wei, J., Tar, C., Sung, Y .-H., Zhou, D., Le, Q., et al

URL https://arxiv.org/abs/ 2504.11750. Vu, T., Iyyer, M., Wang, X., Constant, N., Wei, J., Wei, J., Tar, C., Sung, Y .-H., Zhou, D., Le, Q., et al. Freshllms: Refreshing large language models with search engine augmentation.arXiv preprint arXiv:2310.03214,

work page arXiv
[37]

Large language models for education: A survey and outlook

Wang, S., Xu, T., Li, H., Zhang, C., Liang, J., Tang, J., Yu, P. S., and Wen, Q. Large language models for education: A survey and outlook.arXiv preprint arXiv:2403.18105,

work page arXiv
[38]

Conveyor: Effi- cient tool-aware llm serving with tool partial execution

Xu, Y ., Kong, X., Chen, T., and Zhuo, D. Conveyor: Effi- cient tool-aware llm serving with tool partial execution. arXiv preprint arXiv:2406.00059,

work page arXiv
[39]

Auto-gpt for online decision making: Benchmarks and additional opinions.arXiv preprint arXiv:2306.02224, 2023

Yang, H., Yue, S., and He, Y . Auto-gpt for online decision making: Benchmarks and additional opinions.arXiv preprint arXiv:2306.02224,

work page arXiv
[40]

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Yang, Z., Qi, P., Zhang, S., Bengio, Y ., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. Hotpotqa: A dataset for diverse, explainable multi-hop question an- swering.arXiv preprint arXiv:1809.09600,

work page internal anchor Pith review Pith/arXiv arXiv
[41]

OPT: Open Pre-trained Transformer Language Models

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V ., et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,

work page internal anchor Pith review Pith/arXiv arXiv
[42]

Geogpt: Understanding and processing geospatial tasks through an autonomous gpt.arXiv preprint arXiv:2307.07930,

Zhang, Y ., Wei, C., Wu, S., He, Z., and Yu, W. Geogpt: Understanding and processing geospatial tasks through an autonomous gpt.arXiv preprint arXiv:2307.07930,

work page arXiv
[43]

A Survey of Large Language Models

Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y ., Min, Y ., Zhang, B., Zhang, J., Dong, Z., et al. A survey of large language models.arXiv preprint arXiv:2303.18223, 1(2),

work page internal anchor Pith review Pith/arXiv arXiv
[44]

Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., and Wang, Y .-X. Language agent tree search unifies reasoning acting and planning in language models.arXiv preprint arXiv:2310.04406, 2023a. Zhou, G., Hong, Y ., and Wu, Q. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. InProceedings of the AAAI Conference on A...

work page internal anchor Pith review Pith/arXiv arXiv
[45]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y ., Fried, D., et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023b. Zhuo, T. Y ., Vu, M. C., Chim, J., Hu, H., Yu, W., Widyasari, R., Yusuf, I. N. B., Zhan, H., He, J., Paul, I., et al. Big- codebench: Bench...

work page internal anchor Pith review Pith/arXiv arXiv
[46]

A WORKLOADIMPLEMENTATIONDETAILS A.1 Toolformer We choose the same AI model (GPT-J 6B), calculation tool (WolframAlpha API (Wolfram—Alpha)) and mathematical benchmarks (ASDiv (Miao et al., 2021), SV AMP (Patel et al.,

work page 2021
[47]

A.2 SWE-Agent We choose mini-SWE-agent (SWE-agent), a research bench- marking version of SWE-agent using Qwen2.5-Coder-32B (Hui et al.,

and MAWPS (Koncel-Kedziorski et al., 2016)) for profiling as used in the original paper (Schick et al., 2023). A.2 SWE-Agent We choose mini-SWE-agent (SWE-agent), a research bench- marking version of SWE-agent using Qwen2.5-Coder-32B (Hui et al.,

work page 2016
[48]

We choose benchmarks derived from APPS (Hendrycks et al., 2021), BigCodeBench (Zhuo et al.,

model specifically suited for coding ap- plications. We choose benchmarks derived from APPS (Hendrycks et al., 2021), BigCodeBench (Zhuo et al.,

work page 2021
[49]

A.3 Haystack We choose ENNS top-5 retrieval using faiss FLAT (Douze et al.,

and DS-1000 (Lai et al., 2023), which are computation- ally intensive and can comprehensively showcase the CPU perspective. A.3 Haystack We choose ENNS top-5 retrieval using faiss FLAT (Douze et al.,

work page 2023
[50]

document corpus (305 GB english variant) for pro- filing using Natural Questions (NQ) (Kwiatkowski et al., 2019), HotpotQA (Yang et al.,

work page 2019
[51]

We evaluate the workload on FreshQA (Vu et al., 2023), MusiQue (Trivedi et al.,

summarizer for summarization and GPT-OSS-20B model for LLM inference. We evaluate the workload on FreshQA (Vu et al., 2023), MusiQue (Trivedi et al.,

work page 2023

[1] [1]

Phi-4 Technical Report

AgentGPT.https://agentgpt.reworkd.ai/. LlamaIndex - Build Knowledge Assistants over your Enter- prise Data.https://www.llamaindex.ai/,. Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Efficient and scalable agentic ai with heteroge- neous systems.arXiv preprint arXiv:2507.19635, 2025

Anthropic. Claude code. https://www.claude.com/ product/claude-code. Asgar, Z., Nguyen, M., and Katti, S. Efficient and scalable agentic ai with heterogeneous systems.arXiv preprint arXiv:2507.19635,

work page arXiv

[3] [3]

Small Language Models are the Future of Agentic AI

URL https://arxiv.org/abs/2506.02153. Berglund, L., Stickland, A. C., Balesni, M., Kaufmann, M., Tong, M., Korbak, T., Kokotajlo, D., and Evans, O. Taken out of context: On measuring situational awareness in llms.arXiv preprint arXiv:2309.00667,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Measuring NUMA effects with the STREAM benchmark

Bergstrom, L. Measuring numa effects with the stream benchmark.arXiv preprint arXiv:1103.3225,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901,

work page 1901

[6] [6]

Documenting large webtext corpora: A case study on the colossal clean crawled corpus,

Deepset-Ai. haystack. https://github.com/ deepset-ai/haystack. Dodge, J., Sap, M., Marasovi ´c, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., and Gardner, M. Documenting large webtext corpora: A case study on the colossal clean crawled corpus.arXiv preprint arXiv:2104.08758,

work page arXiv

[7] [7]

The Faiss library

Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazar´e, P.-E., Lomeli, M., Hosseini, L., and J ´egou, H. The faiss library.arXiv preprint arXiv:2401.08281,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Gunasekar, S., Zhang, Y ., Aneja, J., Mendes, C. C. T., Del Giorno, A., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., et al. Textbooks are all you need.arXiv preprint arXiv:2306.11644,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Measuring Massive Multitask Language Understanding

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring mas- sive multitask language understanding.arXiv preprint arXiv:2009.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2009

[10] [10]

Measuring Coding Challenge Competence With APPS

Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., et al. Measuring coding challenge competence with apps.arXiv preprint arXiv:2105.09938,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. Swe-bench: Can language mod- els resolve real-world github issues?arXiv preprint arXiv:2310.06770,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension.arXiv preprint arXiv:1705.03551,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., San- thanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., et al. Dspy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint arXiv:2310.03714,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

The cost of dynamic reasoning: Demystifying ai agents and test- time scaling from an ai infrastructure perspective.arXiv preprint arXiv:2506.04301,

Kim, J., Shin, B., Chung, J., and Rhu, M. The cost of dynamic reasoning: Demystifying ai agents and test- time scaling from an ai infrastructure perspective.arXiv preprint arXiv:2506.04301,

work page arXiv

[16] [16]

Internet-augmented dialogue generation

Komeili, M., Shuster, K., and Weston, J. Internet-augmented dialogue generation.arXiv preprint arXiv:2107.07566,

work page arXiv

[17] [17]

Mawps: A math word problem reposi- tory

Koncel-Kedziorski, R., Roy, S., Amini, A., Kushman, N., and Hajishirzi, H. Mawps: A math word problem reposi- tory. InProceedings of the 2016 conference of the north american chapter of the association for computational lin- guistics: human language technologies, pp. 1152–1157,

work page 2016

[18] [18]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Lin, S., Hilton, J., and Evans, O. Truthfulqa: Measuring how models mimic human falsehoods.arXiv preprint arXiv:2109.07958,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

AgentBench: Evaluating LLMs as Agents

Liu, X., Yu, H., Zhang, H., Xu, Y ., Lei, X., Lai, H., Gu, Y ., Ding, H., Men, K., Yang, K., et al. Agentbench: Evalu- ating llms as agents.arXiv preprint arXiv:2308.03688,

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

G., Van der Wijngaart, R., and Frumkin, M

Mattson, T. G., Van der Wijngaart, R., and Frumkin, M. Pro- gramming the intel 80-core network-on-a-chip terascale processor. InSC’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1–11. IEEE,

work page 2008

[21] [21]

On faithfulness and factuality in abstractive summarization

Maynez, J., Narayan, S., Bohnet, B., and McDonald, R. On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661,

work page arXiv 2005

[22] [22]

Miao, C.-C

Miao, S.-Y ., Liang, C.-C., and Su, K.-Y . A diverse corpus for evaluating and developing english math word problem solvers.arXiv preprint arXiv:2106.15772,

work page arXiv

[23] [23]

WebGPT: Browser-assisted question-answering with human feedback

microsoft. GitHub - microsoft/semantic-kernel: Integrate cutting-edge LLM technology quickly and easily into your apps. https://github.com/microsoft/ semantic-kernel. Nakajima, Y . Babyagi, 2023.https://github.com/ yoheinakajima/babyagi. Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V ., Saunders, W., et a...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [24]

Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization

Narayan, S., Cohen, S. B., and Lapata, M. Don’t give me the details, just the summary! topic-aware convolutional neu- ral networks for extreme summarization.arXiv preprint arXiv:1808.08745,

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

Balrog: Bench- marking agentic llm and vlm reasoning on games.arXiv preprint arXiv:2411.13543, 2024

Paglieri, D., Cupiał, B., Coward, S., Piterbarg, U., Wolczyk, M., Khan, A., Pignatelli, E., Kuci´nski, Ł., Pinto, L., Fer- gus, R., et al. Balrog: Benchmarking agentic llm and vlm reasoning on games.arXiv preprint arXiv:2411.13543,

work page arXiv

[26] [26]

From single core to multi-core: preparing for a new exponential

Parkhurst, J., Darringer, J., and Grundmann, B. From single core to multi-core: preparing for a new exponential. In Proceedings of the 2006 IEEE/ACM international confer- ence on Computer-aided design, pp. 67–72,

work page 2006

[27] [27]

Are NLP Models really able to Solve Simple Math Word Problems?

Patel, A., Bhattamishra, S., and Goyal, N. Are nlp models really able to solve simple math word problems?arXiv preprint arXiv:2103.07191,

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

Large language models can self-improve at web agent tasks

Patel, A., Hofmarcher, M., Leoveanu-Condrei, C., Dinu, M.-C., Callison-Burch, C., and Hochreiter, S. Large language models can self-improve at web agent tasks. arXiv preprint arXiv:2405.20309,

work page arXiv

[29] [29]

Quinn, D., Nouri, M., Patel, N., Salihu, J., Salemi, A., Lee, S., Zamani, H., and Alian, M

Hugging Face model card; accessed 2025-10-05. Quinn, D., Nouri, M., Patel, N., Salihu, J., Salemi, A., Lee, S., Zamani, H., and Alian, M. Accelerating retrieval- augmented generation. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 15–32,

work page 2025

[30] [30]

Know What You Don't Know: Unanswerable Questions for SQuAD

Rajpurkar, P., Jia, R., and Liang, P. Know what you don’t know: Unanswerable questions for squad.arXiv preprint arXiv:1806.03822,

work page internal anchor Pith review Pith/arXiv arXiv

[31] [31]

Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Jordi Torres, and Josep Ll

Recasens, P. G., Agullo, F., Zhu, Y ., Wang, C., Lee, E. K., Tardieu, O., Torres, J., and Berral, J. L. Mind the mem- ory gap: Unveiling gpu bottlenecks in large-batch llm inference.arXiv preprint arXiv:2503.08311,

work page arXiv

[32] [32]

Agentic AI: A Conceptual Taxonomy, Applications and Challenges

Sapkota, R., Roumeliotis, K. I., and Karkee, M. Ai agents vs. agentic ai: A conceptual taxonomy, applications and challenges.arXiv preprint arXiv:2505.10468,

work page arXiv

[33] [33]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Shridhar, M., Yuan, X., C ˆot´e, M.-A., Bisk, Y ., Trischler, A., and Hausknecht, M. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768,

work page internal anchor Pith review Pith/arXiv arXiv 2010

[34] [34]

Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning

Singh, J., Magazine, R., Pandya, Y ., and Nambi, A. Agentic reasoning and tool integration for llms via reinforcement learning.arXiv preprint arXiv:2505.01441,

work page internal anchor Pith review arXiv

[35] [35]

J., Ting, D

Thirunavukarasu, A. J., Ting, D. S. J., Elangovan, K., Gutier- rez, L., Tan, T. F., and Ting, D. S. W. Large language models in medicine.Nature medicine, 29(8):1930–1940,

work page 1930

[36] [36]

Vu, T., Iyyer, M., Wang, X., Constant, N., Wei, J., Wei, J., Tar, C., Sung, Y .-H., Zhou, D., Le, Q., et al

URL https://arxiv.org/abs/ 2504.11750. Vu, T., Iyyer, M., Wang, X., Constant, N., Wei, J., Wei, J., Tar, C., Sung, Y .-H., Zhou, D., Le, Q., et al. Freshllms: Refreshing large language models with search engine augmentation.arXiv preprint arXiv:2310.03214,

work page arXiv

[37] [37]

Large language models for education: A survey and outlook

Wang, S., Xu, T., Li, H., Zhang, C., Liang, J., Tang, J., Yu, P. S., and Wen, Q. Large language models for education: A survey and outlook.arXiv preprint arXiv:2403.18105,

work page arXiv

[38] [38]

Conveyor: Effi- cient tool-aware llm serving with tool partial execution

Xu, Y ., Kong, X., Chen, T., and Zhuo, D. Conveyor: Effi- cient tool-aware llm serving with tool partial execution. arXiv preprint arXiv:2406.00059,

work page arXiv

[39] [39]

Auto-gpt for online decision making: Benchmarks and additional opinions.arXiv preprint arXiv:2306.02224, 2023

Yang, H., Yue, S., and He, Y . Auto-gpt for online decision making: Benchmarks and additional opinions.arXiv preprint arXiv:2306.02224,

work page arXiv

[40] [40]

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Yang, Z., Qi, P., Zhang, S., Bengio, Y ., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. Hotpotqa: A dataset for diverse, explainable multi-hop question an- swering.arXiv preprint arXiv:1809.09600,

work page internal anchor Pith review Pith/arXiv arXiv

[41] [41]

OPT: Open Pre-trained Transformer Language Models

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V ., et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,

work page internal anchor Pith review Pith/arXiv arXiv

[42] [42]

Geogpt: Understanding and processing geospatial tasks through an autonomous gpt.arXiv preprint arXiv:2307.07930,

Zhang, Y ., Wei, C., Wu, S., He, Z., and Yu, W. Geogpt: Understanding and processing geospatial tasks through an autonomous gpt.arXiv preprint arXiv:2307.07930,

work page arXiv

[43] [43]

A Survey of Large Language Models

Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y ., Min, Y ., Zhang, B., Zhang, J., Dong, Z., et al. A survey of large language models.arXiv preprint arXiv:2303.18223, 1(2),

work page internal anchor Pith review Pith/arXiv arXiv

[44] [44]

Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., and Wang, Y .-X. Language agent tree search unifies reasoning acting and planning in language models.arXiv preprint arXiv:2310.04406, 2023a. Zhou, G., Hong, Y ., and Wu, Q. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. InProceedings of the AAAI Conference on A...

work page internal anchor Pith review Pith/arXiv arXiv

[45] [45]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y ., Fried, D., et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023b. Zhuo, T. Y ., Vu, M. C., Chim, J., Hu, H., Yu, W., Widyasari, R., Yusuf, I. N. B., Zhan, H., He, J., Paul, I., et al. Big- codebench: Bench...

work page internal anchor Pith review Pith/arXiv arXiv

[46] [46]

A WORKLOADIMPLEMENTATIONDETAILS A.1 Toolformer We choose the same AI model (GPT-J 6B), calculation tool (WolframAlpha API (Wolfram—Alpha)) and mathematical benchmarks (ASDiv (Miao et al., 2021), SV AMP (Patel et al.,

work page 2021

[47] [47]

A.2 SWE-Agent We choose mini-SWE-agent (SWE-agent), a research bench- marking version of SWE-agent using Qwen2.5-Coder-32B (Hui et al.,

and MAWPS (Koncel-Kedziorski et al., 2016)) for profiling as used in the original paper (Schick et al., 2023). A.2 SWE-Agent We choose mini-SWE-agent (SWE-agent), a research bench- marking version of SWE-agent using Qwen2.5-Coder-32B (Hui et al.,

work page 2016

[48] [48]

We choose benchmarks derived from APPS (Hendrycks et al., 2021), BigCodeBench (Zhuo et al.,

model specifically suited for coding ap- plications. We choose benchmarks derived from APPS (Hendrycks et al., 2021), BigCodeBench (Zhuo et al.,

work page 2021

[49] [49]

A.3 Haystack We choose ENNS top-5 retrieval using faiss FLAT (Douze et al.,

and DS-1000 (Lai et al., 2023), which are computation- ally intensive and can comprehensively showcase the CPU perspective. A.3 Haystack We choose ENNS top-5 retrieval using faiss FLAT (Douze et al.,

work page 2023

[50] [50]

document corpus (305 GB english variant) for pro- filing using Natural Questions (NQ) (Kwiatkowski et al., 2019), HotpotQA (Yang et al.,

work page 2019

[51] [51]

We evaluate the workload on FreshQA (Vu et al., 2023), MusiQue (Trivedi et al.,

summarizer for summarization and GPT-OSS-20B model for LLM inference. We evaluate the workload on FreshQA (Vu et al., 2023), MusiQue (Trivedi et al.,

work page 2023