Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective
Pith reviewed 2026-05-18 00:52 UTC · model grok-4.3
The pith
Agentic AI execution benefits from CPU-centric bottleneck analysis, which motivates overlapped micro-batching and mixed scheduling that cut latency by up to 1.7x to 3.9x, depending on workload and load setting, on heterogeneous CPU-GPU hardware.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We first present a compile-time characterization of agentic AI execution and choose representative workloads to capture the algorithmic diversity. We then perform runtime characterization of the representative workloads analyzing the end-to-end latency and throughput on two different hardware systems to isolate respective architectural bottlenecks. Based on the insights on the bottlenecks, we finally present two scheduling optimizations, namely, 1. CPU-Aware Overlapped Micro-Batching (COMB) and 2. Mixed Agentic Scheduling (MAS) on homogeneous and heterogeneous agentic workloads, respectively. These methods optimize for improved CPU-GPU concurrent utilization while reducing skewed resource allocation for heterogeneous execution.
What carries the argument
CPU-Aware Overlapped Micro-Batching (COMB) for homogeneous cases and Mixed Agentic Scheduling (MAS) for heterogeneous cases, which increase concurrent CPU-GPU utilization and balance allocation across request types.
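The overlap idea can be illustrated with a small timing model. This is a minimal sketch, not the paper's implementation: it assumes each request has a CPU phase (for example tool execution) followed by a GPU phase (LLM inference), that both phases scale linearly with the number of requests, and that the CPU and the GPU each process one micro-batch at a time. The function names and parameters are hypothetical.

```python
def sequential_makespan(batch_size, cpu_per_req, gpu_per_req):
    """Whole batch finishes its CPU phase before the GPU phase starts (no overlap)."""
    return batch_size * cpu_per_req + batch_size * gpu_per_req


def overlapped_makespan(batch_size, micro_batch, cpu_per_req, gpu_per_req):
    """Two-stage pipeline: the GPU runs micro-batch i while the CPU prepares i+1.

    Assumption (not from the paper): per-request costs are constant and
    independent of the micro-batch size.
    """
    # Split the batch into micro-batches; the last one may be smaller.
    sizes = [micro_batch] * (batch_size // micro_batch)
    if batch_size % micro_batch:
        sizes.append(batch_size % micro_batch)

    cpu_free = 0.0  # time at which the CPU stage becomes idle
    gpu_free = 0.0  # time at which the GPU stage becomes idle
    for n in sizes:
        cpu_done = cpu_free + n * cpu_per_req
        cpu_free = cpu_done
        # The GPU can start only when the CPU output exists and the GPU is idle.
        gpu_start = max(cpu_done, gpu_free)
        gpu_free = gpu_start + n * gpu_per_req
    return gpu_free


if __name__ == "__main__":
    print(sequential_makespan(8, 1.0, 1.0))
    print(overlapped_makespan(8, 2, 1.0, 1.0))
```

In this model, setting the micro-batch size equal to the batch size reproduces the sequential case, so any gain comes purely from the overlap between the two stages. Real GPU batching is usually sublinear in batch size, which this sketch deliberately ignores.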
If this is right
- COMB delivers up to 1.7x lower P50 latency for standalone homogeneous workloads.
- Under homogeneous open-loop load, COMB yields up to 3.9x lower service latency and 1.8x lower total latency.
- MAS reduces total latency for minority request types by up to 2.37x at P50 and 2.49x at P90 in heterogeneous open-loop settings.
- Both techniques improve performance by raising CPU-GPU overlap and correcting skewed resource allocation.
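Why mixing request types can help a minority type is easy to see in a toy single-server model. The sketch below is an illustration under stated assumptions, not MAS itself: this page does not describe the MAS policy in detail, so plain round-robin across per-type queues is swapped in as a stand-in. All requests are assumed to be present at time zero, and the request tuples and function names are hypothetical.

```python
from collections import deque


def mixed_order(requests):
    """Round-robin across per-type queues, keeping arrival order within a type.

    `requests` is a list of (request_type, service_time) tuples in arrival order.
    """
    queues = {}
    for req in requests:
        queues.setdefault(req[0], deque()).append(req)

    order = []
    while queues:
        # Iterate over a snapshot of the keys so emptied queues can be removed.
        for rtype in list(queues):
            order.append(queues[rtype].popleft())
            if not queues[rtype]:
                del queues[rtype]
    return order


def completion_times(order):
    """Single server, all requests available at t=0; returns {type: [finish times]}."""
    clock = 0.0
    finished = {}
    for rtype, service in order:
        clock += service
        finished.setdefault(rtype, []).append(clock)
    return finished


if __name__ == "__main__":
    # Six majority requests arrive ahead of two minority requests.
    reqs = [("A", 1.0)] * 6 + [("B", 1.0)] * 2
    print("FIFO :", completion_times(reqs)["B"])
    print("mixed:", completion_times(mixed_order(reqs))["B"])
```

Under FIFO the minority requests wait behind the entire majority burst, while the interleaved order serves them early at a modest cost to the tail of the majority type. Whether a real scheduler should accept that trade depends on the service-level targets for each request type.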
Where Pith is reading between the lines
- Extending COMB and MAS to multi-agent workflows with dozens of tools could reveal whether the same overlap and mixing principles scale without additional coordination overhead.
- The CPU-centric view might apply to other hybrid systems such as real-time robotic planning, where similar tool-orchestration loads occur.
- If new agentic models shift more work to the CPU, the bottleneck patterns identified here could become the dominant constraint rather than GPU compute.
- Combining elements of COMB and MAS into a single adaptive scheduler could handle workloads that transition between homogeneous and heterogeneous phases.
Load-bearing premise
The chosen representative workloads adequately represent the full diversity of agentic AI tasks and their CPU demands.
What would settle it
Measure the same latency and throughput metrics on a fresh collection of agentic workloads that differ substantially from the original representative set; if the reported speedups shrink or disappear, the optimizations do not generalize.
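A replication of this check would reduce to comparing percentile latencies between a baseline and an optimized scheduler on the new workloads. The following is a minimal sketch of that comparison, assuming per-request latency samples are available as plain lists; the nearest-rank percentile definition and the function names are choices made here, not taken from the paper.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list, for p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def speedup(baseline, optimized, p):
    """Ratio of baseline to optimized latency at percentile p (above 1 means faster)."""
    return percentile(baseline, p) / percentile(optimized, p)


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds, for illustration only.
    baseline = [10.0, 20.0, 30.0, 40.0]
    optimized = [5.0, 10.0, 15.0, 20.0]
    for p in (50, 90):
        print(f"P{p} speedup: {speedup(baseline, optimized, p):.2f}x")
```

Different percentile conventions (nearest-rank versus interpolated) can shift the ratio noticeably for small sample counts, so a replication should state which one it uses.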
Abstract
Agentic AI serving converts monolithic LLM-based inference to autonomous problem-solvers that can plan, call tools, perform reasoning, and adapt on the fly. Due to diverse task execution needs, such serving relies heavily on heterogeneous CPU-GPU systems, with the majority of the external tools responsible for agentic capability either running on or being orchestrated by the CPU. Towards a deeper understanding of the CPU's role, this paper aims to characterize and analyze the system bottlenecks introduced by agentic AI workloads from a largely overlooked CPU-centric perspective. We first present a compile-time characterization of agentic AI execution and choose representative workloads to capture the algorithmic diversity. We then perform runtime characterization of the representative workloads, analyzing the end-to-end latency and throughput on two different hardware systems to isolate the respective architectural bottlenecks. Based on the insights into these bottlenecks, we finally present two scheduling optimizations, namely (1) CPU-Aware Overlapped Micro-Batching (COMB) and (2) Mixed Agentic Scheduling (MAS), for homogeneous and heterogeneous agentic workloads, respectively. Specifically, these methods optimize for improved concurrent CPU-GPU utilization while reducing skewed resource allocation for heterogeneous execution. Experimental evaluations on the two hardware systems demonstrate the efficacy of COMB, which yields up to 1.7x lower P50 latency in standalone homogeneous workload execution and up to 3.9x/1.8x lower service/total latency under homogeneous open-loop load. Additionally, for heterogeneous open-loop load, MAS can reduce the total latency for the minority request type by up to 2.37x/2.49x at P50/P90.
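The distinction between service latency and total latency under open-loop load can be made concrete with a toy queue. This is a sketch under assumptions that go beyond this page: arrivals are taken to be Poisson and independent of completions (the usual meaning of open-loop), service latency is taken as the time from service start to completion, and total latency as the time from arrival to completion. The paper's exact definitions may differ, and all names here are hypothetical.

```python
import random


def open_loop_latencies(arrival_rate, service_time, n_requests, seed=0):
    """Simulate a single FIFO server fed by an open-loop Poisson arrival stream.

    Returns two lists: per-request service latency and per-request total
    latency (queueing delay plus service).
    """
    rng = random.Random(seed)
    clock = 0.0        # arrival time of the current request
    server_free = 0.0  # time at which the server next becomes idle
    service, total = [], []
    for _ in range(n_requests):
        # The next arrival does not wait for earlier requests to finish.
        clock += rng.expovariate(arrival_rate)
        start = max(clock, server_free)
        server_free = start + service_time
        service.append(server_free - start)
        total.append(server_free - clock)
    return service, total


if __name__ == "__main__":
    service, total = open_loop_latencies(arrival_rate=2.0, service_time=1.0, n_requests=200)
    print("mean service latency:", sum(service) / len(service))
    print("mean total latency:  ", sum(total) / len(total))
```

When the arrival rate exceeds what the server can sustain, total latency grows with queue length while service latency stays flat, which is why the two figures are reported separately.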
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper characterizes agentic AI execution from a CPU-centric perspective on heterogeneous CPU-GPU systems. It performs compile-time analysis to select representative workloads capturing algorithmic diversity, conducts runtime measurements on two hardware systems to isolate architectural bottlenecks, and proposes CPU-Aware Overlapped Micro-Batching (COMB) for homogeneous workloads and Mixed Agentic Scheduling (MAS) for heterogeneous workloads to improve concurrent utilization and reduce skewed allocation. Experiments report up to 1.7x lower P50 latency with COMB in standalone homogeneous execution, up to 3.9x/1.8x lower service/total latency with COMB under homogeneous open-loop load, and up to 2.37x/2.49x lower P50/P90 total latency for the minority request type with MAS under heterogeneous open-loop load.
Significance. If the results hold with stronger validation, the work provides timely empirical insights into overlooked CPU bottlenecks in agentic AI serving and practical scheduling techniques for better CPU-GPU utilization. The focus on compile-time/runtime characterization and concrete latency gains on real hardware systems adds practical value for optimizing autonomous agent deployments.
major comments (2)
- [§3] §3 (compile-time characterization and representative workloads): The selection of workloads is described as capturing 'algorithmic diversity' but no verification, diversity metrics, or justification of representativeness (e.g., coverage of tool complexity or interaction patterns) is provided. This assumption is load-bearing for the central claim, as non-representative workloads would undermine the identified CPU bottlenecks and the reported efficacy of COMB and MAS.
- [§5] §5 (runtime characterization and experimental evaluation): The reported improvements (1.7x P50 latency and 3.9x/1.8x service/total latency for COMB; 2.37x/2.49x P50/P90 minority-type latency for MAS) lack details on statistical variance across runs, exact baseline scheduler implementations, workload selection criteria, and hardware-specific configurations. These omissions make it difficult to assess robustness and reproducibility of the bottleneck isolation and optimization claims.
minor comments (2)
- [Abstract] The abstract introduces COMB and MAS without expanding the acronyms on first use, which reduces immediate readability.
- [§5] Figure captions and axis labels in the runtime characterization plots could more explicitly state the exact metrics (P50 vs. P90) and load conditions for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript characterizing CPU bottlenecks in agentic AI serving. We address each major comment point by point below and have revised the paper to improve justification and reproducibility where the comments identify gaps.
point-by-point responses
-
Referee: [§3] §3 (compile-time characterization and representative workloads): The selection of workloads is described as capturing 'algorithmic diversity' but no verification, diversity metrics, or justification of representativeness (e.g., coverage of tool complexity or interaction patterns) is provided. This assumption is load-bearing for the central claim, as non-representative workloads would undermine the identified CPU bottlenecks and the reported efficacy of COMB and MAS.
Authors: We agree that the original manuscript would benefit from explicit justification and metrics for workload representativeness. In the revised version, we have expanded Section 3 with a new subsection that details our selection criteria. Workloads were drawn from widely used agentic frameworks (LangChain, AutoGPT, and ReAct-style agents) to span variations in tool-call density, reasoning depth, and external service interaction patterns. We now report simple compile-time diversity metrics, including variance in CPU instruction counts, memory access patterns, and tool complexity scores across the chosen set. While these workloads do not exhaustively cover every conceivable agentic behavior, they target the primary sources of CPU heterogeneity that drive the bottlenecks analyzed in the paper. This addition directly supports the central claims without altering the experimental results. revision: yes
-
Referee: [§5] §5 (runtime characterization and experimental evaluation): The reported improvements (1.7x P50 latency and 3.9x/1.8x service/total latency for COMB; 2.37x/2.49x P50/P90 minority-type latency for MAS) lack details on statistical variance across runs, exact baseline scheduler implementations, workload selection criteria, and hardware-specific configurations. These omissions make it difficult to assess robustness and reproducibility of the bottleneck isolation and optimization claims.
Authors: We concur that additional experimental details are required for robustness and reproducibility. The revised Section 5 now includes: (1) statistical variance reported as mean ± standard deviation over five independent runs for all latency and throughput figures; (2) the baseline scheduler described as the default FIFO scheduler in our serving stack with no explicit CPU affinity or priority settings; (3) workload selection criteria explicitly linked to the compile-time analysis (high vs. low tool-call density and reasoning chain length); and (4) precise hardware configurations, including CPU model (Intel Xeon Gold 6248R), GPU (NVIDIA A100), memory bandwidth, and OS/kernel settings for each of the two systems. A new appendix provides configuration files and raw measurement data. These changes allow readers to replicate the bottleneck isolation and the reported gains from COMB and MAS. revision: yes
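The mean ± standard deviation reporting described in the response can be reproduced with the standard library. This is a minimal sketch assuming one latency figure per independent run; the response does not say whether sample or population standard deviation is used, so the sample form (n - 1) is an assumption here, and the function names and values are hypothetical.

```python
import statistics


def summarize_runs(latencies_ms):
    """Return (mean, sample standard deviation) over independent runs."""
    mean = statistics.mean(latencies_ms)
    std = statistics.stdev(latencies_ms)  # sample standard deviation, divides by n - 1
    return mean, std


def format_summary(latencies_ms):
    """Format a run summary in the 'mean ± std' style used for reporting."""
    mean, std = summarize_runs(latencies_ms)
    return f"{mean:.1f} ± {std:.1f} ms (n={len(latencies_ms)})"


if __name__ == "__main__":
    # Hypothetical P50 latencies from five independent runs, for illustration only.
    runs = [100.0, 102.0, 98.0, 101.0, 99.0]
    print(format_summary(runs))
```

With only five runs the standard deviation is itself a noisy estimate, so reporting the raw per-run values alongside the summary, as the promised appendix would, is the more informative choice.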
Circularity Check
No circularity: empirical characterization and scheduling optimizations rest on direct measurements
full rationale
The paper performs compile-time and runtime characterization of selected agentic AI workloads on heterogeneous CPU-GPU hardware, identifies bottlenecks through latency and throughput measurements, and evaluates two proposed schedulers (COMB for homogeneous and MAS for heterogeneous cases) via explicit experiments on two systems. All reported gains (1.7x P50 latency, 3.9x/1.8x service/total latency, 2.37x/2.49x minority latency) are obtained from these hardware runs rather than from any equations, fitted parameters renamed as predictions, or self-citations that close a definitional loop. Workload selection is presented as a methodological choice whose representativeness is an external assumption, not a self-referential derivation; no load-bearing step reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Agentic AI serving relies heavily on heterogeneous CPU-GPU systems, with the majority of external tools run on or orchestrated by the CPU.
Forward citations
Cited by 5 Pith papers
-
Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems
Proposes EpG and OOI metrics showing agentic workflows use 4.33x more energy per successful goal than linear baselines due to orchestration structure.
-
SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison
SPEC CPU2026 increases instruction volume and memory footprint while shifting pressure to instruction-cache bottlenecks; 4-5 workload subsets per group preserve 96.4-99.9% of full-suite behavior and show complementary...
-
SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison
SPEC CPU2026 raises instruction volume and memory demands while shifting pressure to instruction caches; 4-5 workload subsets per group preserve 96.4-99.9% of full-suite microarchitectural behavior and better approxim...
-
KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving
KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.
-
LARA: Validation-Driven Agentic Supercomputer Workflows for Atomistic Modeling
LARA-HPC introduces a validation-first agentic system with dry-run verification and multi-phase refinement that improves robustness of AI-generated DFT workflows on HPC systems.
Reference graph
Works this paper leans on
-
[1]
AgentGPT. https://agentgpt.reworkd.ai/. LlamaIndex - Build Knowledge Assistants over your Enterprise Data. https://www.llamaindex.ai/. Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905,
-
[2]
Efficient and scalable agentic ai with heterogeneous systems. arXiv preprint arXiv:2507.19635, 2025
Anthropic. Claude code. https://www.claude.com/product/claude-code. Asgar, Z., Nguyen, M., and Katti, S. Efficient and scalable agentic ai with heterogeneous systems. arXiv preprint arXiv:2507.19635,
-
[3]
Small Language Models are the Future of Agentic AI
URL https://arxiv.org/abs/2506.02153. Berglund, L., Stickland, A. C., Balesni, M., Kaufmann, M., Tong, M., Korbak, T., Kokotajlo, D., and Evans, O. Taken out of context: On measuring situational awareness in llms. arXiv preprint arXiv:2309.00667,
-
[4]
Measuring NUMA effects with the STREAM benchmark
Bergstrom, L. Measuring numa effects with the stream benchmark. arXiv preprint arXiv:1103.3225,
-
[5]
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901,
-
[6]
Documenting large webtext corpora: A case study on the colossal clean crawled corpus
Deepset-Ai. haystack. https://github.com/deepset-ai/haystack. Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., and Gardner, M. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. arXiv preprint arXiv:2104.08758,
-
[7]
Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.-E., Lomeli, M., Hosseini, L., and Jégou, H. The faiss library. arXiv preprint arXiv:2401.08281,
-
[8]
Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Del Giorno, A., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644,
-
[9]
Measuring Massive Multitask Language Understanding
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,
-
[10]
Measuring Coding Challenge Competence With APPS
Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., et al. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938,
-
[11]
Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186,
-
[12]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770,
-
[13]
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551,
-
[14]
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., et al. Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714,
-
[15]
Kim, J., Shin, B., Chung, J., and Rhu, M. The cost of dynamic reasoning: Demystifying ai agents and test-time scaling from an ai infrastructure perspective. arXiv preprint arXiv:2506.04301,
-
[16]
Internet-augmented dialogue generation
Komeili, M., Shuster, K., and Weston, J. Internet-augmented dialogue generation. arXiv preprint arXiv:2107.07566,
-
[17]
Mawps: A math word problem repository
Koncel-Kedziorski, R., Roy, S., Amini, A., Kushman, N., and Hajishirzi, H. Mawps: A math word problem repository. In Proceedings of the 2016 conference of the north american chapter of the association for computational linguistics: human language technologies, pp. 1152–1157,
-
[18]
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Lin, S., Hilton, J., and Evans, O. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958,
-
[19]
AgentBench: Evaluating LLMs as Agents
Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688,
-
[20]
Mattson, T. G., Van der Wijngaart, R., and Frumkin, M. Programming the intel 80-core network-on-a-chip terascale processor. In SC'08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1–11. IEEE,
-
[21]
On faithfulness and factuality in abstractive summarization
Maynez, J., Narayan, S., Bohnet, B., and McDonald, R. On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661,
-
[22]
Miao, S.-Y., Liang, C.-C., and Su, K.-Y. A diverse corpus for evaluating and developing english math word problem solvers. arXiv preprint arXiv:2106.15772,
-
[23]
WebGPT: Browser-assisted question-answering with human feedback
microsoft. GitHub - microsoft/semantic-kernel: Integrate cutting-edge LLM technology quickly and easily into your apps. https://github.com/microsoft/semantic-kernel. Nakajima, Y. Babyagi, 2023. https://github.com/yoheinakajima/babyagi. Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et a...
-
[24]
Narayan, S., Cohen, S. B., and Lapata, M. Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745,
-
[25]
Balrog: Benchmarking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543, 2024
Paglieri, D., Cupiał, B., Coward, S., Piterbarg, U., Wolczyk, M., Khan, A., Pignatelli, E., Kuciński, Ł., Pinto, L., Fergus, R., et al. Balrog: Benchmarking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543,
-
[26]
From single core to multi-core: preparing for a new exponential
Parkhurst, J., Darringer, J., and Grundmann, B. From single core to multi-core: preparing for a new exponential. In Proceedings of the 2006 IEEE/ACM international conference on Computer-aided design, pp. 67–72,
-
[27]
Are NLP Models really able to Solve Simple Math Word Problems?
Patel, A., Bhattamishra, S., and Goyal, N. Are nlp models really able to solve simple math word problems? arXiv preprint arXiv:2103.07191,
-
[28]
Large language models can self-improve at web agent tasks
Patel, A., Hofmarcher, M., Leoveanu-Condrei, C., Dinu, M.-C., Callison-Burch, C., and Hochreiter, S. Large language models can self-improve at web agent tasks. arXiv preprint arXiv:2405.20309,
-
[29]
Hugging Face model card; accessed 2025-10-05. Quinn, D., Nouri, M., Patel, N., Salihu, J., Salemi, A., Lee, S., Zamani, H., and Alian, M. Accelerating retrieval-augmented generation. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 15–32,
-
[30]
Know What You Don't Know: Unanswerable Questions for SQuAD
Rajpurkar, P., Jia, R., and Liang, P. Know what you don't know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822,
-
[31]
Recasens, P. G., Agullo, F., Zhu, Y., Wang, C., Lee, E. K., Tardieu, O., Torres, J., and Berral, J. L. Mind the memory gap: Unveiling gpu bottlenecks in large-batch llm inference. arXiv preprint arXiv:2503.08311,
-
[32]
Agentic AI: A Conceptual Taxonomy, Applications and Challenges
Sapkota, R., Roumeliotis, K. I., and Karkee, M. Ai agents vs. agentic ai: A conceptual taxonomy, applications and challenges. arXiv preprint arXiv:2505.10468,
-
[33]
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
Shridhar, M., Yuan, X., Côté, M.-A., Bisk, Y., Trischler, A., and Hausknecht, M. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768,
-
[34]
Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
Singh, J., Magazine, R., Pandya, Y., and Nambi, A. Agentic reasoning and tool integration for llms via reinforcement learning. arXiv preprint arXiv:2505.01441,
-
[35]
Thirunavukarasu, A. J., Ting, D. S. J., Elangovan, K., Gutierrez, L., Tan, T. F., and Ting, D. S. W. Large language models in medicine. Nature medicine, 29(8):1930–1940,
-
[36]
URL https://arxiv.org/abs/2504.11750. Vu, T., Iyyer, M., Wang, X., Constant, N., Wei, J., Wei, J., Tar, C., Sung, Y.-H., Zhou, D., Le, Q., et al. Freshllms: Refreshing large language models with search engine augmentation. arXiv preprint arXiv:2310.03214,
-
[37]
Large language models for education: A survey and outlook
Wang, S., Xu, T., Li, H., Zhang, C., Liang, J., Tang, J., Yu, P. S., and Wen, Q. Large language models for education: A survey and outlook. arXiv preprint arXiv:2403.18105,
-
[38]
Conveyor: Efficient tool-aware llm serving with tool partial execution
Xu, Y., Kong, X., Chen, T., and Zhuo, D. Conveyor: Efficient tool-aware llm serving with tool partial execution. arXiv preprint arXiv:2406.00059,
-
[39]
Yang, H., Yue, S., and He, Y. Auto-gpt for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224,
-
[40]
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600,
-
[41]
OPT: Open Pre-trained Transformer Language Models
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,
-
[42]
Zhang, Y., Wei, C., Wu, S., He, Z., and Yu, W. Geogpt: Understanding and processing geospatial tasks through an autonomous gpt. arXiv preprint arXiv:2307.07930,
-
[43]
A Survey of Large Language Models
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 1(2),
-
[44]
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models
Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., and Wang, Y.-X. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406, 2023a. Zhou, G., Hong, Y., and Wu, Q. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. In Proceedings of the AAAI Conference on A...
-
[45]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023b. Zhuo, T. Y., Vu, M. C., Chim, J., Hu, H., Yu, W., Widyasari, R., Yusuf, I. N. B., Zhan, H., He, J., Paul, I., et al. Bigcodebench: Bench...
-
[46]
A WORKLOAD IMPLEMENTATION DETAILS
A.1 Toolformer
We choose the same AI model (GPT-J 6B), calculation tool (WolframAlpha API (Wolfram|Alpha)) and mathematical benchmarks (ASDiv (Miao et al., 2021), SVAMP (Patel et al.,
-
[47]
and MAWPS (Koncel-Kedziorski et al., 2016)) for profiling as used in the original paper (Schick et al., 2023).
A.2 SWE-Agent
We choose mini-SWE-agent (SWE-agent), a research benchmarking version of SWE-agent using Qwen2.5-Coder-32B (Hui et al.,
-
[48]
model specifically suited for coding applications. We choose benchmarks derived from APPS (Hendrycks et al., 2021), BigCodeBench (Zhuo et al.,
-
[49]
and DS-1000 (Lai et al., 2023), which are computationally intensive and can comprehensively showcase the CPU perspective.
A.3 Haystack
We choose ENNS top-5 retrieval using faiss FLAT (Douze et al.,
-
[50]
document corpus (305 GB English variant) for profiling using Natural Questions (NQ) (Kwiatkowski et al., 2019), HotpotQA (Yang et al.,
-
[51]
summarizer for summarization and GPT-OSS-20B model for LLM inference. We evaluate the workload on FreshQA (Vu et al., 2023), MusiQue (Trivedi et al.,