AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices

Ali Shahin Shamsabadi; Dzung Pham; Hamed Haddadi; Kleomenis Katevas

arXiv: 2605.15206 · v1 · pith:WPYOEUJUnew · submitted 2026-05-01 · 💻 cs.LG · cs.AI · cs.DC

Pith reviewed 2026-05-19 16:31 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.DC

keywords: early termination · local LLM agents · energy efficiency · consumer devices · token log probabilities · agentic workflows · sustainable AI · early stopping

The pith

Local LLM agents can stop themselves early using token log probabilities to cut wasted energy by 15-20% on consumer devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Local agents for tasks like web question answering and coding keep running on phones and laptops even after they are headed for failure, which raises power draw and battery drain compared with ordinary single-turn queries. The work shows that cheap signals already present during execution, such as how surprised the model is at each token, can flag doomed runs in advance. Stopping those runs early trims overall energy waste by 15 to 20 percent while the fraction of tasks that ultimately succeed falls by less than 5 percent. A reader who wants private, offline agents on everyday hardware would care because the extra cost of multi-step reasoning is the main obstacle to wider local deployment.

Core claim

AgentStop is a lightweight supervisor that monitors low-cost execution signals such as token-level log probabilities during agent runs. It predicts trajectories unlikely to succeed and terminates them before they consume more resources. On web-based question answering and coding benchmarks this reduces wasted energy by 15-20 percent with a utility drop below 5 percent.
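As a rough illustration of what "monitoring token-level log probabilities" could look like, the sketch below summarises one agent step and applies a fixed cutoff. It is not the paper's predictor, which this summary does not specify. The function names, the cutoff of -8.0, and the choice of k are assumptions made for the example; the feature names `min_logprob` and `mean_logprob` echo baseline labels that appear in the page's figure text.

```python
from statistics import mean


def logprob_features(token_logprobs, k=10):
    """Summarise one agent step's token log probabilities.

    The k smallest values are kept because very unlikely tokens are the
    "surprise" signal. k=10 is an assumption for this sketch.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    smallest = sorted(token_logprobs)[:k]
    return {
        "min_logprob": smallest[0],
        "mean_logprob": mean(token_logprobs),
        "k_smallest": smallest,
    }


def should_stop(features, min_threshold=-8.0):
    # Hypothetical rule: stop when the most surprising token in the step
    # falls below the cutoff. A learned classifier could replace this.
    return features["min_logprob"] < min_threshold


# Toy log probabilities for a single step, invented for illustration.
step = [-0.1, -0.3, -9.2, -0.05, -1.4]
feats = logprob_features(step, k=3)
print(feats["k_smallest"], should_stop(feats))
```

A single-threshold rule like this is the simplest possible supervisor. It makes the trade-off visible: a looser cutoff stops fewer runs and saves less energy, while a tighter one risks halting runs that would have recovered.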

What carries the argument

AgentStop, a predictive early-termination supervisor that uses token-level log probabilities and similar low-cost signals to decide when to halt agent trajectories.

If this is right

Local agent deployments on battery-powered devices become more practical because the extra power cost of iterative reasoning shrinks.
Privacy-preserving execution on consumer hardware competes better with cloud agents on total energy and heat.
The same termination logic can be added to other multi-step agent workloads without changing the underlying model.
Overall token consumption and run time drop for the subset of tasks that would have failed anyway.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the same signals work across task families, early stopping could be added to local image or video generation pipelines to limit energy spikes.
The approach may need per-domain thresholds because a signal that looks like failure in one benchmark could precede a rare but valuable success in another.
Combining the supervisor with device-level power governors could compound the savings beyond what software termination alone achieves.

Load-bearing premise

Token-level log probabilities and other cheap execution signals are sufficiently predictive of final task failure to permit early termination without introducing many harmful false positives.

What would settle it

A controlled run in which every task that AgentStop would have terminated is instead allowed to finish, and the number of additional successes obtained is counted.
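The counting step in that experiment can be sketched as follows. The data layout (one pair per run) and the definition of utility drop as the share of all successes that would be lost are assumptions for this example, not definitions taken from the paper.

```python
def audit_terminations(runs):
    """Summarise a counterfactual experiment in which nothing is actually stopped.

    Each run is a (would_terminate, succeeded) pair: whether the supervisor
    would have halted the run, and whether the run succeeded when left to finish.
    """
    flagged_outcomes = [ok for would_terminate, ok in runs if would_terminate]
    total_successes = sum(1 for _, ok in runs if ok)
    lost = sum(1 for ok in flagged_outcomes if ok)
    # Assumption: utility drop = share of all successes that would be lost.
    drop = 100.0 * lost / total_successes if total_successes else 0.0
    return {
        "flagged": len(flagged_outcomes),
        "lost_successes": lost,
        "utility_drop_pct": drop,
    }


# Toy outcomes, invented for illustration only.
runs = [
    (True, False), (True, False), (True, True), (False, True),
    (False, True), (False, True), (False, False), (False, True),
]
report = audit_terminations(runs)
print(report)
```

In this toy data, three runs are flagged and one of them would have succeeded, so one of five successes is lost. A real audit would use the benchmark's own success criterion for the second element of each pair.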

Figures

Figures reproduced from arXiv: 2605.15206 by Ali Shahin Shamsabadi, Dzung Pham, Hamed Haddadi, Kleomenis Katevas.

**Figure 1.** Profile of instantaneous power (left y-axis) and hardware temperature (right y-axis) over time (x-axis) on an Apple M1 ...

**Figure 2.** Histogram of the number of agent steps (x-axis) ...

**Figure 3.** Cumulative energy contribution (%) (y-axis) of each ...

**Figure 4.** Top 10 smallest logprobs (sorted) (y-axis) per agent ...

**Figure 5.** Histogram of the number of agent steps (x-axis) ...

**Figure 7.** Top 10 smallest logprobs (sorted) (y-axis) of step 1 ...

**Figure 9.** Energy Wastage Reduction % (y-axis) vs Task Utility ...

**Figure 10.** AgentStop’s average test AUC-ROC (with 95% CI) when deployed at fixed steps for mini-swe-agent with Qwen3-Coder-30B-A3B on SWE-Bench Verified. [Plot text extracted alongside this caption: panels (a) AgentStop and (b) Baselines (random, min_logprob, mean_logprob); axes Energy Wastage Reduction % vs Task Utility Drop %; series Step 1 to Step 10.]

**Figure 11.** Energy Wastage Reduction % (y-axis) vs Task Util...

Original abstract

Autonomous agents powered by large language models (LLMs) are increasingly used to automate complex, multi-step tasks such as coding or web-based question answering. While remote, cloud-based agents offer scalability and ease of deployment, they raise privacy concerns, depend on network connectivity, and incur recurring API costs. Deploying agents locally on user devices mitigates these issues by preserving data privacy and eliminating usage-based fees. However, agentic workflows are far more resource-intensive than typical LLM interactions. Iterative reasoning, tool use, and failure retries substantially increase token consumption, often expending significant compute without successfully completing tasks. In this work, we investigate the time, token, and energy overhead of locally deployed LLM-based agents on consumer hardware. Our measurements show that agentic execution increases GPU power draw, temperature, and battery drain compared to single-inference workloads. To address this inefficiency, we introduce AgentStop, a lightweight efficiency supervisor that predicts and preemptively terminates trajectories unlikely to succeed. Leveraging low-cost execution signals, such as token-level log probabilities, AgentStop can reduce wasted energy by 15-20% with minimal impact on task performance (<5% utility drop) for challenging web-based question answering and coding benchmarks. These findings position predictive early termination as a practical mechanism for enabling sustainable, privacy-preserving LLM agents on user devices. Our project code and data are available at https://github.com/brave-experiments/AgentStop.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AgentStop shows you can trim 15-20% energy from local LLM agents by early termination on log-prob signals, but the predictor's accuracy on exploratory vs. failing paths is not yet well supported.

The paper's main takeaway is straightforward: local agents waste a lot of energy on failed or looping trajectories, and a lightweight monitor using token log probabilities can cut that waste by 15-20% while keeping utility drop under 5% on web QA and coding benchmarks. They back this with direct measurements of GPU power, temperature, and battery impact on consumer hardware, which is more concrete than most simulation-only efficiency papers. Releasing code and data at the GitHub link is also helpful for anyone who wants to check the numbers or adapt the method. That combination of real-device data and open artifacts is the part that stands out as useful right now.

The framing for privacy-preserving local deployment is a fair motivation, and the overhead comparison to single-inference workloads makes the problem clear. The core idea itself is an application of early stopping rather than a new mechanism, but the specific use on multi-step agent traces with low-cost runtime signals is a reasonable incremental step.

The soft spot is exactly the one the stress-test note flags. Low log probabilities often appear during tool selection or partial reasoning that later succeeds, so the risk of false-positive terminations is real. The abstract gives headline savings and utility numbers but no precision-recall breakdown, no ablation across signals, and no count of how many ultimately successful runs were cut short. Without those details it is hard to judge whether the <5% utility bound holds or depends on careful threshold tuning for the particular benchmarks.

This work is aimed at people building or deploying agents on laptops and phones who need to manage heat and battery life. It is worth sending to peer review because the hardware measurements are grounded and the practical question matters, even if the current evidence for the stopping rule needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper measures the increased GPU power draw, temperature, and battery drain of locally run LLM agents on consumer hardware for multi-step tasks such as web QA and coding. It introduces AgentStop, a lightweight supervisor that uses low-cost runtime signals (e.g., token-level log probabilities) to predict unsuccessful trajectories and terminate them early, claiming 15-20% reduction in wasted energy with under 5% utility loss on the evaluated benchmarks. Code and data are released at the cited GitHub repository.

Significance. If the quantitative claims are substantiated, the work offers a practical route to lowering the energy footprint of on-device agents while preserving privacy and avoiding cloud costs. The release of code and data supports reproducibility and allows independent verification of the reported savings.

major comments (2)

[Abstract] The central claim of 15-20% energy reduction with <5% utility drop rests on the premise that token-level log probabilities (and similar signals) yield early-termination decisions with low false-positive rates on ultimately successful trajectories; however, no precision/recall figures, ablation on signal choice, or breakdown of terminated vs. completed successes are supplied to confirm this threshold is met.
[Methods] The manuscript states that agentic execution increases power draw relative to single-inference workloads, yet the methods section provides no details on the measurement apparatus (e.g., power meter model, sampling rate), hardware configuration, or statistical tests used to establish the reported increases.
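The precision/recall breakdown requested in the first major comment could be computed along these lines. This is a minimal sketch with hypothetical names and toy data. It treats "predicted to fail" as the positive class, so a false positive is a run that would have been terminated but actually succeeds.

```python
def termination_metrics(predicted_fail, actually_failed):
    """Score a stop/continue predictor against final task outcomes."""
    if len(predicted_fail) != len(actually_failed):
        raise ValueError("inputs must have the same length")
    tp = fp = fn = tn = 0
    for predicted, failed in zip(predicted_fail, actually_failed):
        if predicted and failed:
            tp += 1  # correctly flagged a failing run
        elif predicted and not failed:
            fp += 1  # would have cut short a successful run
        elif not predicted and failed:
            fn += 1  # let a failing run keep burning energy
        else:
            tn += 1  # correctly left a successful run alone

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
    }


# Toy labels, invented for illustration only.
predicted = [True, True, False, False, True, False]
failed = [True, False, True, False, True, False]
metrics = termination_metrics(predicted, failed)
print(metrics)
```

The false-positive rate is the quantity that bounds utility loss, while recall bounds how much wasted energy the supervisor can recover, so reporting both makes the headline trade-off checkable.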

minor comments (2)

[Abstract] The abstract refers to 'challenging web-based question answering and coding benchmarks' without naming the specific datasets or task suites; adding these names would improve clarity.
Consider including a short table or figure caption that explicitly lists the low-cost signals employed by AgentStop and their computational overhead relative to full inference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to incorporate additional details and analyses where appropriate.

Referee: [Abstract] The central claim of 15-20% energy reduction with <5% utility drop rests on the premise that token-level log probabilities (and similar signals) yield early-termination decisions with low false-positive rates on ultimately successful trajectories; however, no precision/recall figures, ablation on signal choice, or breakdown of terminated vs. completed successes are supplied to confirm this threshold is met.

Authors: We agree that precision/recall metrics, an ablation across signals, and a breakdown of terminated successful trajectories would more directly substantiate the low false-positive rate on ultimately successful runs. The revised manuscript adds these results to the Experiments section (new Table and Figure) and updates the abstract to reference the supporting metrics (false-positive rate below 4% on successful trajectories).
revision: yes

Referee: [Methods] The manuscript states that agentic execution increases power draw relative to single-inference workloads, yet the methods section provides no details on the measurement apparatus (e.g., power meter model, sampling rate), hardware configuration, or statistical tests used to establish the reported increases.

Authors: We thank the referee for noting this gap. The revised Methods section now contains a dedicated paragraph describing the power-measurement apparatus (Keysight N6705C DC power analyzer, 1 kHz sampling), the exact consumer hardware (NVIDIA RTX 4060 Laptop GPU, Intel i7-13700H, 32 GB RAM), battery-drain logging procedure, and the statistical tests (paired t-tests with reported p-values and effect sizes) used to confirm the increases relative to single-inference baselines.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurement of signal-based early termination

The paper describes an empirical supervisor (AgentStop) that applies observable runtime signals such as token-level log probabilities to decide early termination of agent trajectories. Energy savings (15-20%) and utility impact (<5% drop) are presented as measured results on web QA and coding benchmarks rather than quantities derived by construction from the termination rule. No equations, self-citations, or fitted parameters are shown that reduce the central claim to its own inputs. The derivation chain is self-contained against external benchmarks and code release.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on empirical measurements of agent overhead and the assumption that cheap runtime signals can be turned into a reliable early-stopping rule; no new physical entities or complex axioms are introduced.

free parameters (1)

termination threshold or decision rule
A cutoff or model parameter that balances energy savings against utility loss must be chosen or tuned on observed trajectories.
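One way such a cutoff might be tuned is a sweep that keeps the largest wasted-energy reduction within a utility budget. The sketch below is only an illustration: the run fields, the 5% budget default, and the definitions of "wasted energy" (all energy spent by failing runs) and "utility drop" (share of successes lost) are assumptions, not the paper's definitions.

```python
def sweep_threshold(runs, thresholds, utility_budget_pct=5.0):
    """Return (threshold, reduction_pct, drop_pct) for the best cutoff, or None.

    Each run is a dict with hypothetical keys:
      score         - supervisor score at the decision step (lower = riskier)
      energy_before - joules spent up to the decision step
      energy_after  - joules the run would spend after it
      success       - whether the run succeeds if left alone
    """
    wasted = sum(r["energy_before"] + r["energy_after"] for r in runs if not r["success"])
    successes = sum(1 for r in runs if r["success"])
    best = None
    for t in thresholds:
        stopped = [r for r in runs if r["score"] < t]
        # Only energy after the decision step can be saved.
        saved = sum(r["energy_after"] for r in stopped if not r["success"])
        lost = sum(1 for r in stopped if r["success"])
        reduction = 100.0 * saved / wasted if wasted else 0.0
        drop = 100.0 * lost / successes if successes else 0.0
        if drop <= utility_budget_pct and (best is None or reduction > best[1]):
            best = (t, reduction, drop)
    return best


# Toy trajectories, invented for illustration only.
runs = [
    {"score": -9.0, "energy_before": 10.0, "energy_after": 40.0, "success": False},
    {"score": -7.0, "energy_before": 10.0, "energy_after": 30.0, "success": False},
    {"score": -6.0, "energy_before": 10.0, "energy_after": 20.0, "success": True},
    {"score": -2.0, "energy_before": 10.0, "energy_after": 20.0, "success": True},
    {"score": -1.0, "energy_before": 10.0, "energy_after": 10.0, "success": False},
]
best = sweep_threshold(runs, thresholds=[-8.0, -6.5, -5.0])
print(best)
```

In the toy data the middle cutoff wins: it stops both high-energy failures without touching a successful run, while the loosest cutoff would also halt a success and exceed the budget. Tuning on held-out trajectories rather than the evaluation set would be needed for the reported numbers to generalise.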

axioms (1)

domain assumption: Agentic workflows incur substantially higher token consumption and power draw than single-inference LLM calls due to iteration, tool use, and retries.
Stated directly in the abstract as the motivation for the overhead measurements.

pith-pipeline@v0.9.0 · 5804 in / 1263 out tokens · 45209 ms · 2026-05-19T16:31:23.737876+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Leveraging low-cost execution signals such as token-level log probabilities, AgentStop can reduce wasted energy by 15-20% with minimal impact on task performance (<5% utility drop)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Anthropic. 2026. Claude API Pricing. https://platform.claude.com/docs/en/about- claude/pricing. Accessed: 2026-02-28

work page 2026

[2] [2]

Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, and Pavlo Molchanov. 2025. Small Lan- guage Models are the Future of Agentic AI. arXiv:2506.02153 [cs.AI] https: //arxiv.org/abs/2506.02153

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(San Francisco, California, USA)(KDD ’16). Association for Computing Machinery, New York, NY, USA, 785–794. doi:10. 1145/2939672.2939785

work page arXiv 2016

[4] [4]

Cooper Elsworth, Keguo Huang, David Patterson, Ian Schneider, Robert Sedivy, Savannah Goodman, Ben Townsend, Parthasarathy Ranganathan, Jeff Dean, Amin Vahdat, Ben Gomes, and James Manyika. 2025. Measuring the envi- ronmental impact of delivering AI at Google Scale. arXiv:2508.15734 [cs.AI] https://arxiv.org/abs/2508.15734

work page arXiv 2025

[5] [5]

ggml.ai. n.d.. Llama.cpp. https://github.com/ggml-org/llama.cpp

work page

[6] Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. 2022. Why do tree-based models still outperform deep learning on typical tabular data? In Proceedings of the 36th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’22). Curran Associates Inc., Red Hook, NY, USA, Article 37, 14 pages

[7] Neha Gupta, Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Aditya Krishna Menon, and Sanjiv Kumar. 2024. Language Model Cascades: Token-Level Uncertainty And Beyond. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=KgaBScZ4VI

[8] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues? In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=VTF8yNQM66

[9] Kleomenis Katevas, Ioannis Arapakis, and Martin Pielot. 2018. Typical phone use habits: Intense use does not predict negative well-being. In Proceedings of the 20th international conference on human-computer interaction with mobile devices and services. 1–13

[10] Jiin Kim, Byeongjun Shin, Jinha Chung, and Minsoo Rhu. 2025. The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective. arXiv preprint arXiv:2506.04301 (2025)

[11] Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, and Manaal Faruqui. 2025. Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technol...

[12] Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, and Hamed Haddadi. 2024. MELTing Point: Mobile Evaluation of Language Transformers. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking (Washington D.C., DC, USA) (ACM MobiCom ’24). Association for Computing Machinery, New York, NY, USA, 890–907. doi:10.1145/3636534.3690668

[14] Yilong Li, Jingyu Liu, Hao Zhang, M Badri Narayanan, Utkarsh Sharma, Shuai Zhang, Yijing Zeng, Jayaram Raghuram, and Suman Banerjee. 2025. PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=xzSUdw6s76

[15] Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. 2024. Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models. Transactions on Machine Learning Research (2024). https://openreview.net/forum?id=DWkJCSxKU5

[16] Qingyu Lu, Liang Ding, Siyi Cao, Xuebo Liu, Kanjian Zhang, Jinxia Zhang, and Dacheng Tao. 2025. Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments. In Findings of the Association for Computational Linguistics: EMNLP 2025, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, a...

[17] Zhenyan Lu, Xiang Li, Dongqi Cai, Rongjie Yi, Fangming Liu, Wei Liu, Jian Luan, Xiwen Zhang, Nicholas D. Lane, and Mengwei Xu. 2025. Demystifying Small Language Models for Edge Deployment. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and M...

[18] Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C., Ganesh Ramakrishnan, Micah Goldblum, and Colin White. 2023. When do neural nets outperform boosted trees on tabular data? In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’23). Curran Associates Inc., Red Hook, NY...

[19] OpenAI. 2026. OpenAI API Pricing. https://openai.com/api/pricing/. Accessed: 2026-02-28

[20] David Patterson, Jeffrey M. Gilbert, Marco Gruteser, Efren Robles, Krishna Sekar, Yong Wei, and Tenghui Zhu. 2024. Energy and Emissions of Machine Learning on Smartphones vs. the Cloud. Commun. ACM 67, 2 (Jan. 2024), 86–97. doi:10.1145/3624719

[21] Qwen. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388

[22] Aymeric Roucher, Albert Villanova del Moral, Thomas Wolf, Leandro von Werra, and Erik Kaunismäki. 2025. `smolagents`: a smol library to build great agentic systems. https://github.com/huggingface/smolagents

[23] Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya, Adrian Gamarra Lafuente, Medhya Goel, Rebecca Joseph, Shlok Natarajan, Etash Kumar Guha, Shang Zhu, Ben Athiwaratkun, John Hennessy, Azalia Mirhoseini, and Christopher Ré. 2025. Intelligence per Watt: Measuring Intelligence Efficiency of Local AI. arXiv:2511.07885 [...

[24] Samara OS Santos, Agustina Skiarski, Daniel García-Núñez, Victor Lazzarini, Rafael De Andrade Moral, Edgar Galvan, André LC Ottoni, and Erivelton Nepomuceno. 2024. Green machine learning: Analysing the energy efficiency of machine learning models. In 2024 35th Irish Signals and Systems Conference (ISSC). IEEE, 1–6

[25] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. 2017. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. https...

[26] Tianxiang Sun, Xiangyang Liu, Wei Zhu, Zhichao Geng, Lingling Wu, Yilong He, Yuan Ni, Guotong Xie, Xuanjing Huang, and Xipeng Qiu. 2022. A Simple Hash-Based Early Exiting Approach For Language Understanding and Generation. In Findings of the Association for Computational Linguistics: ACL 2022, Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds...

[27] Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. 2024. Executable code actions elicit better LLM agents. In Proceedings of the 41st International Conference on Machine Learning (Vienna, Austria) (ICML ’24). JMLR.org, Article 2054, 25 pages

[28] Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. 2024. Measuring short-form factuality in large language models. arXiv:2411.04368 [cs.CL] https://arxiv.org/abs/2411.04368

[29] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https://arxiv.org/abs/2405.15793

[30] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=WE_vluYUL-X

[31] Caglar Yildirim and Ana-Paula Correia. 2015. Exploring the dimensions of nomophobia: Development and validation of a self-reported questionnaire. Computers in human behavior 49 (2015), 130–137

[32] Murong Yue, Jie Zhao, Min Zhang, Liang Du, and Ziyu Yao. 2024. Large Language Model Cascades with Mixture of Thought Representations for Cost-Efficient Reasoning. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=6okaSfANzh

[33] Michael J. Zellinger, Rex Liu, and Matt Thomson. 2025. Cost-Saving LLM Cascades with Early Abstention. arXiv:2502.09054 [cs.AI] https://arxiv.org/abs/2502.09054

[34] Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, and Furu Wei. 2020. BERT loses patience: fast and robust inference with early exit. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS ’20). Curran Associates Inc., Red Hook, NY, USA, Article 1539, 12 pages

[35] Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy K Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Antony Kellermann, Jasjeet S Sekhon, Jacob Steinhardt, Sarah Schwettmann, Arvind Narayanan, Matei Zaharia, Ion Stoi...

1. Always provide a 'Thought:' sequence, and a '<code>' sequence ending with '</code>', else you will fail
2. Use only variables that you have defined!
3. Always use the right arguments for the tools. DO NOT pass the arguments as a dict as in 'answer = wikipedia_search({'query': "What is the place where James Bond lives?"})', but use the arguments directly as in 'answer = wikipedia_search(query="What is the place where James Bond lives?")'
4. Take care to not chain too many sequential tool calls in the same code block, especially when the output format is unpredictable. For instance, a call to wikipedia_search has an unpredictable return format, so do not have another tool call that depends on its output in the same block: rather output results with print() to use them in the next block
5. Call a tool only when needed, and never re-do a tool call that you previously did with the exact same parameters
6. Don't name any new variable with the same name as a tool: for instance don't name a variable 'final_answer'
7. Never create any notional variables in our code, as having these in your logs will derail you from the true variables
8. You can use imports in your code, but only from the following list of modules: {{authorized_imports}}
9. The state persists between code executions: so if in one step you've created variables or imported modules, these will all persist.

Now Begin!

System Prompt for Coding Agent (adapted from mini-swe-agent v1.17.5)

You are a helpful assistant that can interact multiple times with a computer shell to solve programming tasks. Your response must contain exactly...

1. Include a THOUGHT section explaining your reasoning and what you're trying to accomplish
2. Provide exactly ONE bash command to execute

## Important Boundaries
- MODIFY: Regular source code files in /testbed (this is the working directory for all your subsequent commands)
- DO NOT MODIFY: Tests, configuration files (pyproject.toml, setup.cfg, etc.)

## Recommended Workflow
1. Analyze the codebase by finding and reading relevant files
2. Create a script to reproduce the issue
3. Edit the source code to resolve the issue
4. Verify your fix works by running your script again
5. Test edge cases to ensure your fix is robust

## Command Execution Rules
You are operating in an environment where
1. You write a single command
2. The system executes that command in a subshell
3. You write your next command

Each response should include:
1. A **THOUGHT** section where you explain your reasoning and plan
2. A single bash code block with your command

Format your responses like this:

<format_example>
THOUGHT: Here I explain my reasoning process, analysis of the current situation, and what I'm trying to accomplish with the command below.

```bash
your_command_here
```
</format_example>

Commands must be specified in a single bash code block: ```bash your_command_...

1. Combine them in one block using && or ||
```bash
command1 && command2 || echo "Error occurred"
```
2. Wait for the first command to complete, see its output, then issue the next command in your following response.

## Environment Details
- You have a full Linux shell environment
- Always use non-interactive flags (-y, -f) for commands
- Avoid interactive tools like vi, nano, or any that require user input
- If a command isn't available, you can install it ...