Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

Greg Durrett; Nicholas Tomlin; Wenxuan Ding

arxiv: 2602.16699 · v3 · pith:XXUS75CFnew · submitted 2026-02-18 · 💻 cs.CL · cs.AI

Wenxuan Ding, Nicholas Tomlin, Greg Durrett

Pith reviewed 2026-05-21 12:30 UTC · model grok-4.3

keywords: LLM agents, cost-aware exploration, sequential decision making, cost-uncertainty tradeoffs, latent environment state, Calibrate-Then-Act, retrieval-augmented QA, file reading tasks

The pith

LLM agents perform better when first given an inferred prior on hidden environment state so they can explicitly weigh cost-uncertainty tradeoffs before acting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes retrieval-augmented QA and file-reading coding tasks as sequential decisions under uncertainty, where each has a latent environment state that affects performance. It introduces the Calibrate-Then-Act framework that supplies the agent with an inferred prior over this state, enabling explicit reasoning about when exploration costs justify continued uncertainty. This prior qualitatively shifts agent behavior toward more environment-sensitive strategies that standard reinforcement learning does not produce. Experiments on synthetic tasks, QA, and coding show agents discover better stopping and commitment points once cost-benefit considerations are made explicit.

Core claim

We formalize multiple tasks, including retrieval-augmented QA and a file reading coding task, as sequential decision-making problems under uncertainty. Each problem has latent environment state that impacts the agent's performance. We introduce a framework called Calibrate-Then-Act (CTA), where we pass the agent an inferred prior about this environment state to enable it to act more optimally. This information qualitatively changes agent behavior, and adds environment sensitivity to the agent which is not learned via standard RL training. Our results on a synthetic task, QA, and file reading show that making cost-benefit tradeoffs explicit with CTA helps agents discover more optimal decision-making strategies.

What carries the argument

The Calibrate-Then-Act (CTA) framework that passes an inferred prior over latent environment state to the LLM agent so it can explicitly reason about cost-uncertainty tradeoffs.
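As one way to picture the prior-passing step, the sketch below renders an estimated prior as plain text that could be prepended to a task prompt. The function names `render_prior_block` and `build_prompt` and the wording of the rendered text are hypothetical, chosen only for illustration; the paper's actual prompt templates are the ones shown in its figures, and this is not a reproduction of them.

```python
from typing import Mapping


def render_prior_block(prior: Mapping[str, float], discount: float) -> str:
    """Format an estimated prior and a per-step discount as prompt text.

    Hypothetical helper: the output wording is an assumption, not the
    paper's template.
    """
    total = sum(prior.values())
    if total <= 0:
        raise ValueError("prior must carry positive probability mass")
    lines = ["Estimated likelihoods for the hidden state:"]
    # List hypotheses from most to least likely, normalized to sum to 1.
    for name, mass in sorted(prior.items(), key=lambda kv: kv[1], reverse=True):
        lines.append(f"- {name}: {mass / total:.2f}")
    lines.append(
        f"Each verification step multiplies the final reward by {discount:.2f}."
    )
    return "\n".join(lines)


def build_prompt(task: str, prior: Mapping[str, float], discount: float) -> str:
    """Append the rendered prior block to the task description."""
    return f"{task}\n\n{render_prior_block(prior, discount)}"


if __name__ == "__main__":
    # Numbers borrowed from the 3-bag Pandora's Box instance in the figure captions.
    print(build_prompt("Find the bag with the prize.",
                       {"A": 0.04, "B": 0.68, "C": 0.28}, 0.2))
```

Keeping the prior in a separate, clearly labeled block is a design choice of this sketch: it makes it easy to run the same task with and without the prior, which is the comparison the paper's experiments turn on.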

If this is right

Agents stop exploring and commit to answers at points that better balance immediate costs against remaining uncertainty.
Task performance improves on problems that require gathering information before final output.
Decision policies become sensitive to the specific statistical properties of the latent environment state.
Useful strategies emerge without requiring additional reinforcement learning on the target tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same prior-passing step could be applied to other interactive settings such as web navigation or tool use where hidden state affects action costs.
If the prior can be estimated from limited interaction data, the approach might support rapid adaptation when environments change.
The results suggest that current LLM training leaves a gap in cost-sensitive reasoning that explicit calibration can address without retraining the model.

Load-bearing premise

That an inferred prior about the latent state can be presented to the LLM in a form that produces qualitatively different exploration and commitment behavior than is possible through standard prompting or reinforcement learning alone.

What would settle it

Agents that receive the inferred prior perform no better than agents without it on the QA or file-reading tasks, or exhibit identical exploration patterns and stopping rules.

Figures

Figures reproduced from arXiv: 2602.16699 by Greg Durrett, Nicholas Tomlin, Wenxuan Ding.

**Figure 1.** Given the same task, a coding agent may either carefully verify assumptions via intermediate checks (right) or attempt a direct solution as soon as possible (left). The optimal choice depends on uncertainty and specific cost constraints. Calibrate-Then-Act (CTA) materializes this information for better decision-making.

**Figure 2.** Standard agentic decision loop (left) and proposed method CTA with estimated priors (right). In CTA, we learn a prior estimator from training data and condition the agent on the estimated p̂ at inference and/or training time, inducing more optimal decision making through explicit reasoning over prior probabilities.

**Figure 3.** Model’s retrieval decision with respect to its confidence level kda and retrieval discount factor γ. Each dot corresponds to one question: green indicates the model directly answers, and red indicates it retrieves. The dashed line marks the oracle threshold: red region retrieves, green region directly answers. Models with calibrated priors closely align with the oracle decision rule, exhibiting more cost…

**Figure 4.** Action pattern distribution for prompting and RL-trained agents, with and without calibrated priors, across relative cost parameters ρ. Each stacked bar shows the proportion of decision traces corresponding to different action patterns, with the reward R labeled above. Annotated percentages indicate the fraction of tasks where the agent attempts code execution before any unit tests.

**Figure 5.** Pareto frontier of average reward under varying costs. Static strategies (test-first or code-first) achieve high reward only in limited regimes, whereas CTA-RL with estimated priors consistently attains Pareto-optimal performance across cost settings.

**Figure 6.** Example interaction trace on a 3-bag Pandora’s Box instance with priors (0.04, 0.68, 0.28) and discount factor γ = 0.2 with thinking mode disabled. In this setting, the model explores all bags before committing and follows a suboptimal verification order, rather than prioritizing the highest-probability option.

**Figure 7.** Example interaction trace on a 3-bag Pandora’s Box instance with priors (0.04, 0.68, 0.28) and discount factor γ = 0.2, where the model is not given access to the prior probabilities. In this setting, the model implicitly treats the bags as equally likely and follows a suboptimal strategy that deviates from the optimal policy.

**Figure 8.** Example model reasoning trace on a 3-bag Pandora’s Box instance with priors (0.04, 0.68, 0.28) and discount factor γ = 0.2. The model explicitly compares the expected value of immediate guessing versus verification and then chooses to guess B immediately, which is the optimal strategy in this case. Key reasoning steps, including the explicit comparison between action value and exploration cost, are highlig…

**Figure 9.** Prompt template for Pandora’s Box setting.

**Figure 10.** Prompt templates for QA.

**Figure 11.** System prompt for CODE.

**Figure 12.** Continuation of the system prompt for CODE.

**Figure 13.** Instruction prompt template specifying the CSV task, reward parameters, and constraints provided to the agent.

**Figure 14.** Instruction prompt template with estimated CSV format likelihoods, enabling the agent to use probabilistic defaults when trading off unit tests, code execution, and early commitment.

**Figure 15.** Example reasoning trace of an RL-trained model without explicit prior conditioning in the CSV exploration task. Despite operating under the same high relative code cost setting (ρ = 4.0), the model defaults to verification-first behavior based on surface cues (e.g., file extension) and does not explicitly reason about uncertainty or cost trade-offs, illustrating a lack of adaptive decision-making compared…

**Figure 16.** Example reasoning trace of the CTA-RL model on the CODE task (ρ = 4.0), illustrating cost-aware trade-offs between unit tests and code execution under a high relative code cost setting, while jointly reasoning about format uncertainty.

Original abstract

LLM agents are deployed in environments where they must interact to acquire information. In these scenarios, the agent must reason about inherent cost-uncertainty tradeoffs in how to act, such as when to stop exploring and commit to an answer. For instance, on a programming task, an agent might run the code it generates, or it might generate tests for that code snippet; the cost of writing and running a test is nonzero, but typically lower than the cost of running buggy code. In this work, we show that we can induce LLM agents to explicitly reason about balancing these cost-uncertainty tradeoffs, then act more optimally in their environments. We formalize multiple tasks, including retrieval-augmented QA and a file reading coding task, as sequential decision-making problems under uncertainty. Each problem has latent environment state that impacts the agent's performance. We introduce a framework called Calibrate-Then-Act (CTA), where we pass the agent an inferred prior about this environment state to enable it to act more optimally. This information qualitatively changes agent behavior, and adds environment sensitivity to the agent which is not learned via standard RL training. Our results on a synthetic task, QA, and file reading show that making cost-benefit tradeoffs explicit with CTA helps agents discover more optimal decision-making strategies.
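The stop-or-explore tradeoff described in the abstract can be made concrete with a one-step bound. The sketch below assumes a reward model in which a correct guess earns 1 and every verification step taken beforehand multiplies the reward by a discount factor γ; the symbols p_max (the largest prior probability) and R (the final reward) are introduced here for illustration and are not the paper's notation.

```latex
\begin{align*}
\mathbb{E}[R \mid \text{guess now}] &= p_{\max} \cdot 1 = p_{\max}, \\
\mathbb{E}[R \mid \text{verify first}] &\le \gamma \cdot 1 = \gamma
  \quad \text{(any later guess is discounted at least once)}, \\
p_{\max} > \gamma \;&\Longrightarrow\;
  \mathbb{E}[R \mid \text{guess now}] > \mathbb{E}[R \mid \text{verify first}].
\end{align*}
```

For the 3-bag instance in the figure captions, with priors (0.04, 0.68, 0.28) and γ = 0.2, this gives p_max = 0.68 > 0.2, so guessing the most likely bag immediately beats any plan that starts with a verification. The bound only says when committing is clearly better; when p_max falls below γ, the value of verifying depends on what the remaining steps can still earn.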

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CTA gives LLM agents a clean way to inject an inferred prior on latent costs before acting, but the 'not achievable via standard RL' claim lacks the direct baseline comparison it needs.

The authors formalize tasks like retrieval QA and file-reading coding as POMDPs with hidden state that affects performance. They then run a separate calibration step to infer a prior over that state and hand it to the LLM so the agent can explicitly weigh exploration costs against uncertainty. This separation is the main new piece: it lets the model reason about real tradeoffs without retraining the whole policy. The synthetic task plus the two applied ones are reasonable choices for showing the effect, and the examples of cheaper test runs before committing to code make the practical angle clear. The formalization stays simple and avoids self-referential loops.

The soft spot is the RL comparison. The paper states that the added sensitivity is not learned via standard RL training, yet the reported experiments do not include an RL-trained agent on the same environments as a control. Without that, it is possible the reward signal alone could produce similar behavior, so the central distinction stays untested rather than demonstrated. The quantitative results are mentioned but not detailed enough in the abstract to judge effect size or variance.

This is for people building LLM agents that interact with costly environments such as coding tools or retrieval systems. It deserves peer review because the core framing is practical and the tasks line up with real use cases, even if the RL baseline needs to be added to strengthen the main claim.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Calibrate-Then-Act (CTA), a framework that formalizes tasks such as retrieval-augmented QA and file-reading coding as POMDPs with latent environment states, infers a prior over that state, and passes it to an LLM agent to induce explicit reasoning about cost-uncertainty tradeoffs. The central claim is that this produces more optimal decision-making strategies on a synthetic task, QA, and file reading, and supplies environment sensitivity that cannot be achieved through standard RL training.

Significance. If the empirical claims are substantiated, the work could provide a lightweight way to equip LLM agents with explicit cost-benefit calibration in uncertain interactive settings, offering a potential complement to pure RL for applications where exploration costs matter.

major comments (1)

Abstract: the assertion that CTA 'adds environment sensitivity to the agent which is not learned via standard RL training' is load-bearing for the paper's novelty claim yet unsupported. No RL baseline (PPO, REINFORCE, or equivalent) trained on the identical synthetic, QA, or file-reading environments is reported, so it is impossible to determine whether the observed cost-sensitive behaviors are unique to CTA or could be acquired from reward signals alone.

minor comments (2)

The abstract states that results 'on a synthetic task, QA, and file reading' show improved decision-making strategies, but it supplies no quantitative metrics, error bars, or baseline comparisons, which prevents assessment of effect size or statistical reliability.
Clarify the precise format in which the inferred prior is communicated to the LLM and the exact prompting mechanism used to elicit the cost-benefit reasoning.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We provide a point-by-point response to the major comment below.

Referee: Abstract: the assertion that CTA 'adds environment sensitivity to the agent which is not learned via standard RL training' is load-bearing for the paper's novelty claim yet unsupported. No RL baseline (PPO, REINFORCE, or equivalent) trained on the identical synthetic, QA, or file-reading environments is reported, so it is impossible to determine whether the observed cost-sensitive behaviors are unique to CTA or could be acquired from reward signals alone.

Authors: We thank the referee for this observation. The claim in the abstract is intended to highlight that CTA enables explicit reasoning about environment-specific uncertainties by supplying a prior, which standard prompting or implicit RL optimization does not directly provide. Our empirical results on the synthetic task, QA, and file-reading demonstrate that agents using CTA exhibit cost-sensitive strategies that are absent in baseline LLM agents without the prior. We posit that acquiring similar sensitivity through standard RL would require substantial environment-specific training data and interactions, which is not the case for our zero-shot prior-based method. Nevertheless, we recognize that including RL baselines would provide stronger evidence. We will therefore revise the abstract to more precisely state that CTA induces environment sensitivity via explicit priors in a way that complements rather than replaces RL training, and we will expand the related work and discussion sections to elaborate on the distinctions from RL approaches. This revision will be incorporated in the next version of the manuscript.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in CTA derivation

The paper formalizes tasks as POMDPs with latent environment state, then defines the CTA framework as passing an inferred prior to the LLM agent to induce explicit cost-uncertainty reasoning. This construction is presented as an external intervention rather than a self-referential loop; no equations reduce the claimed environment sensitivity to fitted parameters or prior outputs by definition. Empirical results on synthetic, QA, and file-reading tasks are reported as validation, with no load-bearing self-citations or ansatz smuggling identified in the derivation chain. The central claim of qualitative change beyond standard RL remains an empirical assertion rather than a definitional identity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLM agents can productively use an externally supplied prior to reason about costs; no free parameters or new physical entities are introduced in the abstract description.

axioms (1)

Domain assumption: LLM agents can use a provided prior on latent environment state to reason about cost-uncertainty tradeoffs and change behavior beyond what standard RL training achieves.
This premise is required for the CTA framework to produce the claimed qualitative change in agent behavior.

pith-pipeline@v0.9.0 · 5762 in / 1264 out tokens · 50148 ms · 2026-05-21T12:30:32.773876+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We formalize multiple tasks... as sequential decision-making problems under uncertainty... pass the agent an inferred prior about this environment state"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "the optimal policy proceeds as follows. Boxes are verified in decreasing order of prior probability. A box is committed to if its posterior probability is greater than γ"
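The rule quoted in the last passage (verify boxes in decreasing order of prior probability, commit once the posterior exceeds γ) can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `run_threshold_policy` is hypothetical, exactly one box is assumed to hold the prize, verification results are assumed perfectly reliable, and the reward is assumed to be γ raised to the number of verifications for a correct guess and 0 otherwise.

```python
def run_threshold_policy(priors, gamma, prize_index):
    """Follow the quoted threshold rule on one Pandora's Box instance.

    Returns (actions, reward). Assumed reward model: gamma ** t for a
    correct guess made after t verifications, 0.0 for a wrong guess.
    """
    if any(p <= 0 for p in priors):
        raise ValueError("this sketch expects strictly positive priors")
    remaining = dict(enumerate(priors))  # box index -> unnormalized mass
    actions, t = [], 0
    while True:
        # Most likely box among those not yet ruled out.
        best = max(remaining, key=remaining.get)
        posterior = remaining[best] / sum(remaining.values())
        # Commit when the posterior clears the threshold or one box is left.
        if posterior > gamma or len(remaining) == 1:
            actions.append(("guess", best))
            return actions, (gamma ** t if best == prize_index else 0.0)
        actions.append(("verify", best))
        t += 1
        if best == prize_index:
            # A YES result removes all uncertainty, so guess the verified box.
            actions.append(("guess", best))
            return actions, gamma ** t
        del remaining[best]  # A NO result rules this box out.


if __name__ == "__main__":
    # Instance from the figure captions: priors (0.04, 0.68, 0.28), gamma = 0.2.
    print(run_threshold_policy((0.04, 0.68, 0.28), 0.2, prize_index=1))
```

On the instance from the figure captions the sketch commits to the second box (index 1) without verifying anything, because 0.68 already exceeds γ = 0.2. With a milder discount such as γ = 0.9 and flatter priors, the same rule verifies boxes one at a time and renormalizes after each NO result.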

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Exploration and Exploitation Errors Are Measurable for Language Model Agents
cs.AI 2026-04 unverdicted novelty 7.0

A policy-agnostic metric and controllable 2D grid environments with task DAGs enable measurement of exploration and exploitation errors in language model agents from observed actions.

QuantClaw: Precision Where It Matters for OpenClaw
cs.AI 2026-04 unverdicted novelty 6.0

QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 2 Pith papers · 4 internal anchors

[1]

org/CorpusID:281705844

URL https://api.semanticscholar. org/CorpusID:281705844. Agarwal, D., Majumder, B. P., Adamson, R., Chakravorty, M., Gavireddy, S. R., Parashar, A., Surana, H., Mishra, B. D., McCallum, A., Sabharwal, A., et al. Open- ended Scientific Discovery via Bayesian Surprise.arXiv preprint arXiv:2507.00310, 2025. Bajaj, P., Campos, D., Craswell, N., Deng, L., Gao,...

work page doi:10.1006/jcss.2002 2025
[2]

ISBN 979-8-89176-256-5

URL https://www.sciencedirect.com/ science/article/pii/S0022000002918283. Chen, S., Chen, X., Huang, Y ., Xie, R., and Dhingra, B. When greedy wins: Emergent exploitation bias in meta-bandit llm training.ArXiv, abs/2509.24923, 2025a. URL https://api.semanticscholar. org/CorpusID:281674231. Chen, W., Yuan, J., Qian, C., Yang, C., Liu, Z., and Sun, M. Optim...

work page doi:10.18653/v1/2025.findings-acl 2025
[3]

findings-acl.601/

URL https://aclanthology.org/2025. findings-acl.601/. Choi, J., Bansal, M., and Stengel-Eskin, E. Language mod- els identify ambiguities and exploit loopholes. InPro- ceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 32991–33006, 2025. Cole, J. R., Zhang, M. J., Gillick, D., Eisenschlos, J. M., Dhingra, B., and Eisen...

work page 2025
[4]

Learning how hard to think: Input-adaptive allocation of lm computation.arXiv preprint arXiv:2410.04707,

URL https://openreview.net/forum? id=x2W2dKdNI8. Damani, M., Shenfeld, I., Peng, A., Bobu, A., and An- dreas, J. Learning how hard to think: Input-adaptive allocation of lm computation.ArXiv, abs/2410.04707,

work page arXiv
[5]

org/CorpusID:273186996

URL https://api.semanticscholar. org/CorpusID:273186996. Deng, M., Huang, L., Fan, Y ., Zhang, J., Ren, F., Bai, J., Yang, F., Miao, D., Yu, Z., Wu, Y ., Zhang, Y ., Teng, F., Wan, Y ., Hu, S., Li, Y ., Jin, X., Hu, C., Li, H., Fu, Q., Zhong, T., Wang, X., Tang, X., Tang, N., Wu, C., and Luo, Y . InteractComp: Evaluating Search Agents With Ambiguous Queri...

work page arXiv
[6]

Ellie Pavlick and Tom Kwiatkowski

URL https://api.semanticscholar. org/CorpusID:282401680. Desai, S. and Durrett, G. Calibration of pre-trained trans- formers. In Webber, B., Cohn, T., He, Y ., and Liu, Y . (eds.),Proceedings of the 2020 Conference on Empiri- cal Methods in Natural Language Processing (EMNLP), pp. 295–302, Online, November 2020. Association for Computational Linguistics. ...

work page doi:10.18653/v1/2020 2020
[7]

Elfleet, M

URL https://openreview.net/forum? id=2vDJiGUfhV. Elfleet, M. and Chollet, M. Investigating the Impact of Multimodal Feedback on User-Perceived Latency and Immersion with LLM-Powered Embodied Conver- sational Agents in Virtual Reality. InIVA, pp. 12:1– 12:9, 2024. URL https://doi.org/10.1145/ 3652988.3673965. Grand, G., Pepe, V ., Andreas, J., and Tenenbau...

work page arXiv 2024
[8]

GX-Chen, A., Lin, D., Samiei, M., Precup, D., Richards, B

URL https://openreview.net/forum? id=dIEeOwrmOe. GX-Chen, A., Lin, D., Samiei, M., Precup, D., Richards, B. A., Fergus, R., and Marino, K. Language agents mir- ror human causal reasoning biases. how can we help them think like scientists?ArXiv, abs/2505.09614,

work page arXiv
[9]

org/CorpusID:278602122

URL https://api.semanticscholar. org/CorpusID:278602122. Handa, K., Gal, Y ., Pavlick, E., Goodman, N., Andreas, J., Tamkin, A., and Li, B. Z. Bayesian preference elicitation with language models.arXiv preprint arXiv:2403.05534, 2024. Hennig, L., Tornede, T., and Lindauer, M. Towards lever- aging AutoML for sustainable deep learning: A multi- objective HP...

work page arXiv 2024
[10]

URL https://openreview.net/forum? id=jKN1pXi7b0. Jain, A. K., Gonzalez-Pumariega, G., Chen, W., Rush, A. M., Zhao, W., and Choudhury, S. Multi-turn code generation through single-step rewards. InForty- second International Conference on Machine Learning,

work page
[11]

10 Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents Ji, S

URL https://openreview.net/forum? id=aJeLhLcsh0. 10 Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents Ji, S. and Carin, L. Cost-sensitive feature acquisition and classification.Pattern Recognition, 40(5):1474–1485, 2007. K¨arkk¨ainen, K., Kachuee, M., Goldstein, O., and Sar- rafzadeh, M. Cost-sensitive feature-value acquisition using feature releva...

work page arXiv 2007
[12]

org/CorpusID:282064346

URL https://api.semanticscholar. org/CorpusID:282064346. Lalai, H. N., Shah, R. S., Pei, J., Varma, S., Wang, Y .-C., and Emami, A. The world according to LLMs: How geographic origin influences LLMs’ entity deduction ca- pabilities. InSecond Conference on Language Modeling,

work page
[13]

URL https://openreview.net/forum? id=hJtvCfDfs1. Li, B. Z., Kim, B., and Wang, Z. Questbench: Can LLMs ask the right question to acquire information in reasoning tasks? InThe Thirty-ninth Annual Conference on Neu- ral Information Processing Systems Datasets and Bench- marks Track, 2025. URL https://openreview. net/forum?id=gpwA9aZLTZ. Li, Y . and Oliva, J...

work page internal anchor Pith review doi:10.1162/tacl 2025
[14]

CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents

URL https://api.semanticscholar. org/CorpusID:283933928. Liu, J., Qian, C., Su, Z., Zong, Q., Huang, S., He, B., and Fung, Y . R. CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic En- vironments for LLM Tool-Use Agents.arXiv preprint arXiv:2511.02734, 2025. Mallen, A., Asai, A., Zhong, V ., Das, R., Khashabi, D., and Hajishi...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2023.acl-long.546 2025
[15]

emnlp-main.466/

URL https://aclanthology.org/2020. emnlp-main.466/. Mohri, C. and Hashimoto, T. Language models with confor- mal factuality guarantees. InProceedings of the 41st In- ternational Conference on Machine Learning, pp. 36029– 36047, 2024. Monea, G., Bosselut, A., Brantley, K., and Artzi, Y . LLMs Are In-Context Bandit Reinforcement Learners.arXiv preprint arXi...

work page doi:10.18653/v1/p18-1255 2020
[16]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

URL https://proceedings.neurips. 11 Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents cc/paper_files/paper/2023/file/ ef0164c1112f56246224af540857348f-Paper-Datasets_ and_Benchmarks.pdf. Shaikh, O., Mozannar, H., Bansal, G., Fourney, A., and Horvitz, E. Navigating rifts in human-LLM ground- ing: Study and benchmark. In Che, W., Nabende, J., Shutova...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2025.acl-long.1016 2023
[17]

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

doi: 10.18653/v1/2025.acl-long.887. URL https: //aclanthology.org/2025.acl-long.887/. Wang, Z., Wang, K., Wang, Q., Zhang, P., Li, L., Yang, Z., Yu, K., Nguyen, M. N., Liu, L., Got- tlieb, E., Lam, M., Lu, Y ., Cho, K., Wu, J., Li, F.- F., Wang, L., Choi, Y ., and Li, M. RAGEN: Un- derstanding Self-Evolution in LLM Agents via Multi- Turn Reinforcement Lea...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2025.acl-long.887 2025
[18]

org/CorpusID:259224900

URL https://api.semanticscholar. org/CorpusID:259224900. Wu, S., Galley, M., Peng, B., Cheng, H., Li, G., Dou, Y ., Cai, W., Zou, J., Leskovec, J., and Gao, J. CollabLLM: From passive responders to active collaborators. InForty- second International Conference on Machine Learning,

work page
[19]

Xiong, M., Hu, Z., Lu, X., LI, Y ., Fu, J., He, J., and Hooi, B

URL https://openreview.net/forum? id=DmH4HHVb3y. Xiong, M., Hu, Z., Lu, X., LI, Y ., Fu, J., He, J., and Hooi, B. Can LLMs express their uncertainty? an empirical evalu- ation of confidence elicitation in LLMs. InThe Twelfth International Conference on Learning Representations,

work page
[20]

Steering LLM reasoning through bias-only adaptation

URL https://openreview.net/forum? id=gjeQKFxFpZ. Xu, Y ., Chen, Z., and Wen, Z. EcoTune: Token- efficient multi-fidelity hyperparameter optimization for large language model inference. In Christodoulopou- los, C., Chakraborty, T., Rose, C., and Peng, V . (eds.),Proceedings of the 2025 Conference on Em- pirical Methods in Natural Language Processing, pp. 7...

work page doi:10.18653/v1/2025.emnlp-main 2025
[21]

emnlp-main.394/

URL https://aclanthology.org/2025. emnlp-main.394/. 12 Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents Yang, H., Yue, S., and He, Y . Auto-GPT for online decision making: Benchmarks and additional opinions.arXiv preprint arXiv:2306.02224, 2023. Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., and Press, O. SWE-agent: Age...

work page doi:10.52202/079017-1601 2025
[22]

Zadrozny, B

URL https://openreview.net/forum? id=WE_vluYUL-X. Zadrozny, B. and Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. InProceed- ings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 694–699, 2002. Zhang, M. J. and Choi, E. Clarify when necessary: Resolv- ing ambiguity thro...

work page doi:10.18653/v1/2025.findings-naacl 2002
[23]

findings-naacl.306/

URL https://aclanthology.org/2025. findings-naacl.306/. Zhang, M. J., Knox, W. B., and Choi, E. Modeling fu- ture conversation turns to teach LLMs to ask clarify- ing questions. InThe Thirteenth International Confer- ence on Learning Representations, 2025. URL https: //openreview.net/forum?id=cwuSAR7EKd. Zhou, L., Shi, J., Gao, J., and Wang, D. Credit-bud...

work page 2025
[24]

Then, I know B has the prize

The result is YES (probability 0.68). Then, I know B has the prize. So, I can guess B at t=1, which would give me a reward of 1∗0.2 1 = 0.2

work page
[25]

Then, B does not have the prize

The result is NO (probability 1−0.68 = 0.32 ). Then, B does not have the prize. Now, I have to decide whether to guess or verify another bag. But if I get NO on B, then the remaining bags are A and C. Their probabilities would be updated based on the fact that B is not the correct one. Wait, but the problem says that the bags are independent. Wait, actual...

work page
[26]

file.csv

UNIT TESTS Run unit tests to debug CSV formatting assumptions. Unit test outputs are perfectly reliable. Available unit tests: •test delimiter(path)→ {‘,’, ‘;’, ‘\t’} •test quotechar(path)→ {‘”’, ‘”} •test skiprows(path)→ {0, 1} Format (NO code fences): UNIT TESTS: test delimiter("file.csv"), test quotechar("file.csv") You may include multiple unit tests ...

work page
[27]

• Enclose code in‘‘‘python

CODE Write Python code toward solving the task using your current assumptions about the CSV format. • Enclose code in‘‘‘python ... ‘‘‘ • You may import pandas as pd and read the file with: pd.read csv(filepath, delimiter=..., quotechar=..., skiprows=...) • Do NOT print the entire CSV . • If your code computes the final result, print it to stdout so it can...


ANSWER
Provide the final answer to the task and end the conversation.
Format exactly: ANSWER: <your answer>
The conversation ends immediately after you provide ANSWER.
Reward:
• Let U be the total number of unit tests used.
• Let C be the total number of CODE actions taken.
• Final reward = correctness × (d_unit)^U × (d_code)^C.
• Discount factors represent cost mult...
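The reward rule in this prompt can be written out directly. The sketch below reads the garbled formula as correctness × d_unit^U × d_code^C, which is an interpretation of the extracted text; the discount values are placeholders and not the paper's settings, and final_reward is a hypothetical name.

```python
def final_reward(correct: bool, n_unit_tests: int, n_code_actions: int,
                 d_unit: float, d_code: float) -> float:
    """Final reward = correctness x d_unit^U x d_code^C (assumed reading of the prompt)."""
    return float(correct) * d_unit ** n_unit_tests * d_code ** n_code_actions


# Placeholder discount factors, chosen only for illustration.
D_UNIT, D_CODE = 0.9, 0.8

# Three unit tests, then one CODE action that gets the right answer.
careful = final_reward(True, 3, 1, D_UNIT, D_CODE)
# No unit tests, but a second CODE action was needed after a failed first try.
retry = final_reward(True, 0, 2, D_UNIT, D_CODE)
# A wrong answer earns nothing regardless of cost.
wrong = final_reward(False, 0, 1, D_UNIT, D_CODE)

print(careful, retry, wrong)
```

With these placeholder values the retry path scores higher than the careful path, but the ordering flips when d_code is small relative to d_unit, so the best strategy depends on the relative costs.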

