arxiv: 2602.16699 · v3 · pith:XXUS75CFnew · submitted 2026-02-18 · 💻 cs.CL · cs.AI

Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

Pith reviewed 2026-05-21 12:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM agents · cost-aware exploration · sequential decision making · cost-uncertainty tradeoffs · latent environment state · Calibrate-Then-Act · retrieval-augmented QA · file reading tasks

The pith

LLM agents perform better when first given an inferred prior on hidden environment state so they can explicitly weigh cost-uncertainty tradeoffs before acting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes retrieval-augmented QA and file-reading coding tasks as sequential decisions under uncertainty, where each has a latent environment state that affects performance. It introduces the Calibrate-Then-Act framework that supplies the agent with an inferred prior over this state, enabling explicit reasoning about when exploration costs justify continued uncertainty. This prior qualitatively shifts agent behavior toward more environment-sensitive strategies that standard reinforcement learning does not produce. Experiments on synthetic tasks, QA, and coding show agents discover better stopping and commitment points once cost-benefit considerations are made explicit.

Core claim

We formalize multiple tasks, including retrieval-augmented QA and a file reading coding task, as sequential decision-making problems under uncertainty. Each problem has latent environment state that impacts the agent's performance. We introduce a framework called Calibrate-Then-Act (CTA), where we pass the agent an inferred prior about this environment state to enable it to act more optimally. This information qualitatively changes agent behavior, and adds environment sensitivity to the agent which is not learned via standard RL training. Our results on a synthetic task, QA, and file reading show that making cost-benefit tradeoffs explicit with CTA helps agents discover more optimal decision-making strategies.

What carries the argument

The Calibrate-Then-Act (CTA) framework that passes an inferred prior over latent environment state to the LLM agent so it can explicitly reason about cost-uncertainty tradeoffs.
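A minimal sketch of what "passing an inferred prior" could look like in practice, assuming the prior is a set of per-field probability tables that is rendered into plain prompt text. The function names `render_prior` and `build_prompt`, the field names, and the prompt wording are all hypothetical; the paper's actual templates are the ones shown in its prompt figures.

```python
def render_prior(priors: dict[str, dict[str, float]]) -> str:
    """Format estimated probabilities over latent settings as prompt text.

    `priors` maps a latent field (hypothetical example: "delimiter") to a
    probability table over its possible values.
    """
    lines = ["Estimated format likelihoods:"]
    for field, dist in priors.items():
        total = sum(dist.values())
        if abs(total - 1.0) > 1e-6:
            raise ValueError(f"probabilities for {field!r} sum to {total}, expected 1")
        # List the most likely value first so the default is easy to spot.
        ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
        options = ", ".join(f"{value!r}: {p:.2f}" for value, p in ranked)
        lines.append(f"- {field}: {options}")
    return "\n".join(lines)


def build_prompt(task: str, priors: dict[str, dict[str, float]]) -> str:
    """Append the rendered prior to a task description (illustrative wording)."""
    return (
        f"{task}\n\n{render_prior(priors)}\n\n"
        "Weigh the cost of each check against the uncertainty it removes before acting."
    )


if __name__ == "__main__":
    example = {"delimiter": {",": 0.7, ";": 0.2, "\t": 0.1}}
    print(build_prompt("Report the minimum salary in the file.", example))
```

The design choice illustrated here is that the prior reaches the agent as readable text rather than as a change to model weights, which is what lets the same prompted or RL-trained agent be conditioned on it.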

If this is right

  • Agents stop exploring and commit to answers at points that better balance immediate costs against remaining uncertainty.
  • Task performance improves on problems that require gathering information before final output.
  • Decision policies become sensitive to the specific statistical properties of the latent environment state.
  • Useful strategies emerge without requiring additional reinforcement learning on the target tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prior-passing step could be applied to other interactive settings such as web navigation or tool use where hidden state affects action costs.
  • If the prior can be estimated from limited interaction data, the approach might support rapid adaptation when environments change.
  • The results suggest that current LLM training leaves a gap in cost-sensitive reasoning that explicit calibration can address without retraining the model.

Load-bearing premise

That an inferred prior about the latent state can be presented to the LLM in a form that produces qualitatively different exploration and commitment behavior than is possible through standard prompting or reinforcement learning alone.

What would settle it

Agents that receive the inferred prior perform no better than agents without it on the QA or file-reading tasks, or exhibit identical exploration patterns and stopping rules.

Figures

Figures reproduced from arXiv: 2602.16699 by Greg Durrett, Nicholas Tomlin, Wenxuan Ding.

Figure 1
Given the same task, a coding agent may either verify assumptions via intermediate checks carefully (right) or attempt a direct solution as soon as possible (left). The optimal choice depends on uncertainty and specific cost constraints. Calibrate-Then-Act (CTA) materializes this information for better decision-making.
Figure 2
Standard agentic decision loop (left) and proposed method CTA with estimated priors (right). In CTA, we learn a prior estimator from training data and condition the agent on estimated p̂ at inference and/or training time, inducing more optimal decision making through explicit reasoning over prior probabilities.
Figure 3
Model’s retrieval decision with respect to their confidence level kda and retrieval discount factor γ. Each dot corresponds to one question: green indicates the model directly answers, and red indicates it retrieves. The dashed line marks the oracle threshold: red region retrieves, green region directly answers. Models with calibrated priors closely align with the oracle decision rule, exhibiting more cost…
Figure 4
Action pattern distribution for prompting and RL-trained agents, with and without calibrated priors, across relative cost parameters ρ. Each stacked bar shows the proportion of decision traces corresponding to different action patterns, with the reward R labeled above. Annotated percentages indicate the fraction of tasks where the agent attempts code execution before any unit tests.
Figure 5
Pareto frontier of average reward under varying costs. Static strategies (test-first or code-first) achieve high reward only in limited regimes, whereas CTA-RL with estimated priors consistently attains Pareto-optimal performance across cost settings.
Figure 6
Example interaction trace on a 3-bag Pandora’s Box instance with priors (0.04, 0.68, 0.28) and discount factor γ = 0.2 with thinking mode disabled. In this setting, the model explores all bags before committing and follows a suboptimal verification order, rather than prioritizing the highest-probability option.
Figure 7
Example interaction trace on a 3-bag Pandora’s Box instance with priors (0.04, 0.68, 0.28) and discount factor γ = 0.2, where the model is not given access to the prior probabilities. In this setting, the model implicitly treats the bags as equally likely and follows a suboptimal strategy that deviates from the optimal policy.
Figure 8
Example model reasoning trace on a 3-bag Pandora’s Box instance with priors (0.04, 0.68, 0.28) and discount factor γ = 0.2. The model explicitly compares the expected value of immediate guessing versus verification and then chooses to guess B immediately, which is the optimal strategy in this case. Key reasoning steps, including the explicit comparison between action value and exploration cost, are highlig…
Figure 9
Prompt template for Pandora’s Box setting.
Figure 10
Prompt templates for QA.
Figure 11
System prompt for CODE.
Figure 12
Continuation of the system prompt for CODE.
Figure 13
Instruction prompt template specifying the CSV task, reward parameters, and constraints provided to the agent.
Figure 14
Instruction prompt template with estimated CSV format likelihoods, enabling the agent to use probabilistic defaults when trading off unit tests, code execution, and early commitment.
Figure 15
Example reasoning trace of an RL-trained model without explicit prior conditioning in the CSV exploration task. Despite operating under the same high relative code cost setting (ρ = 4.0), the model defaults to verification-first behavior based on surface cues (e.g., file extension) and does not explicitly reason about uncertainty or cost trade-offs, illustrating a lack of adaptive decision-making compared…
Figure 16
Example reasoning trace of the CTA-RL model on the CODE task (ρ = 4.0), illustrating cost-aware trade-offs between unit tests and code execution under a high relative code cost setting, while jointly reasoning about format uncertainty.
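
The Pandora's Box captions above describe a comparison between guessing immediately and paying to verify a bag first. The sketch below works that comparison out for the stated instance, priors (0.04, 0.68, 0.28) and γ = 0.2. It rests on assumptions that are not spelled out on this page: exactly one bag holds the prize, every prior is positive, verification is perfectly reliable, and a correct guess after t verifications earns γ^t. The function name `optimal_policy_value` is hypothetical.

```python
from functools import lru_cache


def optimal_policy_value(priors, gamma):
    """Return (expected reward, first action) under the optimal policy.

    Assumed rules (not taken verbatim from the paper): one prize, positive
    priors, reliable verification, reward gamma ** t for a correct guess
    made after t verifications.
    """
    priors = tuple(priors)

    @lru_cache(maxsize=None)
    def solve(remaining, t):
        # Renormalize the prior over bags that have not been ruled out.
        mass = sum(priors[i] for i in remaining)
        posterior = {i: priors[i] / mass for i in remaining}
        # Option 1: commit now to the most likely remaining bag.
        top = max(remaining, key=lambda i: posterior[i])
        best_value, best_action = posterior[top] * gamma ** t, ("guess", top)
        # Option 2: verify bag i first, then continue optimally.
        if len(remaining) > 1:
            for i in remaining:
                rest = tuple(j for j in remaining if j != i)
                found = posterior[i] * gamma ** (t + 1)
                missed = (1 - posterior[i]) * solve(rest, t + 1)[0]
                if found + missed > best_value:
                    best_value, best_action = found + missed, ("verify", i)
        return best_value, best_action

    return solve(tuple(range(len(priors))), 0)


if __name__ == "__main__":
    value, action = optimal_policy_value((0.04, 0.68, 0.28), gamma=0.2)
    print(action, round(value, 3))
```

Under these assumptions, guessing the second bag (B) immediately has expected reward 0.68, while any plan that starts with a verification can earn at most γ = 0.2, which is consistent with the Figure 8 caption calling an immediate guess of B optimal.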
Original abstract

LLM agents are deployed in environments where they must interact to acquire information. In these scenarios, the agent must reason about inherent cost-uncertainty tradeoffs in how to act, such as when to stop exploring and commit to an answer. For instance, on a programming task, an agent might run the code it generates, or it might generate tests for that code snippet; the cost of writing and running a test is nonzero, but typically lower than the cost of running buggy code. In this work, we show that we can induce LLM agents to explicitly reason about balancing these cost-uncertainty tradeoffs, then act more optimally in their environments. We formalize multiple tasks, including retrieval-augmented QA and a file reading coding task, as sequential decision-making problems under uncertainty. Each problem has latent environment state that impacts the agent's performance. We introduce a framework called Calibrate-Then-Act (CTA), where we pass the agent an inferred prior about this environment state to enable it to act more optimally. This information qualitatively changes agent behavior, and adds environment sensitivity to the agent which is not learned via standard RL training. Our results on a synthetic task, QA, and file reading show that making cost-benefit tradeoffs explicit with CTA helps agents discover more optimal decision-making strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Calibrate-Then-Act (CTA), a framework that formalizes tasks such as retrieval-augmented QA and file-reading coding as POMDPs with latent environment states, infers a prior over that state, and passes it to an LLM agent to induce explicit reasoning about cost-uncertainty tradeoffs. The central claim is that this produces more optimal decision-making strategies on a synthetic task, QA, and file reading, and supplies environment sensitivity that is not learned via standard RL training.

Significance. If the empirical claims are substantiated, the work could provide a lightweight way to equip LLM agents with explicit cost-benefit calibration in uncertain interactive settings, offering a potential complement to pure RL for applications where exploration costs matter.

major comments (1)
  1. Abstract: the assertion that CTA 'adds environment sensitivity to the agent which is not learned via standard RL training' is load-bearing for the paper's novelty claim yet unsupported. No RL baseline (PPO, REINFORCE, or equivalent) trained on the identical synthetic, QA, or file-reading environments is reported, so it is impossible to determine whether the observed cost-sensitive behaviors are unique to CTA or could be acquired from reward signals alone.
minor comments (2)
  1. The abstract states that results on a synthetic task, QA, and file reading show improved decision-making strategies, but it supplies no quantitative metrics, error bars, or baseline comparisons, which prevents assessment of effect size or statistical reliability.
  2. Clarify the precise format in which the inferred prior is communicated to the LLM and the exact prompting mechanism used to elicit the cost-benefit reasoning.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We provide a point-by-point response to the major comment below.

Point-by-point responses
  1. Referee: Abstract: the assertion that CTA 'adds environment sensitivity to the agent which is not learned via standard RL training' is load-bearing for the paper's novelty claim yet unsupported. No RL baseline (PPO, REINFORCE, or equivalent) trained on the identical synthetic, QA, or file-reading environments is reported, so it is impossible to determine whether the observed cost-sensitive behaviors are unique to CTA or could be acquired from reward signals alone.

    Authors: We thank the referee for this observation. The claim in the abstract is intended to highlight that CTA enables explicit reasoning about environment-specific uncertainties by supplying a prior, which standard prompting or implicit RL optimization does not directly provide. Our empirical results on the synthetic task, QA, and file-reading demonstrate that agents using CTA exhibit cost-sensitive strategies that are absent in baseline LLM agents without the prior. We posit that acquiring similar sensitivity through standard RL would require substantial environment-specific training data and interactions, which is not the case for our zero-shot prior-based method. Nevertheless, we recognize that including RL baselines would provide stronger evidence. We will therefore revise the abstract to more precisely state that CTA induces environment sensitivity via explicit priors in a way that complements rather than replaces RL training, and we will expand the related work and discussion sections to elaborate on the distinctions from RL approaches. This revision will be incorporated in the next version of the manuscript.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in CTA derivation

full rationale

The paper formalizes tasks as POMDPs with latent environment state, then defines the CTA framework as passing an inferred prior to the LLM agent to induce explicit cost-uncertainty reasoning. This construction is presented as an external intervention rather than a self-referential loop; no equations reduce the claimed environment sensitivity to fitted parameters or prior outputs by definition. Empirical results on synthetic, QA, and file-reading tasks are reported as validation, with no load-bearing self-citations or ansatz smuggling identified in the derivation chain. The central claim of qualitative change beyond standard RL remains an empirical assertion rather than a definitional identity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLM agents can productively use an externally supplied prior to reason about costs; no free parameters or new physical entities are introduced in the abstract description.

axioms (1)
  • domain assumption: LLM agents can use a provided prior on latent environment state to reason about cost-uncertainty tradeoffs and change behavior beyond what standard RL training achieves.
    This premise is required for the CTA framework to produce the claimed qualitative change in agent behavior.

pith-pipeline@v0.9.0 · 5762 in / 1264 out tokens · 50148 ms · 2026-05-21T12:30:32.773876+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Exploration and Exploitation Errors Are Measurable for Language Model Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    A policy-agnostic metric and controllable 2D grid environments with task DAGs enable measurement of exploration and exploitation errors in language model agents from observed actions.

  2. QuantClaw: Precision Where It Matters for OpenClaw

    cs.AI 2026-04 unverdicted novelty 6.0

    QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 2 Pith papers · 4 internal anchors

  1. [1]

    org/CorpusID:281705844

    URL https://api.semanticscholar.org/CorpusID:281705844. Agarwal, D., Majumder, B. P., Adamson, R., Chakravorty, M., Gavireddy, S. R., Parashar, A., Surana, H., Mishra, B. D., McCallum, A., Sabharwal, A., et al. Open-ended Scientific Discovery via Bayesian Surprise. arXiv preprint arXiv:2507.00310, 2025. Bajaj, P., Campos, D., Craswell, N., Deng, L., Gao,...

  2. [2]

    ISBN 979-8-89176-256-5

    URL https://www.sciencedirect.com/science/article/pii/S0022000002918283. Chen, S., Chen, X., Huang, Y., Xie, R., and Dhingra, B. When greedy wins: Emergent exploitation bias in meta-bandit llm training. ArXiv, abs/2509.24923, 2025a. URL https://api.semanticscholar.org/CorpusID:281674231. Chen, W., Yuan, J., Qian, C., Yang, C., Liu, Z., and Sun, M. Optim...

  3. [3]

    findings-acl.601/

    URL https://aclanthology.org/2025.findings-acl.601/. Choi, J., Bansal, M., and Stengel-Eskin, E. Language models identify ambiguities and exploit loopholes. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 32991–33006, 2025. Cole, J. R., Zhang, M. J., Gillick, D., Eisenschlos, J. M., Dhingra, B., and Eisen...

  4. [4]

    Learning how hard to think: Input-adaptive allocation of lm computation.arXiv preprint arXiv:2410.04707,

    URL https://openreview.net/forum?id=x2W2dKdNI8. Damani, M., Shenfeld, I., Peng, A., Bobu, A., and Andreas, J. Learning how hard to think: Input-adaptive allocation of lm computation. ArXiv, abs/2410.04707,

  5. [5]

    org/CorpusID:273186996

    URL https://api.semanticscholar.org/CorpusID:273186996. Deng, M., Huang, L., Fan, Y., Zhang, J., Ren, F., Bai, J., Yang, F., Miao, D., Yu, Z., Wu, Y., Zhang, Y., Teng, F., Wan, Y., Hu, S., Li, Y., Jin, X., Hu, C., Li, H., Fu, Q., Zhong, T., Wang, X., Tang, X., Tang, N., Wu, C., and Luo, Y. InteractComp: Evaluating Search Agents With Ambiguous Queri...

  6. [6]

    Ellie Pavlick and Tom Kwiatkowski

    URL https://api.semanticscholar.org/CorpusID:282401680. Desai, S. and Durrett, G. Calibration of pre-trained transformers. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 295–302, Online, November 2020. Association for Computational Linguistics. ...

  7. [7]

    Elfleet, M

    URL https://openreview.net/forum?id=2vDJiGUfhV. Elfleet, M. and Chollet, M. Investigating the Impact of Multimodal Feedback on User-Perceived Latency and Immersion with LLM-Powered Embodied Conversational Agents in Virtual Reality. In IVA, pp. 12:1–12:9, 2024. URL https://doi.org/10.1145/3652988.3673965. Grand, G., Pepe, V., Andreas, J., and Tenenbau...

  8. [8]

    GX-Chen, A., Lin, D., Samiei, M., Precup, D., Richards, B

    URL https://openreview.net/forum?id=dIEeOwrmOe. GX-Chen, A., Lin, D., Samiei, M., Precup, D., Richards, B. A., Fergus, R., and Marino, K. Language agents mirror human causal reasoning biases. how can we help them think like scientists? ArXiv, abs/2505.09614,

  9. [9]

    org/CorpusID:278602122

    URL https://api.semanticscholar.org/CorpusID:278602122. Handa, K., Gal, Y., Pavlick, E., Goodman, N., Andreas, J., Tamkin, A., and Li, B. Z. Bayesian preference elicitation with language models. arXiv preprint arXiv:2403.05534, 2024. Hennig, L., Tornede, T., and Lindauer, M. Towards leveraging AutoML for sustainable deep learning: A multi-objective HP...

  10. [10]

    URL https://openreview.net/forum?id=jKN1pXi7b0. Jain, A. K., Gonzalez-Pumariega, G., Chen, W., Rush, A. M., Zhao, W., and Choudhury, S. Multi-turn code generation through single-step rewards. In Forty-second International Conference on Machine Learning,

  11. [11]

    Ji, S

    URL https://openreview.net/forum?id=aJeLhLcsh0. Ji, S. and Carin, L. Cost-sensitive feature acquisition and classification. Pattern Recognition, 40(5):1474–1485, 2007. Kärkkäinen, K., Kachuee, M., Goldstein, O., and Sarrafzadeh, M. Cost-sensitive feature-value acquisition using feature releva...

  12. [12]

    org/CorpusID:282064346

    URL https://api.semanticscholar.org/CorpusID:282064346. Lalai, H. N., Shah, R. S., Pei, J., Varma, S., Wang, Y.-C., and Emami, A. The world according to LLMs: How geographic origin influences LLMs’ entity deduction capabilities. In Second Conference on Language Modeling,

  13. [13]

    URL https://openreview.net/forum?id=hJtvCfDfs1. Li, B. Z., Kim, B., and Wang, Z. Questbench: Can LLMs ask the right question to acquire information in reasoning tasks? In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum?id=gpwA9aZLTZ. Li, Y. and Oliva, J...

  14. [14]

    CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents

    URL https://api.semanticscholar.org/CorpusID:283933928. Liu, J., Qian, C., Su, Z., Zong, Q., Huang, S., He, B., and Fung, Y. R. CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents. arXiv preprint arXiv:2511.02734, 2025. Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., and Hajishi...

  15. [15]

    emnlp-main.466/

    URL https://aclanthology.org/2020.emnlp-main.466/. Mohri, C. and Hashimoto, T. Language models with conformal factuality guarantees. In Proceedings of the 41st International Conference on Machine Learning, pp. 36029–36047, 2024. Monea, G., Bosselut, A., Brantley, K., and Artzi, Y. LLMs Are In-Context Bandit Reinforcement Learners. arXiv preprint arXi...

  16. [16]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ef0164c1112f56246224af540857348f-Paper-Datasets_and_Benchmarks.pdf. Shaikh, O., Mozannar, H., Bansal, G., Fourney, A., and Horvitz, E. Navigating rifts in human-LLM grounding: Study and benchmark. In Che, W., Nabende, J., Shutova...

  17. [17]

    RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

    doi: 10.18653/v1/2025.acl-long.887. URL https://aclanthology.org/2025.acl-long.887/. Wang, Z., Wang, K., Wang, Q., Zhang, P., Li, L., Yang, Z., Yu, K., Nguyen, M. N., Liu, L., Gottlieb, E., Lam, M., Lu, Y., Cho, K., Wu, J., Li, F.-F., Wang, L., Choi, Y., and Li, M. RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Lea...

  18. [18]

    org/CorpusID:259224900

    URL https://api.semanticscholar.org/CorpusID:259224900. Wu, S., Galley, M., Peng, B., Cheng, H., Li, G., Dou, Y., Cai, W., Zou, J., Leskovec, J., and Gao, J. CollabLLM: From passive responders to active collaborators. In Forty-second International Conference on Machine Learning,

  19. [19]

    Xiong, M., Hu, Z., Lu, X., LI, Y., Fu, J., He, J., and Hooi, B

    URL https://openreview.net/forum?id=DmH4HHVb3y. Xiong, M., Hu, Z., Lu, X., LI, Y., Fu, J., He, J., and Hooi, B. Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs. In The Twelfth International Conference on Learning Representations,

  20. [20]

    Steering LLM reasoning through bias-only adaptation

    URL https://openreview.net/forum?id=gjeQKFxFpZ. Xu, Y., Chen, Z., and Wen, Z. EcoTune: Token-efficient multi-fidelity hyperparameter optimization for large language model inference. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 7...

  21. [21]

    emnlp-main.394/

    URL https://aclanthology.org/2025.emnlp-main.394/. Yang, H., Yue, S., and He, Y. Auto-GPT for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224, 2023. Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., and Press, O. SWE-agent: Age...

  22. [22]

    Zadrozny, B

    URL https://openreview.net/forum?id=WE_vluYUL-X. Zadrozny, B. and Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 694–699, 2002. Zhang, M. J. and Choi, E. Clarify when necessary: Resolving ambiguity thro...

  23. [23]

    findings-naacl.306/

    URL https://aclanthology.org/2025.findings-naacl.306/. Zhang, M. J., Knox, W. B., and Choi, E. Modeling future conversation turns to teach LLMs to ask clarifying questions. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=cwuSAR7EKd. Zhou, L., Shi, J., Gao, J., and Wang, D. Credit-bud...

  24. [24]

    Then, I know B has the prize

    The result is YES (probability 0.68). Then, I know B has the prize. So, I can guess B at t=1, which would give me a reward of $1 \times 0.2^{1} = 0.2$

  25. [25]

    Then, B does not have the prize

    The result is NO (probability $1 - 0.68 = 0.32$). Then, B does not have the prize. Now, I have to decide whether to guess or verify another bag. But if I get NO on B, then the remaining bags are A and C. Their probabilities would be updated based on the fact that B is not the correct one. Wait, but the problem says that the bags are independent. Wait, actual...

  26. [26]

    file.csv

    UNIT TESTS
    Run unit tests to debug CSV formatting assumptions. Unit test outputs are perfectly reliable.
    Available unit tests:
      • test_delimiter(path) → {',', ';', '\t'}
      • test_quotechar(path) → {'"', "'"}
      • test_skiprows(path) → {0, 1}
    Format (NO code fences): UNIT TESTS: test_delimiter("file.csv"), test_quotechar("file.csv")
    You may include multiple unit tests ...

  27. [27]

    • Enclose code in‘‘‘python

    CODE
    Write Python code toward solving the task using your current assumptions about the CSV format.
      • Enclose code in ```python ... ```
      • You may import pandas as pd and read the file with: pd.read_csv(filepath, delimiter=..., quotechar=..., skiprows=...)
      • Do NOT print the entire CSV.
      • If your code computes the final result, print it to stdout so it can...

  28. [28]

    race tsv sas.tsv

    ANSWER
    Provide the final answer to the task and end the conversation.
    Format exactly: ANSWER: <your answer>
    The conversation ends immediately after you provide ANSWER.
    Reward:
      • Let $U$ be the total number of unit tests used.
      • Let $C$ be the total number of CODE actions taken.
      • Final reward $= \text{correctness} \times (d_{\text{unit}})^{U} \times (d_{\text{code}})^{C}$.
      • Discount factors represent cost mult...
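
The reward rule quoted in the prompt excerpt above can be written out directly. This is a small sketch, assuming correctness is 0 or 1 and that both discount factors lie in (0, 1]; the function name `discounted_reward` and the example discount values are hypothetical and are not taken from the paper.

```python
def discounted_reward(correct: bool, n_unit_tests: int, n_code_runs: int,
                      d_unit: float, d_code: float) -> float:
    """Compute correctness * d_unit**U * d_code**C from the prompt excerpt.

    Assumption: each discount factor is a multiplier in (0, 1], so every
    extra unit test or CODE action shrinks the reward for a correct answer.
    """
    if n_unit_tests < 0 or n_code_runs < 0:
        raise ValueError("action counts must be non-negative")
    if not (0.0 < d_unit <= 1.0 and 0.0 < d_code <= 1.0):
        raise ValueError("discount factors are assumed to lie in (0, 1]")
    return float(correct) * d_unit ** n_unit_tests * d_code ** n_code_runs


if __name__ == "__main__":
    # Illustrative values only: two unit tests, one code run.
    print(discounted_reward(True, 2, 1, d_unit=0.9, d_code=0.5))
    # A wrong answer earns nothing regardless of how cheaply it was reached.
    print(discounted_reward(False, 0, 1, d_unit=0.9, d_code=0.5))
```

Because the discounts multiply, a cheap unit test is worth running only when the uncertainty it removes outweighs the factor d_unit it costs, which is the tradeoff the prior is meant to make explicit.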