
arXiv: 2606.23937 · v1 · pith:BJSTLSDCnew · submitted 2026-06-22 · 💻 cs.CL · cs.AI · cs.LG

When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents

Pith reviewed 2026-06-26 08:03 UTC · model grok-4.3

keywords: retrieval metrics, policy classification, long-horizon agents, tool-use, exact-match recall, tau-bench, macro-F1, downstream utility

The pith

Exact-match clause recall underestimates downstream policy utility in tool-use agents

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether exact-match retrieval recall is an accurate proxy for how useful retrieved policy clauses are to a downstream decision model in long-horizon tool-use agents. Using tuned Qwen2.5-3B and 7B classifiers on tau-bench states, it replaces gold policy clauses with top-ranked retrieved ones and measures macro-F1. Although the exact governing clause is retrieved at rank 1 in only 7% of airline states, retrieved clauses yield a macro-F1 with no detected difference from gold (0.58 vs 0.60 for the 3B model), though the confidence interval is too wide to establish non-inferiority. Mismatched-policy and no-policy controls score lower, at 0.32 and 0.21. The results indicate that recall alone can mislead about policy signal, so evaluation should include retrieved policies in the classification loop.

Core claim

When the benchmark-designated policy clause is replaced by the top-ranked clause retrieved from decision-time context, the primary 3B classifier obtains macro-F1 0.58 with retrieved clauses versus 0.60 with gold clauses (Delta=-0.02). Although the exact governing clause is retrieved at rank 1 for only 7% of states, mismatched-policy and no-policy controls score 0.32 and 0.21. The same qualitative pattern appears with a second retriever and at 7B scale, while varying across fine-tuning configurations.

What carries the argument

Policy classification performance measured by macro-F1 on tau-bench states, comparing gold policy clauses against top-retrieved clauses from decision-time context.
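As an illustration of the metric named above, the following is a minimal sketch of macro-F1 (the unweighted mean of per-class F1). The label names `allow`, `deny`, and `escalate` are hypothetical placeholders for a three-way decision; the paper's actual label set is not stated here. Scoring a class with no gold and no predicted instances as 0 is also an assumption of this sketch, not something the paper specifies.

```python
from collections import Counter


def macro_f1(gold, pred, labels=None):
    """Unweighted mean of per-class F1 over `labels`."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if labels is None:
        labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # Assumption: a class absent from both gold and pred scores 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, for illustration only.
demo_gold = ["allow", "allow", "deny", "escalate"]
demo_pred = ["allow", "deny", "deny", "allow"]
print(round(macro_f1(demo_gold, demo_pred), 3))  # 0.389
```

Because every class counts equally, a classifier that ignores a rare decision class is penalized more under macro-F1 than under accuracy.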

If this is right

  • The exact governing clause is retrieved at rank 1 for only 7% of airline states, yet no macro-F1 drop is detected.
  • Mismatched-policy controls score 0.32 macro-F1, well below the 0.58 obtained with retrieved clauses, which suggests the retrieved clauses carry policy signal even without exact matches.
  • No-policy controls score 0.21, lower than both gold and retrieved conditions.
  • The pattern of near-equivalent performance holds across two retrievers and both 3B and 7B classifiers, though it varies across fine-tuning configurations.
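For contrast with the macro-F1 figures in the list above, exact-match recall@k can be sketched as follows. This assumes one gold clause ID per state and a ranked list of retrieved clause IDs per state; the clause IDs (`c1`, `c2`, ...) and the function name are hypothetical and do not come from the paper.

```python
def recall_at_k(rankings, gold_ids, k):
    """Fraction of states whose gold clause ID is among the top-k retrieved IDs."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(rankings) != len(gold_ids):
        raise ValueError("rankings and gold_ids must have the same length")
    if not rankings:
        raise ValueError("at least one state is required")
    hits = sum(1 for ranked, gold in zip(rankings, gold_ids) if gold in ranked[:k])
    return hits / len(rankings)


# Hypothetical rankings for four states (best match first).
demo_rankings = [
    ["c7", "c2", "c9"],
    ["c1", "c4", "c3"],
    ["c5", "c8", "c6"],
    ["c3", "c2", "c1"],
]
demo_gold_ids = ["c2", "c1", "c6", "c9"]
print(recall_at_k(demo_rankings, demo_gold_ids, 1))  # 0.25
print(recall_at_k(demo_rankings, demo_gold_ids, 3))  # 0.75
```

This metric credits a retrieval only when the clause ID matches exactly, so a different clause that carries the same decision-relevant content counts as a miss.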

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Retrieval systems for agents could be optimized for semantic or approximate policy matches rather than exact clause identity.
  • The finding suggests that structured state representations plus retrieved policies may suffice for classification even when exact recall is low.
  • Proxy evaluations of this type could be extended to measure policy signal in other long-horizon decision benchmarks.

Load-bearing premise

The policy classification task performed by the tuned Qwen2.5-3B/7B models on tau-bench states serves as a faithful proxy for the policy signal that would be available to an actual downstream decision model in a long-horizon agent.

What would settle it

Directly measuring long-horizon agent success rates or action accuracy when the decision model receives retrieved clauses versus gold clauses in the full tau-bench loop.

Figures

Figures reproduced from arXiv: 2606.23937 by Juan Pablo De la Cruz Weinstein, Tianyu Ding.

Figure 1: Gold-injection exact-match diagnostic. The curve is the structured−raw classifier macro-F1 gap as a function of effective exact-gold access rate (offline gold-injection sweep: gold clause for a fraction p of states, MiniLM top-1 retrieved clause otherwise; 95% CI band); it crosses zero only near recall ≈ 0.75. Triangles mark achievable exact-match recall@5 for every configuration (two domains × two query p…
Figure 2: Policy-clause identifiability profile (recall@k). For each domain, recall@k of the applicable gold policy clause as a function of k (log scale), for all four off-the-shelf retrievers, under the trajectory-context query protocol. Dotted line = the ≈ 0.75 diagnostic crossing. On airline (pool 122) recall stays far below this crossing until k = 50; on retail (pool 51) it reaches the crossing only near k = 20–50…
Figure 3: Three-way decision macro-F1 by classifier input under the identical SFT recipe (3…
Original abstract

Exact-match retrieval recall is often used as a proxy for whether a retriever supplies useful policy context to a downstream decision model. We test this proxy for pre-action policy classification in tau-bench using Qwen2.5-3B/7B classifiers. Under gold-policy conditioning, a compact structured state improves macro-F1 over raw trajectories by 0.13-0.17 after tuning. We then replace the benchmark-designated policy clause with the top-ranked clause retrieved from decision-time context. Although the exact governing clause is retrieved at rank 1 for only 7% of airline states, the primary 3B classifier obtains macro-F1 0.58 with retrieved clauses versus 0.60 with gold clauses (Delta=-0.02, task-cluster 95% CI [-0.23,+0.21]); mismatched-policy and no-policy controls score 0.32 and 0.21. We do not detect a macro-F1 difference between retrieved and gold clauses in this configuration, although the interval remains too wide to establish non-inferiority. The same qualitative pattern appears with a second retriever and at 7B, while varying across fine-tuning configurations. These results indicate that exact-match clause recall can underestimate downstream policy utility in this benchmark setting, motivating evaluation with retrieved policies in the classification loop rather than recall alone.
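The abstract reports a task-cluster 95% CI for the macro-F1 difference but does not describe how it is computed. One common way to build such an interval is a percentile bootstrap that resamples whole tasks with replacement, sketched below. Treating each tau-bench task as a cluster, the record layout, and all function names are assumptions made for illustration; this is not the authors' procedure.

```python
import random
from collections import Counter, defaultdict


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 over `labels`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    total = 0.0
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        total += 2 * tp[label] / denom if denom else 0.0
    return total / len(labels)


def cluster_bootstrap_delta(records, labels, n_boot=2000, seed=0):
    """Return (point, lower, upper) for macro-F1(retrieved) - macro-F1(gold clause).

    records: iterable of (task_id, gold_label, pred_with_retrieved, pred_with_gold).
    Whole tasks are resampled so states from one task stay together.
    """
    by_task = defaultdict(list)
    for task_id, gold, pred_ret, pred_gold in records:
        by_task[task_id].append((gold, pred_ret, pred_gold))
    task_ids = sorted(by_task)
    rng = random.Random(seed)

    def delta(rows):
        gold = [r[0] for r in rows]
        with_retrieved = macro_f1(gold, [r[1] for r in rows], labels)
        with_gold = macro_f1(gold, [r[2] for r in rows], labels)
        return with_retrieved - with_gold

    point = delta([row for t in task_ids for row in by_task[t]])
    samples = []
    for _ in range(n_boot):
        drawn = rng.choices(task_ids, k=len(task_ids))  # sample tasks with replacement
        samples.append(delta([row for t in drawn for row in by_task[t]]))
    samples.sort()
    lower = samples[int(0.025 * n_boot)]
    upper = samples[min(n_boot - 1, int(0.975 * n_boot))]
    return point, lower, upper
```

When there are few tasks, resampling at the task level tends to give much wider intervals than resampling individual states, which is one plausible reason for an interval as wide as [-0.23, +0.21] around a small point estimate.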

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that exact-match clause recall underestimates downstream policy utility for pre-action classification in tau-bench. With tuned Qwen2.5-3B/7B classifiers, the primary 3B model reaches macro-F1 0.58 with top-retrieved clauses versus 0.60 with gold clauses (Delta=-0.02, task-cluster 95% CI [-0.23,+0.21]) despite only 7% rank-1 exact-match recall; mismatched-policy and no-policy controls score 0.32 and 0.21. The same pattern appears with a second retriever and at 7B scale (varying by fine-tuning configuration), leading to the conclusion that evaluation should use retrieved policies inside the classification loop rather than recall alone.

Significance. If the macro-F1 scores serve as a valid proxy, the work identifies a potential disconnect between standard retrieval metrics and actual policy signal in long-horizon agents, which could shift evaluation practices toward end-to-end utility measures. The control conditions and cross-configuration consistency provide some empirical grounding, though the wide CI prevents strong claims of equivalence.

major comments (3)
  1. [Abstract] The central claim that exact-match recall underestimates policy utility rests on the assumption that macro-F1 from the tuned Qwen2.5-3B/7B classifiers is a faithful proxy for the policy signal available to an actual downstream decision model; no direct experiments measuring agent success rate, trajectory length, or error recovery when retrieved clauses are inserted into the real decision loop are reported.
  2. [Abstract] The reported Delta=-0.02 with wide 95% CI [-0.23,+0.21] (explicitly too wide to establish non-inferiority) combined with the absence of full methods, data splits, tuning details, or state counts limits verification of the key empirical result that retrieved and gold clauses produce statistically indistinguishable F1.
  3. [Abstract] The statement that the qualitative pattern 'varies across fine-tuning configurations' is noted without quantification or analysis of which configurations drive the variation, weakening robustness claims for the main finding.
minor comments (2)
  1. The term 'task-cluster 95% CI' is used without defining the clusters or the clustering procedure.
  2. No information is supplied on the number of states evaluated, number of runs, or how the 7% rank-1 recall was computed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments on our manuscript. We address each major comment below.

  1. Referee: the central claim that exact-match clause recall underestimates downstream policy utility rests on the assumption that macro-F1 from the tuned Qwen2.5-3B/7B classifiers is a faithful proxy for the policy signal available to an actual downstream decision model; no direct experiments measuring agent success rate, trajectory length, or error recovery when retrieved clauses are inserted into the real decision loop are reported.

    Authors: Our study uses the macro-F1 of the tuned classifier as a proxy for the availability of policy signal to a downstream model. The substantial drop in F1 for the mismatched-policy (0.32) and no-policy (0.21) controls indicates that the metric is sensitive to policy correctness. We view direct agent-loop experiments as complementary but beyond the current scope, which focuses on retrieval evaluation. We will add a sentence clarifying the proxy nature and scope in the revised abstract and discussion. revision: partial

  2. Referee: the reported Delta=-0.02 with wide 95% CI [-0.23,+0.21] (explicitly too wide to establish non-inferiority) combined with the absence of full methods, data splits, tuning details, or state counts limits verification of the key empirical result that retrieved and gold clauses produce statistically indistinguishable F1.

    Authors: The abstract already notes that the CI is too wide to establish non-inferiority. Full details on methods, splits, tuning, and state counts are provided in the main text. To address verifiability concerns, we will append a brief summary of the dataset size and primary hyperparameters to the abstract. revision: yes

  3. Referee: the statement that the qualitative pattern 'varies across fine-tuning configurations' is noted without quantification or analysis of which configurations drive the variation, weakening robustness claims for the main finding.

    Authors: We concur that quantification is needed. The revision will include a table or text reporting the F1 deltas for each fine-tuning configuration at both 3B and 7B scales, along with a short analysis of observed variation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurements on independent classification task


The paper's central result is an empirical comparison of macro-F1 scores for Qwen2.5 classifiers trained and evaluated on tau-bench states under gold-policy vs. retrieved-policy conditioning. No equations, fitted parameters, or self-citations are used to derive the reported deltas; the F1 values (0.58 vs 0.60) are direct outputs of standard fine-tuning and evaluation on held-out states. The claim that exact-match recall can underestimate policy utility follows from these measurements rather than reducing to them by construction. The proxy assumption (classification F1 as stand-in for downstream agent utility) is an interpretive limitation but does not create a self-referential loop in the reported numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the assumption that the chosen benchmark and classifiers validly test policy signal; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: The tau-bench airline task and Qwen2.5 classifiers constitute a representative test of policy utility for long-horizon tool-use agents.
    The entire experimental contrast depends on this representativeness.


