pith. sign in

arxiv: 2606.17541 · v1 · pith:6KU6JIV3new · submitted 2026-06-16 · 💻 cs.LG · cs.AI

Offline Preference-Based Trajectory Evaluation

Pith reviewed 2026-06-27 02:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords offline evaluationtrajectory evaluationpreference-based metricsagent benchmarkstie reductiondiscriminative powerbenchmark saturation
0
0 comments X

The pith

Preference-based comparison of full trajectories reduces ties in offline agent evaluations from 75% to 35%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard success-based metrics for evaluating agent trajectories discard information about partial progress, resulting in tied comparisons on about 75% of instances across benchmarks. The paper proposes comparing trajectories directly via temporal preferences on progress and time-to-return profiles instead. This approach cuts ties to roughly 35%, which increases the ability to distinguish between systems, stabilizes rankings, and makes better use of limited evaluation data. A reader would care because it offers a way to extract more signal from existing trajectory data without needing to run more experiments or collect additional labels.

Core claim

Offline evaluation of agentic systems often collapses trajectories to terminal success, discarding information about partial progress and inducing widespread ties, creating substantial statistical inefficiency by reducing effective sample size and weakening the ability to distinguish systems. We propose preference-based trajectory evaluation, which compares trajectories directly through temporal preferences over progress and time-to-return profiles. We find that, across diverse agentic and interactive benchmarks, standard success-based metrics produce tied comparisons on roughly 75% of instances, whereas trajectory-aware preferences reduce ties to roughly 35%, improving discriminative power,

What carries the argument

Temporal preferences over progress and time-to-return profiles, which enable direct comparison of entire trajectories rather than reducing them to binary terminal outcomes.

If this is right

  • Standard success-based metrics produce tied comparisons on roughly 75% of instances.
  • Trajectory-aware preferences reduce ties to roughly 35%.
  • Improved discriminative power allows better distinction between agent systems.
  • Ranking stability and data efficiency both increase with the preference-based method.
  • Benchmark saturation may partly result from the choice of evaluation measure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending this method to new domains could reveal similar inefficiencies in other evaluation settings like robotics or dialogue systems.
  • Automating the definition of temporal preferences might further reduce reliance on human judgments for scaling.
  • Combining trajectory preferences with other metrics could lead to hybrid evaluation frameworks that balance simplicity and informativeness.

Load-bearing premise

That temporal preferences over progress and time-to-return profiles can be defined and applied consistently across tasks without introducing bias or requiring task-specific human judgments that are difficult to scale.

What would settle it

A replication study on the same or similar benchmarks where the preference-based method fails to reduce tie rates below 60% or where the resulting rankings show no improvement in predicting held-out performance differences.

Figures

Figures reproduced from arXiv: 2606.17541 by Fernando Diaz.

Figure 1
Figure 1. Figure 1: Trajectory ties under success rate. (a) Two unsuccessful trajectories can be distinguished by partial returns. (b) Two successful trajectories can be distinguished by how they accumulate return over time. metrics across machine learning and natural language processing conferences is increasing. We found that the percentage of abstracts in papers published at the NeurIPS Datasets and Benchmarks track mentio… view at source ↗
Figure 2
Figure 2. Figure 2: Trajectory preferences based on time to return. Evaluation operates by comparing pairs of trajectories and aggregating preferences within trajectories. (a) Lexicographic Return (LR) derives a preference from the time to reach the earliest non-tied return. (b) Return-Paired Preference (RPP) integrates time-to-return across all return levels. (c) Interval-Paired Preference (IPP) compares the time between ret… view at source ↗
Figure 3
Figure 3. Figure 3: Inter-metric similarity: (a) Bump chart showing the ranking of systems across several measures for AgentBoard ALFWorld runs; charts for other tasks can be found in [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Sensitivity: (a) Distribution of measure values. (b) Discriminative power as a function of mean task success rate. Each point is one benchmark task. Curves are LOWESS fits. LR and IPP track RPP while SPL tracks SR; lines removed for clarity. Tie rate. Table 2d shows the tie rates across measures. As mentioned in Section 1, SR produces ties for 74.9% of instance comparisons, while SPL reduces ties to 63.4% … view at source ↗
Figure 5
Figure 5. Figure 5: Data efficiency. (a) Accuracy of model preferences based on each subsample fraction with respect to model preferences based on the full-data. (b) Fraction of oracle pairs that are both correctly ordered and statistically significant (Benjamini-Hochberg correction) as a function of the fraction of instances used. SR and PR omitted due to poor performance; LR and IPP omitted for clarity. 5.4 Data efficiency … view at source ↗
Figure 6
Figure 6. Figure 6: Bump charts for tasks. Each line is one model; rank 1 is top. [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Preference distributions across all benchmark datasets. [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Preference preservation (self-reference) — AgentBoard tasks. [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Preference preservation (self-reference) — TAC and OHI tasks. [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Preference preservation (self-reference) — TALES tasks ( [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Preference preservation (self-reference) — SGRL tasks. [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Oracle sign accuracy as a function of the fraction of instances used to compute each metric. [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Oracle significant accuracy (FWER, Holm correction) as a function of the fraction of [PITH_FULL_IMAGE:figures/full_fig_p022_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Oracle significant accuracy (FDR, Benjamini-Hochberg correction) as a function of the [PITH_FULL_IMAGE:figures/full_fig_p022_14.png] view at source ↗
read the original abstract

Offline evaluation of agentic systems often collapses trajectories to terminal success, discarding information about partial progress and inducing widespread ties, creating substantial statistical inefficiency by reducing effective sample size and weakening the ability to distinguish systems. We propose preference-based trajectory evaluation, which compares trajectories directly through temporal preferences over progress and time-to-return profiles. We find that, across diverse agentic and interactive benchmarks, standard success-based metrics produce tied comparisons on roughly 75% of instances, whereas trajectory-aware preferences reduce ties to roughly 35%, improving discriminative power, ranking stability, and data efficiency. Our results suggest that benchmark saturation, often attributed to poor data collection or problem difficulty, may also be explained by the choice of evaluation measure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes preference-based trajectory evaluation for offline assessment of agentic systems. It claims that standard success-based metrics, which collapse trajectories to terminal success, produce tied comparisons on roughly 75% of instances across diverse agentic and interactive benchmarks, while trajectory-aware preferences over progress and time-to-return profiles reduce ties to roughly 35%, improving discriminative power, ranking stability, and data efficiency. The authors suggest that benchmark saturation may partly result from the choice of evaluation measure.

Significance. If the empirical results hold with proper validation, the work could meaningfully improve evaluation practices in reinforcement learning and agentic AI by utilizing more trajectory information and mitigating statistical inefficiency from ties. The cross-benchmark empirical observation is a potential strength for practical impact.

major comments (1)
  1. [Abstract] Abstract: The central empirical claim reports specific tie-reduction percentages (75% to 35%) but supplies no methodological details on preference definition, comparison procedure, benchmark selection, tie-counting criteria, validation procedures, or error analysis. This absence is load-bearing for assessing whether the data support the claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for greater methodological transparency in the abstract. We address this directly below and will revise the manuscript to improve self-containment of the central claim while preserving the abstract's brevity.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central empirical claim reports specific tie-reduction percentages (75% to 35%) but supplies no methodological details on preference definition, comparison procedure, benchmark selection, tie-counting criteria, validation procedures, or error analysis. This absence is load-bearing for assessing whether the data support the claim.

    Authors: We agree that the abstract, constrained by length, omits explicit methodological details and that this limits immediate assessment of the claim. The full definitions appear in Section 3 (temporal preferences are defined over normalized progress curves and time-to-return profiles using a Bradley-Terry model with a fixed margin) and Section 4 (benchmarks comprise WebArena, ALFWorld, BabyAI, and three additional interactive environments; pairwise comparisons are performed on all trajectory pairs per task; ties are counted when success labels match or when preference scores differ by less than the margin; results are averaged over 5 seeds with bootstrap confidence intervals). We will revise the abstract to add one sentence summarizing the preference construction and benchmark scope, and we will ensure the results section explicitly cross-references these procedures. No separate error analysis beyond the reported confidence intervals was performed; if the referee considers additional sensitivity checks necessary we can add them. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical observation

full rationale

The manuscript reports an empirical finding on tie rates (75% vs 35%) across benchmarks when comparing success-based metrics to preference-based trajectory evaluation. No equations, derivations, fitted parameters, or self-citations are invoked as load-bearing steps in the provided text. The result is a direct data observation rather than a constructed prediction or self-referential definition, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the representativeness of the unspecified 'diverse agentic and interactive benchmarks' and on the assumption that the preference profiles capture meaningful distinctions without additional validation.

axioms (1)
  • domain assumption The selected benchmarks are representative of the broader space of agentic and interactive tasks.
    Findings are stated to hold 'across diverse agentic and interactive benchmarks' but no selection criteria or coverage argument is supplied in the abstract.

pith-pipeline@v0.9.1-grok · 5626 in / 1170 out tokens · 52452 ms · 2026-06-27T02:36:03.659146+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

54 extracted references · 21 canonical work pages

  1. [1]

    Agarwal, M

    R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. Bellemare. Deep reinforcement learning at the edge of the statistical precipice. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P. Liang, and J. W. Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 29304–29320. Curran Associates, Inc., 2021. URL https://proceed...

  2. [2]

    Akhtar, A

    M. Akhtar, A. Reuel, P. Soni, S. Ahuja, P. S. Ammanamanchi, R. Rawal, V . Zouhar, S. Yadav, C. Whitehouse, D. Ki, J. Mickel, L. Choshen, M. Šuppa, J. Batzner, J. Chim, J. Sania, Y . Long, H. A. Rahmani, C. Knight, Y . Nan, J. Raj, Y . Fan, S. Singh, S. Sahoo, E. Habba, U. Gohar, S. Pawar, R. Scholz, A. Subramonian, J. Ni, M. Kochenderfer, S. Koyejo, M. Sa...

  3. [3]

    Anderson, A

    P. Anderson, A. X. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V . Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, and A. R. Zamir. On evaluation of embodied navigation agents. CoRR, abs/1807.06757, 2018. URLhttp://arxiv.org/abs/1807.06757

  4. [4]

    Ashury Tahan, A

    S. Ashury Tahan, A. Gera, B. Sznajder, L. Choshen, L. Ein-Dor, and E. Shnarch. Label- efficient model selection for text generation. In L.-W. Ku, A. Martins, and V . Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8384–8402, Bangkok, Thailand, Aug. 2024. Association f...

  5. [5]

    Chapelle, T

    O. Chapelle, T. Joachims, F. Radlinski, and Y . Yue. Large-scale validation and analysis of interleaved search evaluation.ACM Trans. Inf. Syst., 30(1), Mar. 2012. ISSN 1046-8188. doi: 10.1145/2094072.2094078. URLhttps://doi.org/10.1145/2094072.2094078

  6. [6]

    Chiang, L

    W.-L. Chiang, L. Zheng, Y . Sheng, A. N. Angelopoulos, T. Li, D. Li, B. Zhu, H. Zhang, M. I. Jordan, J. E. Gonzalez, and I. Stoica. Chatbot arena: an open platform for evaluating llms by human preference. InProceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  7. [7]

    Chouldechova, A

    A. Chouldechova, A. F. Cooper, S. Barocas, A. Palia, D. Vann, and H. Wallach. Comparison requires valid measurement: Rethinking attack success rate comparisons in AI red teaming. In The Thirty-Ninth Annual Conference on Neural Information Processing Systems Position Paper Track, 2025. URLhttps://openreview.net/forum?id=d7hqAhLvWG

  8. [8]

    Cohen.Statistical Power Analysis for the Behavioral Sciences

    J. Cohen.Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, Hillsdale, NJ, 2nd edition, 1988

  9. [9]

    C. Z. Cui, X. Yuan, Z. Xiao, P. Ammanabrolu, and M.-A. Côté. Tales: Text adventure learning environment suite, 2025. URLhttps://arxiv.org/abs/2504.14128

  10. [10]

    Diaz and A

    F. Diaz and A. Ferraro. Offline retrieval evaluation without evaluation metrics. InProceedings of the 45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 599–609, New York, NY , USA, 2022. Association for Computing Machinery. URLhttps://doi.org/10.1145/3477495.3532033

  11. [11]

    F. Diaz, M. D. Ekstrand, and B. Mitra. Recall, robustness, and lexicographic evaluation.ACM Trans. Recomm. Syst., 4(1), July 2025. doi: 10.1145/3728373. URL https://doi.org/10. 1145/3728373

  12. [12]

    Frederick, G

    S. Frederick, G. Loewenstein, and T. O’Donoghue. Time discounting and time prefer- ence: A critical review.Journal of Economic Literature, 40(2):351–401, June 2002. doi: 10.1257/002205102320161311. URL https://www.aeaweb.org/articles?id=10.1257/ 002205102320161311

  13. [13]

    Ghosh, Y

    A. Ghosh, Y . Mai, G. Channing, and L. Choshen. AI evals are becoming the new compute bot- tleneck. EvalEval Coalition Blog, Apr. 2026. URL https://evalevalai.com/research/ 2026/04/29/eval-costs-bottleneck/

  14. [14]

    Henderson, R

    P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters.Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), Apr

  15. [15]
  16. [16]

    Huang, J

    Y . Huang, J. Song, Q. Hu, F. Juefei-Xu, and L. Ma. Actracer: Active testing of large language model via multi-stage sampling.ACM Trans. Softw. Eng. Methodol., 35(3), Feb. 2026. ISSN 1049-331X. doi: 10.1145/3744340. URLhttps://doi.org/10.1145/3744340

  17. [17]

    Joachims

    T. Joachims. Optimizing search engines using clickthrough data. InKDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 133–142, 2002. ISBN 1-58113-567-X. doi: http://doi.acm.org/10.1145/775047.775067

  18. [18]

    Kapoor, B

    S. Kapoor, B. Stroebl, P. Kirgis, N. Nadgir, Z. S. Siegel, B. Wei, T. Xue, Z. Chen, F. Chen, S. Utpala, F. Ndzomga, D. Oruganty, S. Luskin, K. Liu, B. Yu, A. Arora, D. Hahm, H. Trivedi, H. Sun, J. Lee, T. Jin, Y . Mai, Y . Zhou, Y . Zhu, R. Bommasani, D. Kang, D. Song, P. Henderson, Y . Su, P. Liang, and A. Narayanan. Holistic agent leaderboard: The missi...

  19. [19]

    URLhttps://openreview.net/forum?id=vUaY1t64ZZ

  20. [20]

    Kiela, M

    D. Kiela, M. Bartolo, Y . Nie, D. Kaushik, A. Geiger, Z. Wu, B. Vidgen, G. Prasad, A. Singh, P. Ringshia, Z. Ma, T. Thrush, S. Riedel, Z. Waseem, P. Stenetorp, R. Jia, M. Bansal, C. Potts, and A. Williams. Dynabench: Rethinking benchmarking in NLP. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chak...

  21. [21]

    doi: 10.18653/v1/2021.naacl-main.324

    Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.324. URL https://aclanthology.org/2021.naacl-main.324/

  22. [22]

    Kossen, S

    J. Kossen, S. Farquhar, Y . Gal, and T. Rainforth. Active testing: Sample-efficient model evaluation. In M. Meila and T. Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 5753–

  23. [23]

    URL https://proceedings.mlr.press/v139/kossen21a

    PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/kossen21a. html

  24. [24]

    Y . Li, J. Ma, M. Ballesteros, Y . Benajiba, and G. Horwood. Active evaluation acquisition for efficient LLM benchmarking. InForty-second International Conference on Machine Learning,

  25. [25]

    URLhttps://openreview.net/forum?id=EHqQaBYYlE

  26. [26]

    C. Ma, J. Zhang, Z. Zhu, C. Yang, Y . Yang, Y . Jin, Z. Lan, L. Kong, and J. He. Agentboard: An analytical evaluation board of multi-turn llm agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 74325–74362. Curran Associates, Inc.,

  27. [27]

    doi:10.52202/079017-2365 , pages =

    doi: 10.52202/079017-2365. URL https://proceedings.neurips.cc/paper_ files/paper/2024/file/877b40688e330a0e2a3fc24084208dfa-Paper-Datasets_ and_Benchmarks_Track.pdf

  28. [28]

    Z. Ma, K. Ethayarajh, T. Thrush, S. Jain, L. Wu, R. Jia, C. Potts, A. Williams, and D. Kiela. Dynaboard: An evaluation-as-a-service platform for holistic next-generation benchmarking. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P. Liang, and J. W. Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 10351–10367. Curran Asso...

  29. [29]

    Maia Polo, L

    F. Maia Polo, L. Weber, L. Choshen, Y . Sun, G. Xu, and M. Yurochkin. tinybenchmarks: evaluating llms with fewer examples. InProceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  30. [30]

    Maksai, F

    A. Maksai, F. Garcin, and B. Faltings. Predicting online performance of news recommender systems through richer evaluation metrics. InProceedings of the 9th ACM Conference on Recommender Systems, RecSys ’15, pages 179–186, New York, NY , USA, 2015. Association for Computing Machinery. ISBN 9781450336925. doi: 10.1145/2792838.2800184. URL https://doi.org/1...

  31. [31]

    Mandel and R

    J. Mandel and R. D. Stiehler. Sensitivity–a criterion for the comparison of methods of test. Journal of research of the National Bureau of Standards, 53:155, 1954. URL https://api. semanticscholar.org/CorpusID:52393909

  32. [32]

    Prudêncio and Adolfo Martínez-Usó and José Hernández-Orallo , keywords =

    F. Martínez-Plumed, R. B. Prudêncio, A. Martínez-Usó, and J. Hernández-Orallo. Item response theory in ai: Analysing machine learning classifiers at the instance level.Artificial Intelligence, 271:18–42, 2019. ISSN 0004-3702. doi: https://doi.org/10.1016/j.artint.2018.09.004. URL https://www.sciencedirect.com/science/article/pii/S0004370219300220

  33. [33]

    Moffat and J

    A. Moffat and J. Mackenzie. How much freedom does an effectiveness metric really have? Journal of the Association for Information Science and Technology, n/a(n/a), 2024. doi: https://doi.org/10.1002/asi.24874. URL https://asistdl.onlinelibrary.wiley.com/ doi/abs/10.1002/asi.24874

  34. [34]

    A. K. Mohankumar and M. Khapra. Active evaluation: Efficient NLG evaluation with few pairwise comparisons. In S. Muresan, P. Nakov, and A. Villavicencio, editors,Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8761–8781, Dublin, Ireland, May 2022. Association for Computational Linguist...

  35. [35]

    Munos, M

    R. Munos, M. Valko, D. Calandriello, M. G. Azar, M. Rowland, Z. D. Guo, Y . Tang, M. Geist, T. Mesnard, C. Fiegel, A. Michi, M. Selvi, S. Girgin, N. Momchev, O. Bachem, D. J. Mankowitz, D. Precup, and B. Piot. Nash learning from human feedback. InForty-first International Conference on Machine Learning, 2024. URL https://openreview.net/ forum?id=Y5AmNYiyCQ. 13

  36. [36]

    F. Ndzomga. Efficient benchmarking of ai agents, 2026. URL https://arxiv.org/abs/ 2603.23749

  37. [37]

    Olteanu, S

    A. Olteanu, S. L. Blodgett, A. Balayn, A. Wang, F. Diaz, F. du Pin Calmon, M. Mitchell, M. Ekstrand, R. Binns, and S. Barocas. Rigor in ai: Doing rigorous ai work requires a broader, responsible ai-informed conception of rigor. InAdvances in Neural Information Processing Systems, 2025. URLhttps://arxiv.org/abs/2506.14652

  38. [38]

    S. Ott, A. Barbosa-Silva, K. Blagec, J. Brauner, and M. Samwald. Mapping global dynamics of benchmark creation and saturation in artificial intelligence.Nature Communications, 13 (1):6793, 2022. doi: 10.1038/s41467-022-34591-0. URL https://doi.org/10.1038/ s41467-022-34591-0

  39. [39]

    Peyrard, W

    M. Peyrard, W. Zhao, S. Eger, and R. West. Better than average: Paired evaluation of NLP systems. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2301–2315, Online, Aug. 2021. Association for Computational Lin...

  40. [40]

    Rodriguez, J

    P. Rodriguez, J. Barrow, A. Hoyle, J. P. Lalor, R. Jia, and J. Boyd-Graber. Evaluation examples are not equally informative: How should that change NLP leaderboards? In C. Zong, F. Xia, W. Li, and R. Navigli, editors,Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natur...

  41. [41]

    and Jia, Robin and Boyd-Graber, Jordan

    Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.346. URL https://aclanthology.org/2021.acl-long.346/

  42. [42]

    T. Sakai. Evaluating evaluation metrics based on the bootstrap. InProceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’06, pages 525–532, New York, NY , USA, 2006. Association for Computing Machinery. ISBN 1595933697. doi: 10.1145/1148170.1148261. URL https://doi.org/10. 1145/114817...

  43. [43]

    T. Sakai. Alternatives to bpref. InSIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 71–78, New York, NY , USA, 2007. ACM. ISBN 978-1-59593-597-7. doi: http://doi.acm.org/10.1145/ 1277741.1277756

  44. [44]

    Subramani, A

    N. Subramani, A. Gomez, and M. T. Diab. SimBA: Simplifying benchmark analysis using performance matrices alone. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 13220– 13233, Suzhou, China, Nov. 2025. Association for Computational Linguistics. ISBN 979-8- ...

  45. [45]

    Sutton and A

    R. Sutton and A. Barto.Reinforcement Learning. MIT Press, 1998

  46. [46]

    Swamy, C

    G. Swamy, C. Dann, R. Kidambi, S. Wu, and A. Agarwal. A minimaximalist approach to rein- forcement learning from human feedback. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, editors,Proceedings of the 41st International Confer- ence on Machine Learning, volume 235 ofProceedings of Machine Learning Resear...

  47. [47]

    O. Team. Openhands index: A comprehensive leaderboard for ai coding agents. https://index.openhands.dev, 2025

  48. [48]

    S. T. Truong, Y . Tu, P. Liang, B. Li, and S. Koyejo. Reliable and efficient amortized model- based evaluation. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=HDbWrsgkB9. 14 Year ACL EMNLP NeurIPS Main NeurIPS Data 2022 8.8% 8.9% 13.6% 4.9% 2023 10.5% 10.9% 13.5% 7.8% 2024 10.4% 12.6% 13.0% 11.1% 202...

  49. [49]

    Vania, P

    C. Vania, P. M. Htut, W. Huang, D. Mungra, R. Y . Pang, J. Phang, H. Liu, K. Cho, and S. R. Bow- man. Comparing test sets with item response theory. In C. Zong, F. Xia, W. Li, and R. Navigli, ed- itors,Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Pro...

  50. [50]

    M. N. V olkovs and R. S. Zemel. A flexible generative model for preference aggregation. In Proceedings of the 21st International Conference on World Wide Web, WWW ’12, pages 479– 488, New York, NY , USA, 2012. Association for Computing Machinery. ISBN 9781450312295. doi: 10.1145/2187836.2187902. URLhttps://doi.org/10.1145/2187836.2187902

  51. [51]

    Wallach, M

    H. Wallach, M. Desai, A. F. Cooper, A. Wang, C. Atalla, S. Barocas, S. L. Blodgett, A. Chouldechova, E. Corvi, P. A. Dow, J. Garcia-Gathright, A. Olteanu, N. J. Pangakis, S. Reed, E. Sheng, D. Vann, J. W. Vaughan, M. V ogel, H. Washington, and A. Z. Jacobs. Position: Evaluating generative AI systems is a social science measurement challenge. In Forty-seco...

  52. [52]

    F. F. Xu, Y . Song, B. Li, Y . Tang, K. Jain, M. Bao, Z. Z. Wang, X. Zhou, Z. Guo, M. Cao, M. Yang, H. Y . Lu, A. Martin, Z. Su, L. Maben, R. Mehta, W. Chi, L. Jang, Y . Xie, S. Zhou, and G. Neubig. Theagentcompany: Benchmarking llm agents on consequential real world tasks,

  53. [53]

    A Use of binary metrics at ML and NLP conferences We used the OpenReview API to gather abstracts for NeurIPS, NeurIPS Datasets and Benchmarks, ACL, and EMNLP between 2022 and 2025

    URLhttps://arxiv.org/abs/2412.14161. A Use of binary metrics at ML and NLP conferences We used the OpenReview API to gather abstracts for NeurIPS, NeurIPS Datasets and Benchmarks, ACL, and EMNLP between 2022 and 2025. We then identified abstracts that contained references to any of: success rate, accuracy, exact match, task success, episode success, top-1...

  54. [54]

    Text Adventure Learning Environment Suite data [9] downloaded on 11 April 2026 from https://huggingface.co/datasets/PEARLS-Lab/TALES-Trajectories

    downloaded from https://github.com/TheAgentCompany/experiments/tree/main/ evaluation/1.0.0. Text Adventure Learning Environment Suite data [9] downloaded on 11 April 2026 from https://huggingface.co/datasets/PEARLS-Lab/TALES-Trajectories. To sup- port statistical analysis, we remove tasks with fewer than 30 task instances. B.1 Sub-Goal Reinforcement Learn...