ProactBench: Beyond What The User Asked For
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 02:04 UTC · model grok-4.3
The pith
Recovery after task completion is difficult for LLMs and weakly tied to standard benchmarks
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ProactBench decomposes conversational proactivity into Emergent, Critical, and Recovery phases. It uses a Planner, User Agent, and Assistant Model with information asymmetries to produce 198 dialogues containing 624 trigger points across 24 communication styles. Evaluation of frontier and open-weight models shows Recovery is difficult and weakly predicted by existing benchmarks, establishing it as a distinct evaluation signal.
What carries the argument
A three-agent architecture (Planner, User Agent, Assistant Model) whose information asymmetries let trigger points be generated without biasing the proactivity evaluation.
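To make the asymmetry concrete, here is a minimal sketch of what such a generation loop could look like, assuming explicit per-role knowledge partitions; every name in it (PARTITION, generate, run_dialogue, the context keys) is illustrative rather than taken from the paper's released code.

```python
# Hypothetical sketch of the three-agent generation loop with explicit
# knowledge partitions. All names are illustrative, not the paper's code.

# Each agent's visible context; the Assistant never sees triggers, styles, or the rubric.
PARTITION = {
    "planner":    ("scenario", "trigger_list", "rubric"),
    "user_agent": ("persona", "current_goal", "dialogue_history"),
    "assistant":  ("dialogue_history",),
}

def generate(agent_role: str, visible_context: dict) -> str:
    """Placeholder for an LLM call; enforces the role's knowledge partition."""
    assert set(visible_context) <= set(PARTITION[agent_role]), "partition violation"
    raise NotImplementedError  # replace with an actual model call

def run_dialogue(scenario: dict, max_turns: int = 12) -> dict:
    plan = generate("planner", {"scenario": scenario})  # triggers + rubric stay here
    history: list[dict] = []
    for _ in range(max_turns):
        user_turn = generate("user_agent", {
            "persona": scenario["style"],               # communication style
            "current_goal": scenario["task"],
            "dialogue_history": history,                # never the trigger list or rubric
        })
        history.append({"role": "user", "content": user_turn})
        assistant_turn = generate("assistant", {
            "dialogue_history": history,                # conversation only
        })
        history.append({"role": "assistant", "content": assistant_turn})
    return {"plan": plan, "dialogue": history}          # plan is used only for scoring
```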
If this is right
- Recovery performance can function as an independent metric when comparing how models handle real conversations.
- Standard benchmarks leave out important aspects of helpfulness that involve anticipating unstated needs.
- The 624 trigger points across 24 styles allow testing model robustness to different user communication patterns.
- Improving recovery may require training methods that emphasize post-task forward-looking inference rather than explicit instructions alone.
Where Pith is reading between the lines
- Interfaces built around recovery checks could lower user frustration during extended chat sessions.
- Developers could apply the benchmark to spot gaps in training data that favor explicit over implicit user signals.
- The information-asymmetry method might extend to testing other subtle skills such as timely clarification requests.
Load-bearing premise
The three-agent setup with information asymmetries successfully prevents style confounding, rubric leakage, and information dumps without introducing new artifacts.
What would settle it
Observing a strong correlation between Recovery scores and performance on the six standard benchmarks across additional models would indicate that Recovery does not provide a useful new signal.
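A minimal sketch of that settling test, assuming per-model Recovery scores and benchmark scores are available; the numbers are placeholders, and MMLU/GPQA merely stand in for whichever six benchmarks the paper actually uses.

```python
# Sketch of the settling test: correlate per-model Recovery scores with
# standard-benchmark scores. All numbers below are placeholders.
from scipy.stats import pearsonr, spearmanr

recovery = [0.31, 0.45, 0.28, 0.52, 0.40, 0.36]        # Recovery score per model
benchmarks = {
    "MMLU": [0.71, 0.80, 0.68, 0.84, 0.77, 0.74],      # one score list per benchmark
    "GPQA": [0.38, 0.51, 0.35, 0.56, 0.47, 0.42],
}

for name, scores in benchmarks.items():
    r, p_r = pearsonr(recovery, scores)
    rho, p_rho = spearmanr(recovery, scores)
    print(f"{name}: Pearson r={r:.2f} (p={p_r:.3f}), Spearman rho={rho:.2f} (p={p_rho:.3f})")
```

Consistently strong correlations across many additional models would undercut Recovery as a distinct signal; persistently weak ones would reinforce the paper's claim.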
Original abstract
Most LLM benchmarks score how well a model responds to explicit requests. They leave unmeasured a different conversational ability: noticing and acting on needs the user has implied but not said. We call this conversational proactivity. ProactBench decomposes it into three phase-tied types: Emergent, inference from a single disclosed anchor; Critical, synthesis across multiple anchors; and Recovery, grounded forward-looking value after task completion. We operationalise the benchmark with three agents: a Planner, a User Agent, and an Assistant Model. Their information asymmetries defend against style-confounded scoring, rubric leakage, external-context contamination, and information dumps. The released corpus contains 198 curated dialogues with 624 trigger points across 24 communication styles drawn from a psychometric inventory and audited by an independent LLM judge. Across 16 frontier and open-weight models, Recovery is both difficult and weakly predicted by six standard benchmarks, making it a useful new evaluation signal.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ProactBench to measure conversational proactivity in LLMs—the ability to notice and act on implied but unstated user needs. It decomposes proactivity into three phase-tied types (Emergent: inference from one anchor; Critical: synthesis across anchors; Recovery: grounded forward-looking value post-task) and operationalizes the benchmark via a three-agent protocol (Planner, User Agent, Assistant) whose information asymmetries are intended to block style confounding, rubric leakage, and information dumps. The released corpus comprises 198 dialogues with 624 trigger points spanning 24 psychometric communication styles; evaluations across 16 models indicate that Recovery is difficult and only weakly predicted by six standard benchmarks, positioning it as a distinct evaluation signal.
Significance. If the multi-agent construction successfully isolates genuine proactivity without introducing new artifacts, the reported weak correlation between Recovery scores and existing benchmarks would constitute a useful new signal for capabilities not captured by instruction-following evaluations. The release of the curated corpus and the psychometric grounding of styles are concrete strengths that could enable follow-on work.
major comments (2)
- [§3] Three-agent operationalization: The central claim that Recovery scores reflect proactivity rather than protocol artifacts rests on the assertion that Planner/User-Agent/Assistant information asymmetries prevent style confounding, rubric leakage, and information dumps. The manuscript does not enumerate the precise knowledge partitions (e.g., whether the User Agent ever receives the Planner’s trigger list, the Assistant’s prior turns, or the full rubric), leaving open the possibility that residual style signals or handoff artifacts inflate Recovery difficulty and produce the observed weak correlations with the six external benchmarks.
- [Results] Evaluation of 16 models: The claim that Recovery “is both difficult and weakly predicted” by six standard benchmarks is load-bearing for the paper’s contribution. Without reported correlation coefficients, confidence intervals, or explicit exclusion criteria for the six benchmarks, it is impossible to verify that the weak relationship is statistically distinguishable from noise or from the difficulty of the task itself.
minor comments (2)
- [Abstract] The abstract introduces “anchor” and “trigger points” without a concise definition on first use; a parenthetical gloss would improve readability.
- [§3] The auditing procedure by the independent LLM judge is mentioned but lacks details on prompt, agreement metric, or disagreement resolution; these should be added to the corpus-construction subsection.
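One conventional choice for the requested agreement metric is Cohen's kappa between the LLM judge and a human auditor; the sketch below uses fabricated labels purely to illustrate the computation, not the paper's actual audit data.

```python
# Sketch of a chance-corrected agreement check between the LLM judge and a
# human auditor over trigger-point audit labels; labels here are fabricated.
from sklearn.metrics import cohen_kappa_score

judge_labels = ["valid", "valid", "invalid", "valid", "invalid", "valid"]
human_labels = ["valid", "valid", "invalid", "invalid", "invalid", "valid"]

kappa = cohen_kappa_score(judge_labels, human_labels)
print(f"Cohen's kappa = {kappa:.2f}")
```

Reporting kappa (or a weighted variant for ordinal rubric scores) alongside the disagreement-resolution procedure would address this comment.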
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate the suggested clarifications.
Point-by-point responses
- Referee: [§3] Three-agent operationalization: The central claim that Recovery scores reflect proactivity rather than protocol artifacts rests on the assertion that Planner/User-Agent/Assistant information asymmetries prevent style confounding, rubric leakage, and information dumps. The manuscript does not enumerate the precise knowledge partitions (e.g., whether the User Agent ever receives the Planner’s trigger list, the Assistant’s prior turns, or the full rubric), leaving open the possibility that residual style signals or handoff artifacts inflate Recovery difficulty and produce the observed weak correlations with the six external benchmarks.
  Authors: We acknowledge that the original description of the three-agent protocol, while outlining the intended information asymmetries, did not include an exhaustive enumeration of knowledge partitions. In the revised manuscript we have added a dedicated table in §3 that specifies the exact information available to each agent at every stage. The User Agent receives only the current simulated utterance and dialogue history and has no access to the Planner’s trigger list or the full rubric; the Assistant receives solely the conversation history without any prior knowledge of triggers, styles, or evaluation criteria. This explicit partitioning directly mitigates concerns about residual style signals or handoff artifacts and supports the claim that Recovery scores reflect proactivity rather than protocol effects. Revision: yes.
- Referee: [Results] Evaluation of 16 models: The claim that Recovery “is both difficult and weakly predicted” by six standard benchmarks is load-bearing for the paper’s contribution. Without reported correlation coefficients, confidence intervals, or explicit exclusion criteria for the six benchmarks, it is impossible to verify that the weak relationship is statistically distinguishable from noise or from the difficulty of the task itself.
  Authors: We agree that quantitative statistical support is required to substantiate the claim. The revised Results section now includes a table reporting Pearson and Spearman correlations between Recovery scores and each of the six benchmarks, together with 95% confidence intervals and p-values. We have also added explicit selection criteria for the benchmarks (representative instruction-following, reasoning, and knowledge tasks) and note that the observed correlations remain low (|r| < 0.3) even after accounting for task difficulty. These additions allow readers to confirm that the weak relationship is statistically distinguishable from stronger correlations seen for the other proactivity phases. Revision: yes.
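One way to produce the promised 95% confidence intervals is a percentile bootstrap over models, sketched below with placeholder scores rather than the paper's results.

```python
# Percentile-bootstrap 95% CI for one Recovery-vs-benchmark correlation,
# resampling the 16 models with replacement. Scores below are placeholders.
import numpy as np

rng = np.random.default_rng(0)
recovery = rng.uniform(0.2, 0.6, size=16)      # placeholder Recovery scores
benchmark = rng.uniform(0.5, 0.9, size=16)     # placeholder benchmark scores

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    return float(np.corrcoef(x, y)[0, 1])

n = len(recovery)
boot = [
    pearson(recovery[idx], benchmark[idx])
    for idx in (rng.integers(0, n, size=n) for _ in range(10_000))
]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"r = {pearson(recovery, benchmark):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Resampling at the model level matches how the correlations themselves are computed; with only 16 models the intervals will be wide, which is itself useful context for the |r| < 0.3 claim.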
Circularity Check
No significant circularity in benchmark definition or empirical claims
Full rationale
The paper constructs ProactBench via a three-agent protocol (Planner/User-Agent/Assistant) to generate 198 dialogues with 624 trigger points, then runs 16 models to measure Recovery difficulty and its weak correlation with six external benchmarks. No equations, fitted parameters, self-citations, or ansatzes appear in the derivation; the central result is an empirical observation from the released corpus and model evaluations rather than a quantity forced by construction or prior self-referential definitions. The three-agent asymmetries are presented as a methodological choice whose effectiveness is left to external verification, not asserted by internal reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The three-phase taxonomy (Emergent, Critical, Recovery) captures the main forms of conversational proactivity.
- domain assumption: Information asymmetries between Planner, User Agent, and Assistant Model eliminate style confounding and leakage.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: “Their information asymmetries defend against style-confounded scoring, rubric leakage, external-context contamination, and information dumps.”
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: “Recovery is both difficult and weakly predicted by six standard benchmarks.”
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.