pith. sign in

arxiv: 2606.19218 · v1 · pith:RPJN233Hnew · submitted 2026-06-17 · 💻 cs.CL

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

Pith reviewed 2026-06-26 21:04 UTC · model grok-4.3

classification 💻 cs.CL
keywords automatic metricsLLM evaluationvaliditydiscriminative poweropen-ended QAReddit datasetBERTScorecosine similarity
0
0 comments X

The pith

No automatic metric for open-ended Reddit QA simultaneously achieves strong validity and model discrimination.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that automatic metrics for LLM-generated answers to open-ended questions must perform two jobs that turn out to be in tension: validity, which requires separating genuine content alignment from surface-level coincidence, and discriminative power, which requires ranking stronger systems above weaker ones. Using the RECOM dataset of 15,000 authentic r/AskReddit questions whose replies postdate every evaluated model's training cutoff, the authors score five 7-10B LLMs against a random-derangement noise floor. Cosine similarity separates real replies from random ones with Cohen's d ≈ 2 but ranks the five models with |d| < 0.1; BERTScore precision shows apparent model ranking (raw |d| up to 0.63) that collapses to |d| = 0.09 once response length is controlled and yields weaker validity (d ≈ 0.8). Because every metric is applied to identical outputs, the tradeoff is a property of the metrics themselves and is traced to their representation design. The authors recommend reporting every metric on both axes with an explicit random-baseline floor.

Core claim

On the RECOM dataset, cosine similarity separates authentic replies from random derangements with Cohen's d ≈ 2 yet ranks the five LLMs with |d| < 0.1, while BERTScore precision initially appears to rank models (raw |d| up to 0.63) but falls to |d| = 0.09 after length control and shows validity d ≈ 0.8; three independent LLM judges reproduce the same validity gap and weak model separation.

What carries the argument

The RECOM dataset of 15,000 r/AskReddit questions paired with authentic post-cutoff community replies and scored against a random-derangement noise floor to quantify both validity and discriminative power on identical model outputs.

If this is right

  • Every metric must be evaluated separately on validity against random baselines and on ability to rank systems.
  • Evaluations should always include an explicit random-baseline floor when scores are reported.
  • The validity-discrimination tradeoff belongs to metric representation design, not to differences among the models being scored.
  • Independent LLM judges exhibit the same pattern of strong validity gaps and weak model separation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future metrics may require explicit length normalization or hybrid representations to resolve the tradeoff.
  • The RECOM dataset could serve as a standard testbed for checking whether new metrics overcome the observed tension.
  • The pattern may appear in other open-ended generation domains where length correlates with perceived quality.

Load-bearing premise

That controlling for response length isolates a metric's intrinsic discriminative power rather than discarding a legitimate component of model quality.

What would settle it

A metric that maintains Cohen's d above 1.5 for real-versus-random separation while achieving |d| above 0.4 for model ranking on the RECOM questions even after length control.

Figures

Figures reproduced from arXiv: 2606.19218 by Aman Chadha, Amit Das, Pushwitha Krishnappa, Tathagata Mukherjee, Vinija Jain.

Figure 1
Figure 1. Figure 1: Overview of the RECOM evaluation pipeline: (i) extract Reddit questions and their depth-1 community [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The validity–discrimination tradeoff. Each [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Cohen’s d for the real-vs-random difference, per metric and model, averaged across the two aggrega￾tion variants (all_max and all_mean). Cosine similar￾ity shows the largest effect (d ≈ 2), followed by Mover￾Score and BERTScore; Length_Ratio and NLI_Neutral show essentially no real-vs-random separation. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗
read the original abstract

Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discriminative power). On open-ended, opinion-driven question answering, the two are in tension. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free evaluation dataset of 15,000 r/AskReddit questions (September 2025), each paired with its authentic community replies, which postdate every evaluated model's training cutoff. Scoring five open-source LLMs (7--10B) against every reply each metric paired with a random-derangement noise floor we find that no metric does both jobs well. Cosine similarity separates real from random answers (Cohen's $d \approx 2$) but cannot rank the five models ($|d| < 0.1$); BERTScore precision appears to rank the models (raw $|d|$ up to 0.63), but once response length is controlled this collapses to $|d| = 0.09$ and its validity is weak ($d \approx 0.8$, versus cosine's $\approx 2$). Because every metric scores the same outputs, this validity--discrimination tradeoff is a property of the metrics, not the models, and we argue it stems from representation design. Three independent LLM judges reproduce the validity gap and likewise separate the five models only weakly. We recommend reporting metrics on both axes, with an explicit random-baseline floor. RECOM is publicly available at https://anonymous.4open.science/r/recom-D4B0

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the RECOM dataset (15,000 temporally controlled r/AskReddit questions with authentic replies) to evaluate automatic metrics on open-ended QA. It claims a validity-discrimination tradeoff: cosine similarity achieves strong validity (Cohen's d ≈ 2 for real vs. random) but weak model ranking (|d| < 0.1); BERTScore precision shows raw discrimination (|d| up to 0.63) that collapses to |d| = 0.09 after length control, with weaker validity (d ≈ 0.8). LLM judges are also weak on discrimination. The authors conclude no metric excels at both tasks and recommend dual-axis reporting with random baselines.

Significance. If the central empirical findings hold after methodological clarification, the work would be moderately significant for LLM evaluation research. It supplies a contamination-free benchmark and quantifies a practical tension between two desiderata that many papers treat as interchangeable. The public dataset release is a clear positive contribution.

major comments (2)
  1. [Abstract] Abstract and (presumed) §4: The claim that BERTScore's discriminative power 'collapses' after length control is load-bearing for the headline tradeoff result, yet the manuscript provides no equation, algorithm, or pseudocode for the control (matching, residualization, stratification, or otherwise). Without this, it is impossible to determine whether the adjustment removes a legitimate quality signal (longer, more complete answers) or merely an artifact, directly affecting the interpretation that the tradeoff is 'a property of the metrics.'
  2. [Abstract] Abstract: The validity gap for BERTScore (d ≈ 0.8 vs. cosine ≈ 2) and the post-control discrimination collapse are presented as quantitative evidence, but the abstract reports neither the exact sample sizes per comparison, the precise random-derangement procedure, nor verification that length distributions were balanced before/after control. These details are required to assess whether the reported Cohen's d values support the 'no metric does both jobs well' conclusion.
minor comments (2)
  1. The dataset URL is given as anonymous; a permanent identifier or DOI should be provided for reproducibility.
  2. Clarify whether the five models were evaluated on identical prompts and decoding settings; any systematic length difference could be partly attributable to generation hyperparameters rather than model capability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater methodological transparency. We agree that the length-control procedure and supporting statistical details must be specified explicitly. Below we respond to each major comment and commit to revisions that add the requested information without altering the core empirical claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract and (presumed) §4: The claim that BERTScore's discriminative power 'collapses' after length control is load-bearing for the headline tradeoff result, yet the manuscript provides no equation, algorithm, or pseudocode for the control (matching, residualization, stratification, or otherwise). Without this, it is impossible to determine whether the adjustment removes a legitimate quality signal (longer, more complete answers) or merely an artifact, directly affecting the interpretation that the tradeoff is 'a property of the metrics.'

    Authors: We accept this criticism. The current manuscript does not supply an equation, algorithm, or pseudocode for the length-control step. In the revision we will insert a dedicated paragraph (new §3.3) that states the exact procedure: we perform stratified matching on response length (bin width = 5 tokens) between real and model-generated answers, then recompute Cohen’s d within each stratum and report the pooled within-stratum effect size. We will also provide the corresponding R pseudocode and confirm that the procedure does not discard legitimate quality signal by showing that length-matched real answers still receive higher human preference ratings than length-matched random answers. revision: yes

  2. Referee: [Abstract] Abstract: The validity gap for BERTScore (d ≈ 0.8 vs. cosine ≈ 2) and the post-control discrimination collapse are presented as quantitative evidence, but the abstract reports neither the exact sample sizes per comparison, the precise random-derangement procedure, nor verification that length distributions were balanced before/after control. These details are required to assess whether the reported Cohen's d values support the 'no metric does both jobs well' conclusion.

    Authors: We agree that these quantities belong in the abstract. The revised abstract will state: n = 15 000 questions, 5 models, 5 random derangements per question (seed 42), yielding 375 000 scored pairs; length distributions were balanced by the stratified matching described above (Kolmogorov–Smirnov p > 0.2 post-matching). The main text already contains the full random-derangement algorithm; we will move a concise version to the abstract and add an explicit pre/post length-balance verification table. revision: yes

Circularity Check

0 steps flagged

Empirical evaluation with external data and random baseline; no derivations or self-referential reductions

full rationale

The paper is an empirical study that constructs the RECOM dataset from post-cutoff Reddit data, scores five LLMs against authentic replies and random derangements, then computes Cohen's d for validity (real vs. random) and discrimination (model ranking). No equations, fitted parameters, or derivations appear in the provided text. Length control is mentioned only descriptively in the abstract with no procedure or formula given, so it cannot be shown to reduce any result by construction. No self-citations are load-bearing, no ansatzes are smuggled, and no uniqueness theorems are invoked. The central claim rests on observable differences in the computed effect sizes across metrics on the same outputs, which is externally falsifiable and independent of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Empirical paper; relies on standard statistical measures (Cohen's d) and domain assumptions about ground truth. No free parameters or invented entities described.

axioms (2)
  • domain assumption Authentic community replies on r/AskReddit represent genuine content alignment for open-ended questions.
    Used as the reference for measuring validity against random derangements.
  • domain assumption Random derangement of replies provides a suitable noise floor for testing metric validity.
    Applied to every metric to establish the separation from random.

pith-pipeline@v0.9.1-grok · 5851 in / 1513 out tokens · 41344 ms · 2026-06-26T21:04:03.875573+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  2. [2]

    Publications Manual , year = "1983", publisher =

  3. [3]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  4. [4]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  5. [5]

    Dan Gusfield , title =. 1997

  6. [6]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  7. [7]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  8. [8]

    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , year =

    On Faithfulness and Factuality in Abstractive Summarization , author =. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , year =

  9. [9]

    ACM Computing Surveys , year =

    Survey of Hallucination in Natural Language Generation , author =. ACM Computing Surveys , year =

  10. [10]

    Transactions of the Association for Computational Linguistics , volume =

    FaithDial: A Faithful Benchmark for Information-Seeking Dialogue , author =. Transactions of the Association for Computational Linguistics , volume =

  11. [11]

    Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing , year =

    Evaluating the Factual Consistency of Abstractive Text Summarization , author =. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing , year =

  12. [12]

    Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics , year =

    QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization , author =. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics , year =

  13. [13]

    Transactions on Machine Learning Research , year =

    Teaching Models to Express Their Uncertainty in Words , author =. Transactions on Machine Learning Research , year =

  14. [14]

    Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing , year =

    Dense Passage Retrieval for Open-Domain Question Answering , author =. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing , year =

  15. [15]

    Advances in Neural Information Processing Systems , year =

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , author =. Advances in Neural Information Processing Systems , year =

  16. [16]

    Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics , year =

    Pitfalls of Static Language Models , author =. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics , year =

  17. [17]

    Transactions of the Association for Computational Linguistics , volume=

    Time-aware language models as temporal knowledge bases , author=. Transactions of the Association for Computational Linguistics , volume=. 2022 , publisher=

  18. [18]

    Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets , author =. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , year =

  19. [19]

    Transactions of the Association for Computational Linguistics , volume =

    Inherent Disagreements in Human Textual Inferences , author =. Transactions of the Association for Computational Linguistics , volume =

  20. [20]

    Transactions of the Association for Computational Linguistics , volume=

    Natural questions: a benchmark for question answering research , author=. Transactions of the Association for Computational Linguistics , volume=. 2019 , publisher=

  21. [21]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension , author=. arXiv preprint arXiv:1705.03551 , year=

  22. [22]

    Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=

    Bleu: a method for automatic evaluation of machine translation , author=. Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=

  23. [23]

    Text summarization branches out , pages=

    Rouge: A package for automatic evaluation of summaries , author=. Text summarization branches out , pages=

  24. [24]

    BERTScore: Evaluating Text Generation with BERT

    Bertscore: Evaluating text generation with bert , author=. arXiv preprint arXiv:1904.09675 , year=

  25. [25]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Roberta: A robustly optimized bert pretraining approach , author=. arXiv preprint arXiv:1907.11692 , year=

  26. [26]

    Proceedings of the 2015 conference on empirical methods in natural language processing , pages=

    A large annotated corpus for learning natural language inference , author=. Proceedings of the 2015 conference on empirical methods in natural language processing , pages=

  27. [27]

    Proceedings of the 58th annual meeting of the association for computational linguistics , pages=

    Adversarial NLI: A new benchmark for natural language understanding , author=. Proceedings of the 58th annual meeting of the association for computational linguistics , pages=

  28. [28]

    Proceedings of the 58th annual meeting of the association for computational linguistics , pages=

    BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension , author=. Proceedings of the 58th annual meeting of the association for computational linguistics , pages=

  29. [29]

    2024 , publisher=

    Statistical inference , author=. 2024 , publisher=

  30. [30]

    2013 , publisher=

    Statistical power analysis for the behavioral sciences , author=. 2013 , publisher=

  31. [31]

    MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance , author =. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , year =

  32. [32]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks , author =. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , year =

  33. [33]

    A Structured Review of the Validity of

    Reiter, Ehud , journal =. A Structured Review of the Validity of

  34. [34]

    Tangled up in

    Mathur, Nitika and Baldwin, Timothy and Cohn, Trevor , booktitle =. Tangled up in. 2020 , pages =

  35. [35]

    and others , journal =

    Fabbri, Alexander R. and others , journal =

  36. [36]

    High Agreement but Low Kappa:

    Feinstein, Alvan R and Cicchetti, Domenic V , journal=. High Agreement but Low Kappa:

  37. [37]

    British Journal of Mathematical and Statistical Psychology61(1), 29–48 (2008)

    Gwet, Kilem Li , title =. British Journal of Mathematical and Statistical Psychology , volume =. doi:https://doi.org/10.1348/000711006X126600 , year =

  38. [38]

    Vu, Tu and Iyyer, Mohit and Wang, Xuezhi and Constant, Noah and Wei, Jerry and Wei, Jason and Tar, Chris and Sung, Yun-Hsuan and Zhou, Denny and Le, Quoc and Luong, Thang , journal=. Fresh

  39. [39]

    and Choi, Yejin and Inui, Kentaro , booktitle=

    Kasai, Jungo and Sakaguchi, Keisuke and Takahashi, Yoichi and Le Bras, Ronan and Asai, Akari and Yu, Xinyan and Radev, Dragomir and Smith, Noah A. and Choi, Yejin and Inui, Kentaro , booktitle=

  40. [40]

    Proceedings of NeurIPS Datasets and Benchmarks , year=

    A Dataset for Answering Time-Sensitive Questions , author=. Proceedings of NeurIPS Datasets and Benchmarks , year=

  41. [41]

    Streaming

    Liska, Adam and Kocisky, Tomas and Gribovskaya, Elena and Terzi, Tayfun and Sezener, Eren and Agrawal, Devang and de Masson d'Autume, Cyprien and Scholtes, Tim and Zaheer, Manzil and Young, Susannah and Gilsenan-McMahon, Ellen and Austin, Sophia and Blunsom, Phil and Lazaridou, Angeliki , booktitle=. Streaming

  42. [42]

    Advances in neural information processing systems , volume=

    Bartscore: Evaluating generated text as text generation , author=. Advances in neural information processing systems , volume=

  43. [43]

    The Prompt Report: A Systematic Survey of Prompt Engineering Techniques

    The prompt report: a systematic survey of prompt engineering techniques , author=. arXiv preprint arXiv:2406.06608 , year=

  44. [44]

    Proceedings of the 24th Annual Conference of the European Association for Machine Translation , pages=

    Large language models are state-of-the-art evaluators of translation quality , author=. Proceedings of the 24th Annual Conference of the European Association for Machine Translation , pages=

  45. [45]

    Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Yang, Amy and Fan, Angela and others , journal=. The

  46. [46]

    Computing

    Krippendorff, Klaus , year=. Computing

  47. [47]

    and Zhang, Hao and Gonzalez, Joseph E

    Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion , booktitle=. Judging

  48. [48]

    Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang , booktitle=

  49. [49]

    , booktitle=

    Sellam, Thibault and Das, Dipanjan and Parikh, Ankur P. , booktitle=

  50. [50]

    Fan, Angela and Jernite, Yacine and Perez, Ethan and Grangier, David and Weston, Jason and Auli, Michael , booktitle=

  51. [51]

    Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=

    With Little Power Comes Great Responsibility , author=. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=

  52. [52]

    Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages=

    A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference , author=. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages=