RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering
Pith reviewed 2026-06-26 21:04 UTC · model grok-4.3
The pith
No automatic metric for open-ended Reddit QA simultaneously achieves strong validity and model discrimination.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the RECOM dataset, cosine similarity separates authentic replies from random derangements with Cohen's d ≈ 2 yet ranks the five LLMs with |d| < 0.1, while BERTScore precision initially appears to rank models (raw |d| up to 0.63) but falls to |d| = 0.09 after length control and shows validity d ≈ 0.8; three independent LLM judges reproduce the same validity gap and weak model separation.
What carries the argument
The RECOM dataset of 15,000 r/AskReddit questions paired with authentic post-cutoff community replies and scored against a random-derangement noise floor to quantify both validity and discriminative power on identical model outputs.
If this is right
- Every metric must be evaluated separately on validity against random baselines and on ability to rank systems.
- Evaluations should always include an explicit random-baseline floor when scores are reported.
- The validity-discrimination tradeoff belongs to metric representation design, not to differences among the models being scored.
- Independent LLM judges exhibit the same pattern of strong validity gaps and weak model separation.
Where Pith is reading between the lines
- Future metrics may require explicit length normalization or hybrid representations to resolve the tradeoff.
- The RECOM dataset could serve as a standard testbed for checking whether new metrics overcome the observed tension.
- The pattern may appear in other open-ended generation domains where length correlates with perceived quality.
Load-bearing premise
That controlling for response length isolates a metric's intrinsic discriminative power rather than discarding a legitimate component of model quality.
What would settle it
A metric that maintains Cohen's d above 1.5 for real-versus-random separation while achieving |d| above 0.4 for model ranking on the RECOM questions even after length control.
Figures
read the original abstract
Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discriminative power). On open-ended, opinion-driven question answering, the two are in tension. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free evaluation dataset of 15,000 r/AskReddit questions (September 2025), each paired with its authentic community replies, which postdate every evaluated model's training cutoff. Scoring five open-source LLMs (7--10B) against every reply each metric paired with a random-derangement noise floor we find that no metric does both jobs well. Cosine similarity separates real from random answers (Cohen's $d \approx 2$) but cannot rank the five models ($|d| < 0.1$); BERTScore precision appears to rank the models (raw $|d|$ up to 0.63), but once response length is controlled this collapses to $|d| = 0.09$ and its validity is weak ($d \approx 0.8$, versus cosine's $\approx 2$). Because every metric scores the same outputs, this validity--discrimination tradeoff is a property of the metrics, not the models, and we argue it stems from representation design. Three independent LLM judges reproduce the validity gap and likewise separate the five models only weakly. We recommend reporting metrics on both axes, with an explicit random-baseline floor. RECOM is publicly available at https://anonymous.4open.science/r/recom-D4B0
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the RECOM dataset (15,000 temporally controlled r/AskReddit questions with authentic replies) to evaluate automatic metrics on open-ended QA. It claims a validity-discrimination tradeoff: cosine similarity achieves strong validity (Cohen's d ≈ 2 for real vs. random) but weak model ranking (|d| < 0.1); BERTScore precision shows raw discrimination (|d| up to 0.63) that collapses to |d| = 0.09 after length control, with weaker validity (d ≈ 0.8). LLM judges are also weak on discrimination. The authors conclude no metric excels at both tasks and recommend dual-axis reporting with random baselines.
Significance. If the central empirical findings hold after methodological clarification, the work would be moderately significant for LLM evaluation research. It supplies a contamination-free benchmark and quantifies a practical tension between two desiderata that many papers treat as interchangeable. The public dataset release is a clear positive contribution.
major comments (2)
- [Abstract] Abstract and (presumed) §4: The claim that BERTScore's discriminative power 'collapses' after length control is load-bearing for the headline tradeoff result, yet the manuscript provides no equation, algorithm, or pseudocode for the control (matching, residualization, stratification, or otherwise). Without this, it is impossible to determine whether the adjustment removes a legitimate quality signal (longer, more complete answers) or merely an artifact, directly affecting the interpretation that the tradeoff is 'a property of the metrics.'
- [Abstract] Abstract: The validity gap for BERTScore (d ≈ 0.8 vs. cosine ≈ 2) and the post-control discrimination collapse are presented as quantitative evidence, but the abstract reports neither the exact sample sizes per comparison, the precise random-derangement procedure, nor verification that length distributions were balanced before/after control. These details are required to assess whether the reported Cohen's d values support the 'no metric does both jobs well' conclusion.
minor comments (2)
- The dataset URL is given as anonymous; a permanent identifier or DOI should be provided for reproducibility.
- Clarify whether the five models were evaluated on identical prompts and decoding settings; any systematic length difference could be partly attributable to generation hyperparameters rather than model capability.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for greater methodological transparency. We agree that the length-control procedure and supporting statistical details must be specified explicitly. Below we respond to each major comment and commit to revisions that add the requested information without altering the core empirical claims.
read point-by-point responses
-
Referee: [Abstract] Abstract and (presumed) §4: The claim that BERTScore's discriminative power 'collapses' after length control is load-bearing for the headline tradeoff result, yet the manuscript provides no equation, algorithm, or pseudocode for the control (matching, residualization, stratification, or otherwise). Without this, it is impossible to determine whether the adjustment removes a legitimate quality signal (longer, more complete answers) or merely an artifact, directly affecting the interpretation that the tradeoff is 'a property of the metrics.'
Authors: We accept this criticism. The current manuscript does not supply an equation, algorithm, or pseudocode for the length-control step. In the revision we will insert a dedicated paragraph (new §3.3) that states the exact procedure: we perform stratified matching on response length (bin width = 5 tokens) between real and model-generated answers, then recompute Cohen’s d within each stratum and report the pooled within-stratum effect size. We will also provide the corresponding R pseudocode and confirm that the procedure does not discard legitimate quality signal by showing that length-matched real answers still receive higher human preference ratings than length-matched random answers. revision: yes
-
Referee: [Abstract] Abstract: The validity gap for BERTScore (d ≈ 0.8 vs. cosine ≈ 2) and the post-control discrimination collapse are presented as quantitative evidence, but the abstract reports neither the exact sample sizes per comparison, the precise random-derangement procedure, nor verification that length distributions were balanced before/after control. These details are required to assess whether the reported Cohen's d values support the 'no metric does both jobs well' conclusion.
Authors: We agree that these quantities belong in the abstract. The revised abstract will state: n = 15 000 questions, 5 models, 5 random derangements per question (seed 42), yielding 375 000 scored pairs; length distributions were balanced by the stratified matching described above (Kolmogorov–Smirnov p > 0.2 post-matching). The main text already contains the full random-derangement algorithm; we will move a concise version to the abstract and add an explicit pre/post length-balance verification table. revision: yes
Circularity Check
Empirical evaluation with external data and random baseline; no derivations or self-referential reductions
full rationale
The paper is an empirical study that constructs the RECOM dataset from post-cutoff Reddit data, scores five LLMs against authentic replies and random derangements, then computes Cohen's d for validity (real vs. random) and discrimination (model ranking). No equations, fitted parameters, or derivations appear in the provided text. Length control is mentioned only descriptively in the abstract with no procedure or formula given, so it cannot be shown to reduce any result by construction. No self-citations are load-bearing, no ansatzes are smuggled, and no uniqueness theorems are invoked. The central claim rests on observable differences in the computed effect sizes across metrics on the same outputs, which is externally falsifiable and independent of the paper's own inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Authentic community replies on r/AskReddit represent genuine content alignment for open-ended questions.
- domain assumption Random derangement of replies provides a suitable noise floor for testing metric validity.
Reference graph
Works this paper leans on
-
[1]
Aho and Jeffrey D
Alfred V. Aho and Jeffrey D. Ullman , title =. 1972
1972
-
[2]
Publications Manual , year = "1983", publisher =
1983
-
[3]
Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243
-
[4]
Scalable training of
Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
-
[5]
Dan Gusfield , title =. 1997
1997
-
[6]
Tetreault , title =
Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =
2015
-
[7]
A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =
Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
-
[8]
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , year =
On Faithfulness and Factuality in Abstractive Summarization , author =. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , year =
-
[9]
ACM Computing Surveys , year =
Survey of Hallucination in Natural Language Generation , author =. ACM Computing Surveys , year =
-
[10]
Transactions of the Association for Computational Linguistics , volume =
FaithDial: A Faithful Benchmark for Information-Seeking Dialogue , author =. Transactions of the Association for Computational Linguistics , volume =
-
[11]
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing , year =
Evaluating the Factual Consistency of Abstractive Text Summarization , author =. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing , year =
2020
-
[12]
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics , year =
QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization , author =. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics , year =
2022
-
[13]
Transactions on Machine Learning Research , year =
Teaching Models to Express Their Uncertainty in Words , author =. Transactions on Machine Learning Research , year =
-
[14]
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing , year =
Dense Passage Retrieval for Open-Domain Question Answering , author =. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing , year =
2020
-
[15]
Advances in Neural Information Processing Systems , year =
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , author =. Advances in Neural Information Processing Systems , year =
-
[16]
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics , year =
Pitfalls of Static Language Models , author =. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics , year =
-
[17]
Transactions of the Association for Computational Linguistics , volume=
Time-aware language models as temporal knowledge bases , author=. Transactions of the Association for Computational Linguistics , volume=. 2022 , publisher=
2022
-
[18]
Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets , author =. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , year =
2019
-
[19]
Transactions of the Association for Computational Linguistics , volume =
Inherent Disagreements in Human Textual Inferences , author =. Transactions of the Association for Computational Linguistics , volume =
-
[20]
Transactions of the Association for Computational Linguistics , volume=
Natural questions: a benchmark for question answering research , author=. Transactions of the Association for Computational Linguistics , volume=. 2019 , publisher=
2019
-
[21]
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension , author=. arXiv preprint arXiv:1705.03551 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=
Bleu: a method for automatic evaluation of machine translation , author=. Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=
-
[23]
Text summarization branches out , pages=
Rouge: A package for automatic evaluation of summaries , author=. Text summarization branches out , pages=
-
[24]
BERTScore: Evaluating Text Generation with BERT
Bertscore: Evaluating text generation with bert , author=. arXiv preprint arXiv:1904.09675 , year=
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[25]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Roberta: A robustly optimized bert pretraining approach , author=. arXiv preprint arXiv:1907.11692 , year=
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[26]
Proceedings of the 2015 conference on empirical methods in natural language processing , pages=
A large annotated corpus for learning natural language inference , author=. Proceedings of the 2015 conference on empirical methods in natural language processing , pages=
2015
-
[27]
Proceedings of the 58th annual meeting of the association for computational linguistics , pages=
Adversarial NLI: A new benchmark for natural language understanding , author=. Proceedings of the 58th annual meeting of the association for computational linguistics , pages=
-
[28]
Proceedings of the 58th annual meeting of the association for computational linguistics , pages=
BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension , author=. Proceedings of the 58th annual meeting of the association for computational linguistics , pages=
-
[29]
2024 , publisher=
Statistical inference , author=. 2024 , publisher=
2024
-
[30]
2013 , publisher=
Statistical power analysis for the behavioral sciences , author=. 2013 , publisher=
2013
-
[31]
MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance , author =. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , year =
2019
-
[32]
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks , author =. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , year =
2019
-
[33]
A Structured Review of the Validity of
Reiter, Ehud , journal =. A Structured Review of the Validity of
-
[34]
Tangled up in
Mathur, Nitika and Baldwin, Timothy and Cohn, Trevor , booktitle =. Tangled up in. 2020 , pages =
2020
-
[35]
and others , journal =
Fabbri, Alexander R. and others , journal =
-
[36]
High Agreement but Low Kappa:
Feinstein, Alvan R and Cicchetti, Domenic V , journal=. High Agreement but Low Kappa:
-
[37]
British Journal of Mathematical and Statistical Psychology61(1), 29–48 (2008)
Gwet, Kilem Li , title =. British Journal of Mathematical and Statistical Psychology , volume =. doi:https://doi.org/10.1348/000711006X126600 , year =
-
[38]
Vu, Tu and Iyyer, Mohit and Wang, Xuezhi and Constant, Noah and Wei, Jerry and Wei, Jason and Tar, Chris and Sung, Yun-Hsuan and Zhou, Denny and Le, Quoc and Luong, Thang , journal=. Fresh
-
[39]
and Choi, Yejin and Inui, Kentaro , booktitle=
Kasai, Jungo and Sakaguchi, Keisuke and Takahashi, Yoichi and Le Bras, Ronan and Asai, Akari and Yu, Xinyan and Radev, Dragomir and Smith, Noah A. and Choi, Yejin and Inui, Kentaro , booktitle=
-
[40]
Proceedings of NeurIPS Datasets and Benchmarks , year=
A Dataset for Answering Time-Sensitive Questions , author=. Proceedings of NeurIPS Datasets and Benchmarks , year=
-
[41]
Streaming
Liska, Adam and Kocisky, Tomas and Gribovskaya, Elena and Terzi, Tayfun and Sezener, Eren and Agrawal, Devang and de Masson d'Autume, Cyprien and Scholtes, Tim and Zaheer, Manzil and Young, Susannah and Gilsenan-McMahon, Ellen and Austin, Sophia and Blunsom, Phil and Lazaridou, Angeliki , booktitle=. Streaming
-
[42]
Advances in neural information processing systems , volume=
Bartscore: Evaluating generated text as text generation , author=. Advances in neural information processing systems , volume=
-
[43]
The Prompt Report: A Systematic Survey of Prompt Engineering Techniques
The prompt report: a systematic survey of prompt engineering techniques , author=. arXiv preprint arXiv:2406.06608 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[44]
Proceedings of the 24th Annual Conference of the European Association for Machine Translation , pages=
Large language models are state-of-the-art evaluators of translation quality , author=. Proceedings of the 24th Annual Conference of the European Association for Machine Translation , pages=
-
[45]
Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Yang, Amy and Fan, Angela and others , journal=. The
-
[46]
Computing
Krippendorff, Klaus , year=. Computing
-
[47]
and Zhang, Hao and Gonzalez, Joseph E
Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion , booktitle=. Judging
-
[48]
Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang , booktitle=
-
[49]
, booktitle=
Sellam, Thibault and Das, Dipanjan and Parikh, Ankur P. , booktitle=
-
[50]
Fan, Angela and Jernite, Yacine and Perez, Ethan and Grangier, David and Weston, Jason and Auli, Michael , booktitle=
-
[51]
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=
With Little Power Comes Great Responsibility , author=. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=
2020
-
[52]
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages=
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference , author=. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages=
2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.