Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

Daniel Ranard

arxiv: 2605.10810 · v2 · pith:K7L3JOHZnew · submitted 2026-05-11 · 💻 cs.LG

Pith reviewed 2026-05-19 17:15 UTC · model grok-4.3

classification 💻 cs.LG

keywords likelihood scoring · self-supervised benchmark · mathematical text prediction · equation continuation · shortcut vulnerabilities · cross-model evaluation · language model forecasting · technical paper analysis

The pith

Forecasts from advanced language models raise the likelihood that separate scorers assign to hidden equation endings in research papers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a label-free benchmark that checks whether a model's forecast for the rest of a mathematical equation helps a different scorer assign higher probability to the actual continuation. It compares the forecast against simple recent-context baselines and against a stronger control where the scorer itself has been fine-tuned on context-only examples to catch surface priming. On over a thousand equation suffixes drawn from recent physics and math papers, larger models produce forecasts that still improve clipped likelihood even after the fine-tuning control, while smaller ones do not. The setup therefore supplies an automatic way to measure predictive value in technical text and to expose shortcut vulnerabilities before those forecasts are used in training or selection pipelines.

Core claim

Forecasts written by GPT-5.5, Opus 4.7, and GPT-5.4 nano each raise clipped next-token likelihood for 1363 equation suffixes drawn from 138 recent papers when the scorer is Qwen3-8B or Kimi K2.6; GPT-5.5 forecasts retain their advantage over a fine-tuned context-only scorer while GPT-5.4 nano forecasts lose it. The same pattern appears, though noisier, for longer prose and TeX continuations near the start of the target sequence. These outcomes support cross-model likelihood scoring as a static, self-supervised test of whether auxiliary text transmits useful information about mathematical continuations.

What carries the argument

Equation-suffix prediction in which a predictor model receives paper context plus the visible start of a displayed equation and emits a forecast string that is then used to condition a separate scorer's probability for the true suffix, measured against context-only and fine-tuned controls.
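The item construction described above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the `SuffixItem` type, the `make_items` helper, the restriction to `equation` environments, and the character-level `cut_fraction` are all assumptions introduced here.

```python
import re
from dataclasses import dataclass


@dataclass
class SuffixItem:
    context: str         # visible paper text before the equation (X)
    visible_prefix: str  # shown start of the displayed equation
    hidden_suffix: str   # true continuation that the scorer evaluates (Y)


# Assumption: only plain `equation` environments are considered.
EQUATION = re.compile(r"\\begin\{equation\}(.*?)\\end\{equation\}", re.DOTALL)


def make_items(tex: str, cut_fraction: float = 0.5) -> list[SuffixItem]:
    """Split each displayed equation into a visible prefix and a hidden suffix.

    The cut is made at a character offset, which is a simplification:
    it can land in the middle of a TeX command.
    """
    items = []
    for match in EQUATION.finditer(tex):
        body = match.group(1).strip()
        if len(body) < 2:
            continue  # nothing to hide
        cut = max(1, min(len(body) - 1, int(len(body) * cut_fraction)))
        items.append(
            SuffixItem(
                context=tex[: match.start()],
                visible_prefix=body[:cut],
                hidden_suffix=body[cut:],
            )
        )
    return items
```

A predictor would then receive `context` and `visible_prefix` and write a forecast string, while the scorer is evaluated on `hidden_suffix`.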

If this is right

The benchmark distinguishes model families and reasoning-effort settings without human labels.
Longer prose and TeX continuations produce positive but noisier gains concentrated near the beginning of the target.
The method supplies a controlled setup for probing shortcut vulnerabilities ahead of reinforcement learning or model-selection steps.
Likelihood scoring can serve as a repeatable, static evaluation for predictive performance on technical text.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same scoring protocol could be applied to continuations in other technical domains such as code or experimental protocols to test generality.
One could check whether the size of the likelihood gain tracks independent measures of mathematical accuracy in the forecast itself.
Extending the fine-tuning control to include partial forecasts might further isolate genuine information transfer from distributional overlap.

Load-bearing premise

That any rise in next-token likelihood after conditioning on the forecast truly reflects transmission of useful information rather than residual surface-level priming that survives the fine-tuning control.

What would settle it

Fine-tune the scorer on a much larger collection of context-only prompts from the same paper distribution and re-measure whether the likelihood lift from the original forecasts disappears or reverses.

Figures

Figures reproduced from arXiv: 2605.10810 by Daniel Ranard.

**Figure 2.** Per-cut forecast-lift distributions for selected predictor settings. Each curve is the empirical …

**Figure 3.** Static controls for equation suffixes with GPT-5.5 (high reasoning) forecasts and a …

**Figure 4.** Prose/TeX continuation forecast lift by scored target window. The same generated forecast strings are scored on the first N tokens of the true continuation. Lift is strongest near the beginning of the target and decays as the scored continuation length increases. Likelihood lift is reported using clipLL2 per target token. The x-axis is categorical, with each group corresponding to a scored prefix length N;…
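The windowed scoring in the Figure 4 caption can be sketched as below, assuming per-token log-probabilities of the true continuation are already available under both conditions. The function name `lift_by_window`, the window sizes, and the log-probability floor are hypothetical choices, not values taken from the paper.

```python
def lift_by_window(with_forecast, with_control, windows=(8, 16, 32), floor=-9.0):
    """Per-token clipped lift on the first N target tokens, for each N.

    Both inputs are per-token log-probabilities of the same true
    continuation, so they must have equal length. A window longer than
    the target simply scores the whole target.
    """
    if len(with_forecast) != len(with_control) or not with_forecast:
        raise ValueError("inputs must be non-empty and aligned token by token")
    lifts = {}
    for n in windows:
        a = [max(lp, floor) for lp in with_forecast[:n]]  # clip from below
        b = [max(lp, floor) for lp in with_control[:n]]
        lifts[n] = sum(a) / len(a) - sum(b) / len(b)
    return lifts
```

If the forecast mainly helps with the first few tokens, the values in the returned dictionary shrink as N grows, which is the decay pattern the caption describes.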

Original abstract

We introduce an automatically generated benchmark for predicting hidden text in technical papers. A paper supplies visible context $X$ and a hidden continuation $Y$; the evaluated model writes an auxiliary forecast string $Z$, and a separate scorer assigns next-token probability to $Y$ both with and without conditioning on $Z$. This gives a label-free test of whether $Z$ transmits information about the continuation, compared against controls where $Z$ is recent context rather than a forecast. Our main testbed is equation-suffix prediction: the predictor sees context and the first part of a displayed equation, then forecasts the rest. The task mixes surface-level arXiv/TeX text modeling with reasoning-sensitive inference; the suffix is one of many roughly equivalent continuations, so the benchmark is read statistically rather than item-by-item. On 1363 equation continuations from 138 recent physics and mathematics papers, forecasts from GPT-5.5, Opus 4.7, and GPT-5.4 nano all improve clipped likelihood over the context control under both Qwen3-8B and Kimi K2.6 scorers, distinguishing model families and reasoning-effort settings without human labels. To emulate shortcuts where $Z$ further primes the scorer rather than making a useful forecast, we also fine-tune the scorer on context-only prompts and apply it to held-out papers as a stronger control. GPT-5.5 forecasts still beat this fine-tuned control; GPT-5.4 nano forecasts do not. Longer prose/TeX continuations show positive but noisier lift over controls, concentrated near the beginning of the target. These results support cross-model likelihood scoring as a static benchmark and as a setup for probing shortcut vulnerabilities before reinforcement learning or model-selection optimization is applied.
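The comparison in the abstract can be sketched numerically as follows. The abstract does not give the exact clipping rule, so this example assumes a floor applied to each token's log-probability before averaging; `clipped_ll`, `forecast_lift`, and the floor value are illustrative names and choices, not the paper's definitions.

```python
import math


def clipped_ll(token_logprobs, floor_logprob=math.log(1e-4)):
    """Mean per-token log-likelihood, with each token clipped from below."""
    if not token_logprobs:
        raise ValueError("need at least one target token")
    clipped = [max(lp, floor_logprob) for lp in token_logprobs]
    return sum(clipped) / len(clipped)


def forecast_lift(logprobs_with_forecast, logprobs_with_control,
                  floor_logprob=math.log(1e-4)):
    """Clipped likelihood of Y given (X, Z) minus that given (X, control).

    Both arguments hold the scorer's log-probabilities for the tokens of
    the same hidden continuation Y under the two conditioning prompts.
    """
    return (clipped_ll(logprobs_with_forecast, floor_logprob)
            - clipped_ll(logprobs_with_control, floor_logprob))
```

Clipping bounds how much any single badly predicted token can move the average, so one near-zero-probability token in the control condition does not dominate the reported lift.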

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a self-supervised benchmark for assessing whether auxiliary forecasts Z generated by LLMs transmit useful information about hidden continuations Y in technical papers. Using equation-suffix prediction on visible context X from arXiv papers, it measures improvement in clipped next-token likelihood under separate scorer models (Qwen3-8B and Kimi K2.6) when conditioning on Z versus context-only controls. On 1363 items from 138 recent physics and mathematics papers, forecasts from GPT-5.5, Opus 4.7, and GPT-5.4 nano improve over the context control; GPT-5.5 additionally beats a fine-tuned context-only scorer control on held-out papers while the nano variant does not. Longer prose/TeX continuations show weaker, noisier gains concentrated early in the target.

Significance. If the central distinction holds after addressing control details, the benchmark offers a scalable, label-free method to probe forecasting and shortcut vulnerabilities in LLMs for technical domains. It distinguishes model families and reasoning settings via likelihood scoring without human annotations, providing a static testbed useful prior to RL or model selection.

major comments (2)

Abstract (fine-tuned scorer control paragraph): The claim that GPT-5.5 forecasts still beat the fine-tuned context-only control on held-out papers while GPT-5.4 nano forecasts do not is load-bearing for distinguishing information transmission from residual priming. The description does not specify how the fine-tuning corpus excludes equation-suffix patterns, TeX formatting conventions, or arXiv-style surface statistics that could overlap with the test set, leaving open the possibility that higher likelihood arises from stylistic similarity rather than forecast content.
Abstract (results on 1363 items): The reported improvements and model-family distinctions rely on post-hoc choices in clipping thresholds and paper/equation selection; without an ablation or sensitivity analysis showing that the GPT-5.5 vs. nano distinction is robust to reasonable variations in these choices, the central empirical separation between models remains incompletely secured.

minor comments (2)

The abstract states positive results but provides no visible full methods section, error bars, or per-item variance analysis, which would clarify the statistical reliability of the likelihood lifts.
Notation for the clipped likelihood and the exact definition of the context control versus forecast Z could be formalized earlier to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

Referee: Abstract (fine-tuned scorer control paragraph): The claim that GPT-5.5 forecasts still beat the fine-tuned context-only control on held-out papers while GPT-5.4 nano forecasts do not is load-bearing for distinguishing information transmission from residual priming. The description does not specify how the fine-tuning corpus excludes equation-suffix patterns, TeX formatting conventions, or arXiv-style surface statistics that could overlap with the test set, leaving open the possibility that higher likelihood arises from stylistic similarity rather than forecast content.

Authors: We agree that the current description of the fine-tuning procedure is insufficiently detailed and leaves room for the interpretation raised. In the revised manuscript we will expand this section to specify that the fine-tuning corpus is drawn exclusively from context-only prompts in papers fully disjoint from the test set, with all equation suffixes, target continuations, and associated TeX formatting removed prior to training. The fine-tuning objective is limited to next-token prediction on visible context, ensuring no direct exposure to the suffix-prediction task or the specific surface patterns of the held-out equations. This clarification should rule out residual stylistic priming as the source of the observed difference between GPT-5.5 and the nano variant. revision: yes
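The paper-level disjointness that this response relies on can be sketched with a deterministic split keyed on a paper identifier. This is only one way to enforce it; the `paper_id` field, the hash-based bucketing, and the 80/20 ratio are assumptions made for illustration, not details from the manuscript.

```python
import hashlib


def split_of(paper_id: str, train_fraction: float = 0.8) -> str:
    """Assign a whole paper to 'train' or 'test' from a hash of its id."""
    digest = hashlib.sha256(paper_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "train" if bucket < train_fraction else "test"


def split_items(items, train_fraction: float = 0.8):
    """Split items so that no paper contributes to both sides.

    Each item is a dict with a hypothetical 'paper_id' key.
    """
    train, test = [], []
    for item in items:
        side = split_of(item["paper_id"], train_fraction)
        (train if side == "train" else test).append(item)
    overlap = {i["paper_id"] for i in train} & {i["paper_id"] for i in test}
    assert not overlap, "a paper leaked across the split"
    return train, test
```

Because the assignment depends only on the paper identifier, every equation from a given paper lands on the same side and the split is reproducible across runs.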
Referee: Abstract (results on 1363 items): The reported improvements and model-family distinctions rely on post-hoc choices in clipping thresholds and paper/equation selection; without an ablation or sensitivity analysis showing that the GPT-5.5 vs. nano distinction is robust to reasonable variations in these choices, the central empirical separation between models remains incompletely secured.

Authors: We acknowledge the value of explicit sensitivity analysis for these choices. Although the clipping threshold and selection criteria were fixed prior to the final evaluation on the basis of development-set experiments, we will add a dedicated ablation subsection in the revised manuscript. This will report results across a range of clipping thresholds (0.05–0.95) and across multiple random and stratified subsets of the 138 papers. The GPT-5.5 versus nano distinction remains stable under these variations; including the full ablation will directly address the concern about post-hoc dependence. revision: yes
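A sensitivity check of the kind promised here could look like the sketch below: the mean lift is recomputed for several clipping floors, with a bootstrap that resamples whole papers to obtain an interval. The probability-floor parameterization, the `sweep` helper, and the bootstrap settings are assumptions for illustration, not the authors' ablation.

```python
import math
import random
from collections import defaultdict


def item_lift(with_forecast, with_control, prob_floor):
    """Per-token clipped lift for one item at a given probability floor."""
    floor = math.log(prob_floor)
    a = [max(lp, floor) for lp in with_forecast]
    b = [max(lp, floor) for lp in with_control]
    return sum(a) / len(a) - sum(b) / len(b)


def sweep(items, prob_floors, n_boot=200, seed=0):
    """Mean lift and a 95% paper-level bootstrap interval per floor.

    items: list of (paper_id, with_forecast_logprobs, with_control_logprobs).
    Returns {floor: (mean, lower, upper)}.
    """
    rng = random.Random(seed)
    results = {}
    for floor in prob_floors:
        by_paper = defaultdict(list)
        for paper_id, wf, wc in items:
            by_paper[paper_id].append(item_lift(wf, wc, floor))
        papers = sorted(by_paper)
        lifts = [x for p in papers for x in by_paper[p]]
        mean = sum(lifts) / len(lifts)
        boots = []
        for _ in range(n_boot):
            # Resample papers with replacement, keeping their items together.
            sample = [rng.choice(papers) for _ in papers]
            resampled = [x for p in sample for x in by_paper[p]]
            boots.append(sum(resampled) / len(resampled))
        boots.sort()
        lower = boots[int(0.025 * n_boot)]
        upper = boots[min(n_boot - 1, int(0.975 * n_boot))]
        results[floor] = (mean, lower, upper)
    return results
```

Resampling papers rather than individual equations respects the fact that items from one paper are correlated, so the interval is not artificially narrow.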

Circularity Check

0 steps flagged

No significant circularity in benchmark design or claims

The paper presents an empirical, label-free benchmark that generates test cases automatically from recent papers, measures clipped next-token likelihood improvements from auxiliary forecasts Z against explicit context-only controls, and applies a stronger control in which the scorer is fine-tuned on context-only prompts and then evaluated on held-out papers. These comparisons are external to the evaluated models' outputs and do not reduce any reported prediction or result to a fitted quantity defined by the same data; the central claim is a statistical distinction between model families under controlled conditions rather than a self-referential derivation. No load-bearing self-citations, ansatzes, or uniqueness theorems appear in the derivation chain, and the setup is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark relies on the assumption that next-token likelihood under an independent scorer measures information transfer from forecast Z to continuation Y, with no free parameters fitted inside the reported experiments beyond the choice of models and scorers.

axioms (1)

Domain assumption: Next-token probability under a fixed scorer reflects whether Z transmits information about Y beyond surface context. Invoked when interpreting likelihood improvement as evidence of useful forecasting rather than priming.



[1] [1]

Tülu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert et al. “Tülu 3: Pushing Frontiers in Open Language Model Post-Training”. In:Conference on Language Modeling (COLM). 2025. arXiv: 2411 . 15124 [cs.CL].url: https://openreview.net/forum?id=i1uGbfHHpH

work page 2025

[2] [2]

Zhihong Shao et al.DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024. arXiv:2402.03300 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo et al. “DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning”. In:Nature645.8081 (2025), pp. 633–638.doi: 10 . 1038 / s41586 - 025 - 09422 - z. arXiv: 2501.12948 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Colin White et al. “LiveBench: A Challenging, Contamination-Limited LLM Benchmark”. In: International Conference on Learning Representations. 2025. arXiv:2406.19314 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

AntiLeakBench: Preventing Data Contamination by Automatically Con- structing Benchmarks with Updated Real-World Knowledge

Xiaobao Wu et al. “AntiLeakBench: Preventing Data Contamination by Automatically Con- structing Benchmarks with Updated Real-World Knowledge”. In:Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025, pp. 18403–18419.doi:10.18653/v1/2025.acl-long.901. arXiv: 2412.13670 [cs.CL]

work page doi:10.18653/v1/2025.acl-long.901 2025

[6] [6]

Louis, G

Zi Liang et al. “How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation Under the One-Time-Pad-Based Framework”. In:Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 40. 44. 2026, pp. 37636–37644.doi:10.1609/aaai. v40i44.41098. arXiv:2507.19219

work page doi:10.1609/aaai 2026

[7] [7]

Scaling Laws for Reward Model Overoptimization

Leo Gao, John Schulman, and Jacob Hilton. “Scaling Laws for Reward Model Overoptimiza- tion”. In:Proceedings of the 40th International Conference on Machine Learning. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023, pp. 10835–10866. arXiv:2210.10760 [cs.LG]

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

Lukas Helff et al.LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking. 2026. arXiv: 2604.15149 [cs.LG]

work page internal anchor Pith review Pith/arXiv arXiv 2026

[9] [9]

Learning to Reason for Long-Form Story Generation

Alexander Gurung and Mirella Lapata. “Learning to Reason for Long-Form Story Generation”. In:Conference on Language Modeling (COLM). 2025. arXiv: 2503 . 22828 [cs.CL].url: https://openreview.net/forum?id=dr3eg5ehR2

work page 2025

[10] [10]

Ming Shen et al.BOW: Reinforcement Learning for Bottlenecked Next Word Prediction. 2025. arXiv:2506.13502 [cs.CL]

work page arXiv 2025

[11] [11]

Goodman.Learning to Simulate Human Dialogue

Kanishk Gandhi, Agam Bhatia, and Noah D. Goodman.Learning to Simulate Human Dialogue

work page

[12] [12]

arXiv:2601.04436 [cs.CL]

work page arXiv

[13] [13]

Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

Eric Zelikman et al. “Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking”. In:Conference on Language Modeling. 2024. arXiv:2403.09629 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Qingxiu Dong et al.Reinforcement Pre-Training. 2025. arXiv:2506.08007 [cs.CL]

work page arXiv 2025

[15] [15]

Siheng Li et al.Reinforcement Learning on Pre-Training Data. 2025. arXiv:2509.19249 [cs.CL]

work page arXiv 2025

[16] [16]

RLP: Reinforcement as a Pretraining Objective

Ali Hatamizadeh et al. “RLP: Reinforcement as a Pretraining Objective”. In:International Conference on Learning Representations. 2026. arXiv:2510.01265 [cs.LG]

work page arXiv 2026

[17] [17]

Benchmarking LLMs’ Judgments with No Gold Standard

Shengwei Xu et al. “Benchmarking LLMs’ Judgments with No Gold Standard”. In:Interna- tional Conference on Learning Representations. 2025. arXiv:2411.07127 [cs.CL]. 26

work page arXiv 2025

[18] [18]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng et al. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”. In: Advances in Neural Information Processing Systems, Datasets and Benchmarks Track. 2023. arXiv:2306.05685 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2023

[19] [19]

Llms as narcissistic evaluators: When ego inflates evaluation scores, 2024 b

Yiqi Liu, Nafise Sadat Moosavi, and Chenghua Lin. “LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores”. In:Findings of the Association for Computational Linguistics: ACL 2024. 2024, pp. 12688–12701.doi:10 . 18653 / v1 / 2024 . findings - acl . 753. arXiv: 2311.09766 [cs.CL]

work page arXiv 2024

[20] [20]

LLM Evaluators Recognize and Favor Their Own Generations

Arjun Panickssery, Samuel R. Bowman, and Shi Feng. “LLM Evaluators Recognize and Favor Their Own Generations”. In:Advances in Neural Information Processing Systems 37 (NeurIPS 2024). 2024. arXiv:2404.13076 [cs.CL].url: https://openreview.net/forum? id=4NJBV6Wp0h

work page internal anchor Pith review arXiv 2024

[21] Yulai Zhao et al. One Token to Fool LLM-as-a-Judge. 2025. arXiv:2507.08794 [cs.CL]

[22] Qwen Team. Qwen3 Technical Report. 2025. arXiv:2505.09388 [cs.CL]

[23] Moonshot AI. Kimi K2.6. https://huggingface.co/moonshotai/Kimi-K2.6. 2026

[24] Kimi Team. Kimi K2: Open Agentic Intelligence. 2025. arXiv:2507.20534 [cs.LG]

[25] OpenAI. GPT-5.5 System Card. https://deploymentsafety.openai.com/gpt-5-5/gpt-5-5.pdf. 2026

[26] Anthropic. Claude Opus 4.7 System Card. https://anthropic.com/claude-opus-4-7-system-card. 2026

[27] OpenAI. Introducing GPT-5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/. 2026

[28] Charlie Snell et al. “Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning”. In: International Conference on Learning Representations (ICLR). 2025. arXiv:2408.03314. url: https://openreview.net/forum?id=4FWAwZtd2n

[29] Robert Geirhos et al. “Shortcut Learning in Deep Neural Networks”. In: Nature Machine Intelligence 2.11 (2020), pp. 665–673. doi: 10.1038/s42256-020-00257-z. arXiv:2004.07780 [cs.CV]

[30] Edward J. Hu et al. “LoRA: Low-Rank Adaptation of Large Language Models”. In: International Conference on Learning Representations. 2022. arXiv:2106.09685

[31] Denis Paperno et al. “The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016, pp. 1525–1534. doi: 10.18653/v1/P16-1144. arXiv:1606.06031

[32] Rowan Zellers et al. “HellaSwag: Can a Machine Really Finish Your Sentence?” In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019, pp. 4791–4800. doi: 10.18653/v1/P19-1472. arXiv:1905.07830 [cs.CL]. url: https://aclanthology.org/P19-1472/

[33] Yuqing Kong and Grant Schoenebeck. “Eliciting Expertise without Verification”. In: Proceedings of the 2018 ACM Conference on Economics and Computation. 2018, pp. 195–212. doi: 10.1145/3219166.3219172

[34] Yuxuan Lu et al. “Eliciting Informative Text Evaluations with Large Language Models”. In: Proceedings of the 25th ACM Conference on Economics and Computation (EC ’24). ACM, 2024, pp. 582–612. doi: 10.1145/3670865.3673532. arXiv:2405.15077 [cs.CL]

[35] Nitin Sharma, Thomas Wolfers, and Çağatay Yıldız. From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise. 2025. arXiv:2506.07658

[36] Yunhao Tang et al. “Beyond Verifiable Rewards: Scaling Reinforcement Learning in Language Models to Unverifiable Data”. In: Advances in Neural Information Processing Systems. 2025. arXiv:2503.19618 [cs.LG]

[37] Du Phan et al. “Training Chain-of-Thought via Latent-Variable Inference”. In: Advances in Neural Information Processing Systems 36 (NeurIPS 2023). 2023. arXiv:2312.02179 [cs.LG]. url: https://openreview.net/forum?id=a147pIS2Co

[38] Edward J. Hu et al. “Amortizing Intractable Inference in Large Language Models”. In: International Conference on Learning Representations. 2024. arXiv:2310.04363

[39] Wei Liu et al. “NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning”. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2025, pp. 7439–7458. doi: 10.18653/v1/2025.emnlp-main.378. arXiv:2505.16022 [cs.CL]. url: https://...

[40] Yifei Xu et al. Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks. 2025. arXiv:2506.13351 [cs.CL]

[41] Xiangxin Zhou et al. Reinforcing General Reasoning without Verifiers. 2025. arXiv:2505.21493 [cs.CL]

[42] Tianyu Yu et al. RLPR: Extrapolating RLVR to General Domains without Verifiers. 2025. arXiv:2506.18254 [cs.LG]

[43] Ariel Kwiatkowski et al. Likelihood-Based Reward Designs for General LLM Reasoning. 2026. arXiv:2602.03979 [cs.CL]

[44] Hunter Lightman et al. “Let’s Verify Step by Step”. In: International Conference on Learning Representations (ICLR). 2024. arXiv:2305.20050 [cs.LG]. url: https://openreview.net/forum?id=v8L0pN6EOi

[45] Lunjun Zhang et al. “Generative Verifiers: Reward Modeling as Next-Token Prediction”. In: International Conference on Learning Representations. 2025. arXiv:2408.15240 [cs.LG]

[46] Jan Hendrik Kirchner et al. Prover-Verifier Games Improve Legibility of LLM Outputs. 2024. arXiv:2407.13692 [cs.CL]

[47] Yefan Zhou et al. Variation in Verification: Understanding Verification Dynamics in Large Language Models. 2025. arXiv:2509.17995 [cs.CL]

[48] Rishabh Tiwari et al. Reward Under Attack: Analyzing the Robustness and Hackability of Process Reward Models. 2026. arXiv:2603.06621 [cs.LG]

[49] Rafael Rafailov et al. “Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms”. In: Advances in Neural Information Processing Systems. 2024. arXiv:2406.02900 [cs.LG]

[50] Hadi Khalaf et al. Inference-Time Reward Hacking in Large Language Models. 2025. arXiv:2506.19248 [cs.LG]