Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities
Pith reviewed 2026-05-19 17:15 UTC · model grok-4.3
The pith
Forecasts from advanced language models raise the likelihood that separate scorers assign to hidden equation endings in research papers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Forecasts written by GPT-5.5, Opus 4.7, and GPT-5.4 nano each raise clipped next-token likelihood over the context control on 1363 equation suffixes drawn from 138 recent physics and mathematics papers, under both the Qwen3-8B and Kimi K2.6 scorers. GPT-5.5 forecasts keep their advantage against a fine-tuned context-only scorer; GPT-5.4 nano forecasts do not. Longer prose and TeX continuations show the same pattern with more noise, and the gain is concentrated near the start of the target sequence. These outcomes support cross-model likelihood scoring as a static, self-supervised test of whether auxiliary text transmits useful information about mathematical continuations.
What carries the argument
Equation-suffix prediction: a predictor model receives paper context plus the visible start of a displayed equation and emits a forecast string. A separate scorer then conditions on that string when assigning probability to the true suffix, and the result is compared against context-only and fine-tuned controls.
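The comparison described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the per-token probability floor `floor_prob`, and the choice to average log-probabilities over target tokens are all assumptions, since the abstract does not define the clipping rule. The inputs are per-token log-probabilities that a scorer assigned to the true suffix, once with the forecast in the prompt and once with the control text.

```python
import math

def clipped_mean_logprob(token_logprobs, floor_prob=1e-4):
    """Mean per-token log-probability, with each token clipped from below.

    floor_prob is a hypothetical clipping floor; the source does not give one.
    """
    if not token_logprobs:
        raise ValueError("need at least one target token")
    floor = math.log(floor_prob)
    return sum(max(lp, floor) for lp in token_logprobs) / len(token_logprobs)

def forecast_lift(logprobs_with_forecast, logprobs_with_control, floor_prob=1e-4):
    """Clipped score of the suffix given the forecast minus the score given the control."""
    return (clipped_mean_logprob(logprobs_with_forecast, floor_prob)
            - clipped_mean_logprob(logprobs_with_control, floor_prob))

# Toy numbers for a three-token suffix (made up for illustration).
with_forecast = [math.log(0.5), math.log(0.25), math.log(0.5)]
with_control = [math.log(0.25), math.log(1e-9), math.log(0.5)]
lift = forecast_lift(with_forecast, with_control)  # positive: the forecast helped
```

Clipping from below keeps a single near-zero token probability from dominating an item's score, which is one plausible reason to clip at all; a positive lift on one item says little by itself, and the benchmark is meant to be read in aggregate.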
If this is right
- The benchmark distinguishes model families and reasoning-effort settings without human labels.
- Longer prose and TeX continuations produce positive but noisier gains concentrated near the beginning of the target.
- The method supplies a controlled setup for probing shortcut vulnerabilities ahead of reinforcement learning or model-selection steps.
- Likelihood scoring can serve as a repeatable, static evaluation for predictive performance on technical text.
Where Pith is reading between the lines
- The same scoring protocol could be applied to continuations in other technical domains such as code or experimental protocols to test generality.
- One could check whether the size of the likelihood gain tracks independent measures of mathematical accuracy in the forecast itself.
- Extending the fine-tuning control to include partial forecasts might further isolate genuine information transfer from distributional overlap.
Load-bearing premise
That any rise in next-token likelihood after conditioning on the forecast truly reflects transmission of useful information rather than residual surface-level priming that survives the fine-tuning control.
What would settle it
Fine-tune the scorer on a much larger collection of context-only prompts from the same paper distribution and re-measure whether the likelihood lift from the original forecasts disappears or reverses.
Original abstract
We introduce an automatically generated benchmark for predicting hidden text in technical papers. A paper supplies visible context $X$ and a hidden continuation $Y$; the evaluated model writes an auxiliary forecast string $Z$, and a separate scorer assigns next-token probability to $Y$ both with and without conditioning on $Z$. This gives a label-free test of whether $Z$ transmits information about the continuation, compared against controls where $Z$ is recent context rather than a forecast. Our main testbed is equation-suffix prediction: the predictor sees context and the first part of a displayed equation, then forecasts the rest. The task mixes surface-level arXiv/TeX text modeling with reasoning-sensitive inference; the suffix is one of many roughly equivalent continuations, so the benchmark is read statistically rather than item-by-item. On 1363 equation continuations from 138 recent physics and mathematics papers, forecasts from GPT-5.5, Opus 4.7, and GPT-5.4 nano all improve clipped likelihood over the context control under both Qwen3-8B and Kimi K2.6 scorers, distinguishing model families and reasoning-effort settings without human labels. To emulate shortcuts where $Z$ further primes the scorer rather than making a useful forecast, we also fine-tune the scorer on context-only prompts and apply it to held-out papers as a stronger control. GPT-5.5 forecasts still beat this fine-tuned control; GPT-5.4 nano forecasts do not. Longer prose/TeX continuations show positive but noisier lift over controls, concentrated near the beginning of the target. These results support cross-model likelihood scoring as a static benchmark and as a setup for probing shortcut vulnerabilities before reinforcement learning or model-selection optimization is applied.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a self-supervised benchmark for assessing whether auxiliary forecasts Z generated by LLMs transmit useful information about hidden continuations Y in technical papers. Using equation-suffix prediction on visible context X from arXiv papers, it measures improvement in clipped next-token likelihood under separate scorer models (Qwen3-8B and Kimi K2.6) when conditioning on Z versus context-only controls. On 1363 items from 138 recent physics and mathematics papers, forecasts from GPT-5.5, Opus 4.7, and GPT-5.4 nano improve over the context control; GPT-5.5 additionally beats a fine-tuned context-only scorer control on held-out papers while the nano variant does not. Longer prose/TeX continuations show weaker, noisier gains concentrated early in the target.
Significance. If the central distinction holds once the control details are addressed, the benchmark offers a scalable, label-free method for probing forecasting ability and shortcut vulnerabilities in LLMs on technical text. It distinguishes model families and reasoning settings through likelihood scoring alone, without human annotations, and provides a static testbed that can be run before RL or model selection.
major comments (2)
- Abstract (fine-tuned scorer control paragraph): The claim that GPT-5.5 forecasts still beat the fine-tuned context-only control on held-out papers while GPT-5.4 nano forecasts do not is load-bearing for distinguishing information transmission from residual priming. The description does not specify how the fine-tuning corpus excludes equation-suffix patterns, TeX formatting conventions, or arXiv-style surface statistics that could overlap with the test set, leaving open the possibility that higher likelihood arises from stylistic similarity rather than forecast content.
- Abstract (results on 1363 items): The reported improvements and model-family distinctions rely on post-hoc choices in clipping thresholds and paper/equation selection; without an ablation or sensitivity analysis showing that the GPT-5.5 vs. nano distinction is robust to reasonable variations in these choices, the central empirical separation between models remains incompletely secured.
minor comments (2)
- The abstract states positive results but provides no visible full methods section, error bars, or per-item variance analysis, which would clarify the statistical reliability of the likelihood lifts.
- Notation for the clipped likelihood and the exact definition of the context control versus forecast Z could be formalized earlier to aid reproducibility.
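One way to produce the error bars requested above is a bootstrap that resamples whole papers, since the 1363 items come from only 138 papers and items from the same paper are unlikely to be independent. The sketch below is an assumption about how such an analysis could look, not a description of what the authors did; `paper_bootstrap_ci` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict

def paper_bootstrap_ci(lifts, paper_ids, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for the mean lift, resampling papers with replacement."""
    by_paper = defaultdict(list)
    for lift, pid in zip(lifts, paper_ids):
        by_paper[pid].append(lift)
    papers = list(by_paper)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Draw as many papers as there are in the data, keeping each paper's items together.
        sample = [x for pid in rng.choices(papers, k=len(papers)) for x in by_paper[pid]]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

# Toy data: six items from three papers (values made up for illustration).
lifts = [0.4, 0.1, 0.3, -0.2, 0.5, 0.2]
paper_ids = ["a", "a", "b", "b", "c", "c"]
low, high = paper_bootstrap_ci(lifts, paper_ids)
```

Resampling at the paper level generally gives wider intervals than resampling items, which is the more conservative reading when a few papers contribute many equations.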
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.
- Referee: Abstract (fine-tuned scorer control paragraph): The claim that GPT-5.5 forecasts still beat the fine-tuned context-only control on held-out papers while GPT-5.4 nano forecasts do not is load-bearing for distinguishing information transmission from residual priming. The description does not specify how the fine-tuning corpus excludes equation-suffix patterns, TeX formatting conventions, or arXiv-style surface statistics that could overlap with the test set, leaving open the possibility that higher likelihood arises from stylistic similarity rather than forecast content.
Authors: We agree that the current description of the fine-tuning procedure is insufficiently detailed and leaves room for the interpretation raised. In the revised manuscript we will expand this section to specify that the fine-tuning corpus is drawn exclusively from context-only prompts in papers fully disjoint from the test set, with all equation suffixes, target continuations, and associated TeX formatting removed prior to training. The fine-tuning objective is limited to next-token prediction on visible context, ensuring no direct exposure to the suffix-prediction task or the specific surface patterns of the held-out equations. This clarification should rule out residual stylistic priming as the source of the observed difference between GPT-5.5 and the nano variant. revision: yes
- Referee: Abstract (results on 1363 items): The reported improvements and model-family distinctions rely on post-hoc choices in clipping thresholds and paper/equation selection; without an ablation or sensitivity analysis showing that the GPT-5.5 vs. nano distinction is robust to reasonable variations in these choices, the central empirical separation between models remains incompletely secured.
Authors: We acknowledge the value of explicit sensitivity analysis for these choices. Although the clipping threshold and selection criteria were fixed prior to the final evaluation on the basis of development-set experiments, we will add a dedicated ablation subsection in the revised manuscript. This will report results across a range of clipping thresholds (0.05–0.95) and across multiple random and stratified subsets of the 138 papers. The GPT-5.5 versus nano distinction remains stable under these variations; including the full ablation will directly address the concern about post-hoc dependence. revision: yes
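A sensitivity check of the kind discussed in this exchange could be as simple as recomputing the mean lift for two predictors across several clipping floors and checking whether their ordering flips. The sketch below assumes the threshold is a per-token probability floor, which the page does not confirm; the function names and toy values are hypothetical.

```python
import math

def mean_clipped_lift(items, floor_prob):
    """Average over items of the clipped mean log-prob with forecast minus with control."""
    floor = math.log(floor_prob)

    def score(logprobs):
        return sum(max(lp, floor) for lp in logprobs) / len(logprobs)

    return sum(score(f) - score(c) for f, c in items) / len(items)

def sweep(items_a, items_b, floors):
    """For each floor, return (floor, lift_a, lift_b, whether predictor A is ahead)."""
    rows = []
    for p in floors:
        a = mean_clipped_lift(items_a, p)
        b = mean_clipped_lift(items_b, p)
        rows.append((p, a, b, a > b))
    return rows

# Each item pairs token log-probs given the forecast with token log-probs given the control.
control = [math.log(0.2), math.log(0.01)]
strong = [([math.log(0.6), math.log(0.3)], control)]
weak = [([math.log(0.25), math.log(0.02)], control)]
rows = sweep(strong, weak, floors=[0.05, 0.1, 0.2])
stable = all(ahead for _, _, _, ahead in rows)
```

If `stable` is false for some reasonable range of floors, the separation between the two predictors depends on the clipping choice, which is the concern the referee raises.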
Circularity Check
No significant circularity in benchmark design or claims
The paper presents an empirical, label-free benchmark that generates test cases automatically from recent papers and measures clipped next-token likelihood improvements from auxiliary forecasts Z against explicit context-only controls. It adds a stronger control: a scorer fine-tuned on context-only prompts and then applied to held-out papers. These comparisons are external to the evaluated models' outputs, and no reported result reduces to a quantity fitted on the same data; the central claim is a statistical distinction between model families under controlled conditions, not a self-referential derivation. No load-bearing self-citations, ansatzes, or uniqueness theorems appear in the argument, and the setup is self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: next-token probability under a fixed scorer reflects whether Z transmits information about Y beyond surface context.