pith. sign in

arxiv: 2605.14636 · v1 · pith:CSKLY5YXnew · submitted 2026-05-14 · 💻 cs.AI

Teaching Large Language Models When Not to Know: Learning Temporal Critique for Ex-Ante Reasoning

Pith reviewed 2026-06-30 20:44 UTC · model grok-4.3

classification 💻 cs.AI
keywords temporal leakageex-ante reasoninglarge language modelsfine-tuningtemporal cutoffcritique trainingLLM verification
0
0 comments X

The pith

A fine-tuning method trains LLMs to detect and reject answers that use knowledge unavailable before a given cutoff.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models leak post-cutoff information when asked to answer from an earlier time perspective. Prompting steers models into a temporal frame but fails to instill the ability to verify whether a generated response respects the cutoff, because admissibility is a relation between answer and cutoff rather than an intrinsic property of the answer. Standard supervised fine-tuning inherits the same limitation. The paper therefore introduces Temporal Critique Fine-Tuning, which supplies the model with query, cutoff, and candidate response triples and trains it to flag leakage, explain violations, and judge admissibility. On Qwen2.5-7B-Instruct and 14B-Instruct the method cuts average leakage by 41.89 and 37.79 percentage points relative to prompting and ordinary SFT baselines.

Core claim

The Temporal Critique Fine-Tuning framework trains models, given a query, cutoff, and candidate response, to identify post-cutoff leakage, explain temporal boundary violations, and judge temporal admissibility, thereby enabling cutoff-aware verification that prompting and standard supervised fine-tuning do not produce.

What carries the argument

Temporal Critique Fine-Tuning (TCFT), which supplies supervised examples of leakage identification and admissibility judgment to teach cutoff-aware verification.

If this is right

  • TCFT reduces average leakage by 41.89 percentage points on Qwen2.5-7B-Instruct and 37.79 on Qwen2.5-14B-Instruct relative to prompting and SFT baselines.
  • Explicit cutoff statements outperform implicit historical framings at reducing leakage.
  • Prefix constraints reduce leakage more effectively than suffix constraints.
  • Ex-ante correctness cannot be learned by ordinary supervised fine-tuning because it is a relation between answer and cutoff.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same critique-training pattern could be applied to other relational constraints such as domain-specific knowledge limits or source attribution.
  • Verification training might improve a model's ability to refuse answers when any required fact is missing, not only temporally missing facts.
  • The approach suggests a route to training models that maintain consistency across multiple time-stamped queries in a single conversation.

Load-bearing premise

Training a model to critique leakage on supervised examples will cause it to apply cutoff-aware verification during ordinary generation instead of merely memorizing critique patterns.

What would settle it

After TCFT training, evaluate the model on fresh queries whose post-cutoff facts were never seen in the critique examples; if leakage rates remain comparable to prompting baselines, the generalization claim is false.

Figures

Figures reproduced from arXiv: 2605.14636 by Chenlu Ding, Jiancan Wu, Xiang Wang, Yancheng Yuan, Yanchen Luo, Zheyuan Liu.

Figure 1
Figure 1. Figure 1: Overview of ex-ante reasoning tasks, temporal leakage in SOTA models, and the comparison [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Investigating the brittleness of prompt-based control and temporal leakage in proprietary [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Cross-template generalization of TCFT on QA and Wiki with Qwen2.5-7B-Instruct. Rows [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Ablation results of TCFT on QA and Wiki with Qwen2.5-7B-Instruct. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
read the original abstract

Large language models (LLMs) often fail to reason under temporal cutoffs: when prompted to answer from the standpoint of an earlier time, they exploit knowledge that became available only later. We study this failure through the lens of ex-ante reasoning, where a model must rely exclusively on information knowable before a cutoff. Through a systematic analysis of prompt-level interventions, we find that temporal leakage is highly sensitive to cutoff formulation and instruction placement: explicit cutoff statements outperform implicit historical framings, and prefix constraints reduce leakage more effectively than suffix constraints. These findings indicate that prompting can steer models into a temporal frame, but does not endow them with the ability to verify whether a response is temporally admissible. We further argue that supervised fine-tuning is insufficient, since ex-ante correctness is not an intrinsic property of an answer, but a relation between the answer and the cutoff. To address this gap, we propose TCFT, a Temporal Critique Fine-Tuning framework that trains models to acquire cutoff-aware temporal verification. Given a query, a cutoff, and a candidate response, TCFT teaches the model to identify post-cutoff leakage, explain temporal boundary violations, and judge temporal admissibility. Experiments with Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct show that TCFT consistently outperforms prompting and SFT baselines, reducing average leakage by 41.89 and 37.79 percentage points, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that LLMs fail at ex-ante reasoning under temporal cutoffs by leaking post-cutoff knowledge. Systematic prompt analysis shows leakage is sensitive to cutoff formulation and instruction placement, with explicit prefixes most effective, but prompting alone cannot teach verification. The authors argue SFT is insufficient because temporal admissibility is relational rather than intrinsic to an answer. They introduce TCFT, which fine-tunes on (query, cutoff, response) triples to produce leakage identification, violation explanations, and admissibility judgments. Experiments on Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct report that TCFT reduces average leakage by 41.89 and 37.79 percentage points over prompting and SFT baselines.

Significance. If the gains prove robust to distribution shift, the work would be significant for temporal reasoning in LLMs, a capability relevant to historical QA, forecasting, and knowledge-grounded generation. The explicit separation of steering (prompting) from learned verification (critique fine-tuning) is a useful conceptual contribution, and the consistent improvements across two model scales provide initial evidence that the approach is viable. The paper does not claim parameter-free derivations or machine-checked proofs, but the experimental deltas are the primary result.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: The central claim that TCFT produces generalizable cutoff-aware verification (rather than memorization of critique patterns) rests on the reported leakage reductions, yet no description is given of held-out cutoffs, novel response styles, or out-of-distribution domains in the evaluation. Without such controls, the 37–42 pp gains could be explained by surface-pattern matching on the supervised triples.
  2. [Abstract] Abstract: The quantitative results (41.89 pp and 37.79 pp reductions) are presented without dataset size, leakage metric definition, baseline implementation details, or statistical significance tests. This absence prevents assessment of whether the deltas are reliable or sensitive to evaluation choices.
  3. [Method] Method section: The paper correctly notes that ex-ante correctness is relational, yet TCFT remains a standard supervised objective over fixed (query, cutoff, response) examples. Additional analysis is required to show how the critique objective captures the relational structure in a way that ordinary SFT on answers does not.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it briefly stated the number of evaluation instances and how leakage is operationalized (e.g., exact knowledge overlap or human judgment).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate planned revisions to improve clarity, rigor, and support for our claims.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: The central claim that TCFT produces generalizable cutoff-aware verification (rather than memorization of critique patterns) rests on the reported leakage reductions, yet no description is given of held-out cutoffs, novel response styles, or out-of-distribution domains in the evaluation. Without such controls, the 37–42 pp gains could be explained by surface-pattern matching on the supervised triples.

    Authors: We acknowledge this limitation in the current manuscript. The reported experiments use cutoffs and response styles drawn from the same distribution as the supervised training triples, without explicit held-out cutoffs or OOD domains. This leaves open the possibility of surface-level pattern matching. In the revision we will add a new subsection in Experiments with held-out temporal cutoffs, novel response styles, and queries from out-of-distribution domains (e.g., different topical areas), reporting leakage reductions under these conditions to better substantiate the generalizability claim. revision: yes

  2. Referee: [Abstract] Abstract: The quantitative results (41.89 pp and 37.79 pp reductions) are presented without dataset size, leakage metric definition, baseline implementation details, or statistical significance tests. This absence prevents assessment of whether the deltas are reliable or sensitive to evaluation choices.

    Authors: We agree that the abstract should be self-contained for these details. While the full definitions, dataset sizes (approximately 12k training triples and 2.5k evaluation instances per model), leakage metric (binary post-cutoff knowledge detection), baseline implementations, and bootstrap significance tests appear in Sections 4 and 5, we will revise the abstract to include concise statements of dataset scale, metric definition, baseline descriptions, and significance results. revision: yes

  3. Referee: [Method] Method section: The paper correctly notes that ex-ante correctness is relational, yet TCFT remains a standard supervised objective over fixed (query, cutoff, response) examples. Additional analysis is required to show how the critique objective captures the relational structure in a way that ordinary SFT on answers does not.

    Authors: We accept that the current Method section provides only a high-level argument for the relational nature of admissibility. To demonstrate the distinction, we will add an analysis subsection that (a) compares TCFT directly against standard answer-only SFT on the same queries and cutoffs, (b) ablates the three critique heads (leakage identification, violation explanation, admissibility judgment), and (c) reports how performance on relational verification tasks degrades under answer-only supervision. This will be included in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with external experimental validation

full rationale

The paper defines TCFT as a supervised fine-tuning procedure on (query, cutoff, response) triples that produces leakage identification, violation explanations, and admissibility judgments. It then reports performance deltas on held-out evaluations against prompting and standard SFT baselines. No equations, fitted parameters, or self-citations appear in the provided text. The central claim (leakage reduction of ~38-42 pp) is an observed experimental outcome rather than a quantity defined in terms of the training objective itself. The derivation chain therefore contains no self-definitional, fitted-input-called-prediction, or self-citation-load-bearing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework implicitly assumes that temporal admissibility can be learned as a supervised classification task over (query, cutoff, response) triples.

axioms (1)
  • domain assumption Ex-ante correctness is a relation between answer and cutoff that can be taught via supervised critique examples
    Stated directly in the abstract as the motivation for moving beyond SFT

pith-pipeline@v0.9.1-grok · 5809 in / 1193 out tokens · 21598 ms · 2026-06-30T20:44:05.494441+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

78 extracted references · 15 canonical work pages · 7 internal anchors

  1. [1]

    ExAnte: A benchmark for ex-ante inference in large language models

    Yachuan Liu, Xiaochun Wei, Lin Shi, Xinnuo Li, Bohan Zhang, Paramveer Dhillon, and Qiaozhu Mei. ExAnte: A benchmark for ex-ante inference in large language models. InEACL (V olume 1: Long Papers), pages 1551–1571. Association for Computational Linguistics, 2026

  2. [2]

    Large language models in finance (finllms)

    Jean Lee, Nicholas Stevens, and Soyeon Caren Han. Large language models in finance (finllms). Neural Comput. Appl., 37(30):24853–24867, 2025

  3. [3]

    Chatgpt for textual analysis? how to use generative llms in accounting research

    Ties de Kok. Chatgpt for textual analysis? how to use generative llms in accounting research. Manag. Sci., 71(9):7888–7906, 2025

  4. [4]

    Battleagent: Multi-modal dynamic emulation on historical battles to complement historical analysis

    Shuhang Lin, Wenyue Hua, Lingyao Li, Che-Jui Chang, Lizhou Fan, Jianchao Ji, Hang Hua, Mingyu Jin, Jiebo Luo, and Yongfeng Zhang. Battleagent: Multi-modal dynamic emulation on historical battles to complement historical analysis. InEMNLP (System Demonstrations), pages 172–181. Association for Computational Linguistics, 2024

  5. [5]

    Researchagent: Iterative research idea generation over scientific literature with large language models

    Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. Researchagent: Iterative research idea generation over scientific literature with large language models. In NAACL (Long Papers), pages 6709–6738. Association for Computational Linguistics, 2025

  6. [6]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report.CoRR, abs/2303.08774, 2023

  7. [7]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team. Gemini: A family of highly capable multimodal models.CoRR, abs/2312.11805, 2023

  8. [8]

    Lawrie, Daniel Khashabi, and Ben- jamin Van Durme

    Jeffrey Cheng, Marc Marone, Orion Weller, Dawn J. Lawrie, Daniel Khashabi, and Ben- jamin Van Durme. Dated data: Tracing knowledge cutoffs in large language models.CoRR, abs/2403.12958, 2024

  9. [9]

    Chi, Quoc V

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InNeurIPS, 2022

  10. [10]

    Towards mitigating LLM hallucination via self reflection

    Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. Towards mitigating LLM hallucination via self reflection. InEMNLP (Findings), Findings of ACL, pages 1827–

  11. [11]

    Association for Computational Linguistics, 2023

  12. [12]

    Critique Fine-Tuning: Learning to critique is more effective than learning to imitate.CoRR, abs/2501.17703, 2025

    Yubo Wang, Xiang Yue, and Wenhu Chen. Critique Fine-Tuning: Learning to critique is more effective than learning to imitate.CoRR, abs/2501.17703, 2025

  13. [13]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. InICLR. OpenReview.net, 2022

  14. [14]

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V . Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, ...

  15. [15]

    Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M

    Jason Wei, Maarten Bosma, Vincent Y . Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V . Le. Finetuned language models are zero-shot learners. In ICLR. OpenReview.net, 2022. 10

  16. [16]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fee...

  17. [17]

    Smith, Daniel Khashabi, and Hannaneh Hajishirzi

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instruc- tions. InACL (1), pages 13484–13508. Association for Computational Linguistics, 2023

  18. [18]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InICLR. OpenReview.net, 2024

  19. [20]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  20. [21]

    Wikipedia

    Wikimedia Foundation. Wikipedia. https://www.wikipedia.org/, 2026. Accessed: 2026- 05-07

  21. [22]

    Large language models are zero-shot time series forecasters.Advances in neural information processing systems, 36:19622– 19635, 2023

    Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew G Wilson. Large language models are zero-shot time series forecasters.Advances in neural information processing systems, 36:19622– 19635, 2023

  22. [23]

    Time-LLM: Time Series Forecasting by Reprogramming Large Language Models

    Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. Time-llm: Time series forecasting by reprogramming large language models.arXiv preprint arXiv:2310.01728, 2023

  23. [24]

    Order matters: Investigate the position bias in multi-constraint instruction following.Seed, 1(2):3–4

    Multi-Constraint Instruction. Order matters: Investigate the position bias in multi-constraint instruction following.Seed, 1(2):3–4

  24. [25]

    Introducing gpt-5.4

    OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/ ,

  25. [26]

    Accessed: 2026-05-07

  26. [27]

    Living in the moment: Can large language models grasp co-temporal reasoning? InACL (1), pages 13014–13033

    Zhaochen Su, Juntao Li, Jun Zhang, Tong Zhu, Xiaoye Qu, Pan Zhou, Yan Bowen, Yu Cheng, and Min Zhang. Living in the moment: Can large language models grasp co-temporal reasoning? InACL (1), pages 13014–13033. Association for Computational Linguistics, 2024

  27. [28]

    Towards benchmarking and improving the temporal reasoning capability of large language models

    Qingyu Tan, Hwee Tou Ng, and Lidong Bing. Towards benchmarking and improving the temporal reasoning capability of large language models. InACL (1), pages 14820–14835. Association for Computational Linguistics, 2023

  28. [29]

    TRAM: benchmarking temporal reasoning for large language models

    Yuqing Wang and Yun Zhao. TRAM: benchmarking temporal reasoning for large language models. InACL (Findings), Findings of ACL, pages 6389–6415. Association for Computational Linguistics, 2024

  29. [30]

    TimeBench: A comprehensive evaluation of temporal reasoning abilities in large language models

    Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Haotian Wang, Ming Liu, and Bing Qin. TimeBench: A comprehensive evaluation of temporal reasoning abilities in large language models. InACL (1), pages 1204–1228. Association for Computational Linguistics, 2024

  30. [31]

    Test of Time: A benchmark for evaluating llms on temporal reasoning

    Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, and Bryan Perozzi. Test of Time: A benchmark for evaluating llms on temporal reasoning. InICLR. OpenReview.net, 2025. 11

  31. [32]

    StreamingQA: A benchmark for adaptation to new knowledge over time in question answering models

    Adam Liska, Tomás Kociský, Elena Gribovskaya, Tayfun Terzi, Eren Sezener, Devang Agrawal, Cyprien de Masson d’Autume, Tim Scholtes, Manzil Zaheer, Susannah Young, Ellen Gilsenan- McMahon, Sophia Austin, Phil Blunsom, and Angeliki Lazaridou. StreamingQA: A benchmark for adaptation to new knowledge over time in question answering models. InICML, Proceedings...

  32. [33]

    Time travel in llms: Tracing data contamination in large language models

    Shahriar Golchin and Mihai Surdeanu. Time travel in llms: Tracing data contamination in large language models. InICLR. OpenReview.net, 2024

  33. [34]

    To the cutoff

    Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White, and Samuel Dooley. To the cutoff... and beyond? A longitudinal perspective on LLM data contamination. InICLR. OpenReview.net, 2024

  34. [35]

    In-context unlearning: Language models as few-shot unlearners

    Martin Pawelczyk, Seth Neel, and Himabindu Lakkaraju. In-context unlearning: Language models as few-shot unlearners. InICML. OpenReview.net, 2024

  35. [37]

    Unified parameter-efficient unlearning for llms

    Chenlu Ding, Jiancan Wu, Yancheng Yuan, Jinda Lu, Kai Zhang, Alex Su, Xiang Wang, and Xiangnan He. Unified parameter-efficient unlearning for llms. InICLR. OpenReview.net, 2025

  36. [38]

    Machine unlearning in generative AI: A survey.CoRR, abs/2407.20516, 2024

    Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, and Meng Jiang. Machine unlearning in generative AI: A survey.CoRR, abs/2407.20516, 2024

  37. [39]

    Towards safer large language models through machine unlearning

    Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, and Meng Jiang. Towards safer large language models through machine unlearning. InACL (Findings), Findings of ACL, pages 1817–1829. Association for Computational Linguistics, 2024

  38. [40]

    Unrolling sgd: Understanding factors influencing machine unlearning

    Anvith Thudi, Gabriel Deza, Varun Chandrasekaran, and Nicolas Papernot. Unrolling sgd: Understanding factors influencing machine unlearning. In2022 IEEE 7th European Symposium on Security and Privacy (EuroS&P), pages 303–319. IEEE, 2022

  39. [41]

    Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning

    Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning.CoRR, abs/2404.05868, 2024

  40. [42]

    Mllmeraser: Achieving test-time unlearning in multimodal large language models through activation steering.CoRR, abs/2510.04217, 2025

    Chenlu Ding, Jiancan Wu, Leheng Sheng, Fan Zhang, Yancheng Yuan, Xiang Wang, and Xiangnan He. Mllmeraser: Achieving test-time unlearning in multimodal large language models through activation steering.CoRR, abs/2510.04217, 2025

  41. [43]

    Lookahead bias in pretrained language models

    Suproteem K Sarkar and Keyon Vafa. Lookahead bias in pretrained language models. InICML 2025 Workshop on Reliable and Responsible F oundation Models, 2024

  42. [44]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. InNeurIPS, 2023

  43. [45]

    Reflexion: language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. InNeurIPS, 2023

  44. [46]

    CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

    Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Critic: Large language models can self-correct with tool-interactive critiquing.arXiv preprint arXiv:2305.11738, 2023

  45. [47]

    Double-checker: Enhancing reasoning of slow-thinking llms via self-critical fine-tuning.arXiv preprint arXiv:2506.21285, 2025

    Xin Xu, Tianhao Chen, Fan Zhang, Wanlong Liu, Pengxiang Li, Ajay Kumar Jaiswal, Yuchen Yan, Jishan Hu, Yang Wang, Hao Chen, et al. Double-checker: Enhancing reasoning of slow-thinking llms via self-critical fine-tuning.arXiv preprint arXiv:2506.21285, 2025

  46. [48]

    , Marquez, M

    Karthik Valmeekam, Matthew Marquez, and Subbarao Kambhampati. Can large language models really improve by self-critiquing their own plans?arXiv preprint arXiv:2310.08118, 2023. 12

  47. [49]

    Solving math word problems with process- and outcome-based feedback

    Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process-and outcome-based feedback.arXiv preprint arXiv:2211.14275, 2022

  48. [50]

    Math-shepherd: Verify and reinforce llms step-by-step without human annotations

    Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 9426–9439, 2024

  49. [51]

    arXiv preprint arXiv:2412.01981 , year=

    Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, and Hao Peng. Free process rewards without process labels.arXiv preprint arXiv:2412.01981, 2024

  50. [52]

    arXiv preprint arXiv:2408.11791 , year =

    Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D Chang, and Prithviraj Ammanabrolu. Critique-out-loud reward models.arXiv preprint arXiv:2408.11791, 2024

  51. [53]

    Factscore: Fine-grained atomic evaluation of factual precision in long form text generation

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023

  52. [54]

    Introducing claude opus 4.6

    Anthropic. Introducing claude opus 4.6. https://www.anthropic.com/news/ claude-opus-4-6, 2026. Accessed: 2026-05-07

  53. [55]

    Gemini 3.1 pro preview

    Google. Gemini 3.1 pro preview. https://ai.google.dev/gemini-api/docs/models/ gemini-3.1-pro-preview, 2026. Accessed: 2026-05-07

  54. [56]

    Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

  55. [57]

    prefix advantage

    Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. Towards mitigating llm hallucination via self reflection. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 1827–1843, 2023. A Related Work A.1 Temporal Reasoning Evaluating LLMs on time-related dimensions has historically centered on temporal reasoning...

  56. [58]

    The 2020 Summer Olympics were scheduled to take place in Tokyo, Japan

  57. [59]

    The opening ceremony of the 2020 Summer Olympics was planned for July 24, 2020

  58. [60]

    The 2020 Summer Olympics were originally set to include 33 sports across 339 events

  59. [61]

    The Olympic Village for the 2020 Summer Olympics was located in Musashino Forest Park

  60. [62]

    I don’t know

    The torch relay for the 2020 Summer Olympics began on March 26, 2020. 19 Critique 1–3 are fine as of Jan. 1, 2020: Tokyo was the host city, the opening ceremony was scheduled for July 24, 2020, and the program was planned for 33 sports and 339 events. 4 is wrong factually. Musashino Forest Park was a venue, not the Olympic Village. The Olympic Village was...

  61. [67]

    Few-shot

    Emma Watson has blonde hair and blue eyes. Few-shot

  62. [69]

    She played the role of Hermione Granger in the Harry Potter film series

  63. [70]

    Emma Watson was appointed as a UN Women Goodwill Ambassador in 2014(after the cutoff)

  64. [71]

    She graduated from Brown University in 2017 with a degree in English literature and French studies(post-cutoff; actual graduation was in 2014)

  65. [72]

    21 Chain-of-thought

    As of January 1st, 2011, she was known for her roles in the Harry Potter films and had not yet become a UN Women Goodwill Ambassador. 21 Chain-of-thought

  66. [74]

    Emma Watson played the character Hermione Granger in the Harry Potter film series

  67. [75]

    Emma Watson is an English actress and UN Women Goodwill Ambassador(appointed in Jul. 2014)

  68. [76]

    Emma Watson graduated from Brown University in 2014(after the cutoff)

  69. [77]

    Self-verification

    Emma Watson has blonde hair and blue eyes. Self-verification

  70. [79]

    The Perks of Being a Wallflower

    On January 1, 2011, Emma Watson was starring in the film “The Perks of Being a Wallflower.”

  71. [80]

    As of January 1, 2011, Emma Watson had completed her role as Hermione Granger in the Harry Potter film series

  72. [81]

    Emma Watson was on an academic leave from Brown University(announced after the cutoff)as of January 1, 2011

  73. [82]

    Ours (TCFT)

    By January 1, 2011, Emma Watson had been involved in various UN Women campaigns promoting gender equality (UN Women involvement began after the cutoff). Ours (TCFT)

  74. [83]

    Emma Watson was born on April 15, 1990

  75. [84]

    She was known for her role as Hermione Granger in the Harry Potter film series

  76. [85]

    On January 1, 2011, she was 20 years old

  77. [86]

    Harry Potter and the Deathly Hallows – Part 1

    She had appeared in the film “Harry Potter and the Deathly Hallows – Part 1” before this date

  78. [87]

    She was a British actress. 22