pith. sign in

arxiv: 2605.18799 · v1 · pith:MAIWBKTVnew · submitted 2026-05-11 · 💻 cs.LG · cs.AI· cs.CL

ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

Pith reviewed 2026-05-20 22:46 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords reinforcement learningcritic reasoningscientific reasoninglanguage model interactionsycophancytransition awarenessmulti-turn reasoning
0
0 comments X

The pith

ReCrit improves language model critic accuracy in scientific reasoning by rewarding helpful corrections over blind agreement with user feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that treating critic interactions as sequences of correctness changes between turns, rather than just final answer quality, allows reinforcement learning to train models that respond more effectively to challenges in scientific tasks. It breaks possible behaviors into four categories based on whether an initial answer starts correct or incorrect and whether the reply to criticism keeps it correct or makes it wrong. A sympathetic reader would care because models frequently abandon sound scientific solutions when criticized, which can undermine reliability in domains that depend on precise step-by-step reasoning. The method adds targeted rewards for useful fixes and penalties for unhelpful agreement, plus engineering steps to make the multi-turn training practical.

Core claim

ReCrit frames critic interaction as an inter-turn correctness-transition problem and decomposes initial-to-critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. The framework rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. Dynamic asynchronous rollout with tail-adaptive completion reduces waiting time during training. On ChemBench, TRQA, and EarthSE, this produces average Critic accuracy gains from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B. Ablations confirm that transition-aware rewards and quadrant weighting generate more distinguishable signals and larger interaction-l

What carries the argument

The four-quadrant decomposition of correctness transitions that assigns rewards for correction and robustness while penalizing sycophancy.

If this is right

  • Models maintain initially correct scientific answers more reliably when users raise objections.
  • Training produces clearer learning signals from interaction-level transitions than from end-of-turn accuracy alone.
  • Dynamic asynchronous rollouts make multi-turn reinforcement learning feasible at larger scales.
  • Sycophancy is reduced while still allowing genuine corrections to user criticism.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same quadrant structure could be tested in non-scientific domains such as mathematical problem solving or code review where models must resist incorrect user suggestions.
  • Applying the transition rewards to longer conversation histories might reveal whether the approach continues to separate useful feedback from harmful agreement over many turns.
  • Combining the quadrant rewards with other alignment methods could further reduce the risk of models learning biased patterns from the weighting scheme.

Load-bearing premise

That the four types of correctness transitions supply clear, distinguishable training signals that generalize beyond the specific benchmarks and model sizes tested without creating unintended response biases.

What would settle it

Training the same base models with quadrant-based rewards versus standard final-answer rewards on a fresh scientific reasoning benchmark and observing no gain or a reversal in critic-stage accuracy.

Figures

Figures reproduced from arXiv: 2605.18799 by Dianzhi Yu, Fengli Xu, Hengyuan Zhao, Lei Bai, Shuo Li, Wanghan Xu, Wanli Ouyang, Wenlong Zhang, Yaowen Hu, Yuhao Zhou, Zhenfei Yin.

Figure 1
Figure 1. Figure 1: Transition-Aware Critic Interaction. The left panel shows a scientific critic interaction where an initially wrong answer is corrected after verification. The right panel decomposes Initial-to￾Critic behavior into four transition quadrants: Correction, Robustness, Sycophancy, and Boundary. Abstract Large language models can fail in critic interaction not only by answering incor￾rectly, but also by abandoni… view at source ↗
Figure 2
Figure 2. Figure 2: ReCrit Training Pipeline. ReCrit samples multiple Initial solutions, injects critic feedback with different attitudes, samples Critic solutions, and computes transition-aware rewards from the four correctness quadrants. Group-normalized advantages then update the policy. 3.1 Problem Formulation Given a scientific question x, the model first generates an Initial solution y0. The system then provides critic … view at source ↗
Figure 3
Figure 3. Figure 3: Dynamic Asynchronous Rollout. In synchronous two-stage rollout, fast samples wait for the slowest samples before entering the critic stage. ReCrit uses asynchronous scheduling so completed samples immediately proceed to critic generation, and tail-adaptive completion finalizes the rollout batch once a sufficient fraction has completed the Critic stage. 3.6 Dynamic Asynchronous Rollout with Tail-Adaptive Co… view at source ↗
Figure 4
Figure 4. Figure 4: Boundary Weight Sensitivity. A mild penalty improves Correction, while excessive penalty increases Sycophancy. Effectiveness of each component. The first ablation block shows that the proposed training components contribute cumulative gains rather than redundant complexity: final-solution reward alone is weak, while transition-aware reward, calibrated quadrant weights, Critic-stage weighting, and finalizat… view at source ↗
Figure 5
Figure 5. Figure 5: Correction–Sycophancy Decoupling. SFT does not decouple Correction and Syco￾phancy, while ReCrit points stay above the SFT trend and provide a stronger correction signal. First 10 Last 10 0.0 0.1 0.2 0.3 0.4 0.5 Reward std before normalization 0.12 0.09 0.41 0.37 Transition reward gives denser RL signal Standard GRPO ReCrit [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Correction Case Studies. ReCrit turns a user critic into grounded correction rather than blind solution switching. Each column shows a benchmark-derived Question–Initial Solution–User Critic–Critic Solution example: the Initial solution gives a plausible but wrong rationale, and the Critic solution identifies the error and corrects it [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
read the original abstract

Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter-turn correctness-transition problem rather than a final-answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition-aware reinforcement learning framework that decomposes Initial-to-Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit further uses dynamic asynchronous rollout with tail-adaptive completion to reduce rollout waiting. On three scientific reasoning benchmarks, ChemBench, TRQA, and EarthSE, ReCrit improves average Critic accuracy from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B. Ablations show that final-answer rewards provide little interaction-level gain, while transition-aware rewards and quadrant weighting produce more distinguishable training signals and larger net Critic-stage improvement. The code is available at https://github.com/black-yt/ReCrit .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ReCrit, a reinforcement learning framework for training LLMs to handle critic interactions in scientific reasoning tasks. It models the problem as inter-turn correctness transitions and decomposes behaviors into four quadrants: Correction, Sycophancy, Robustness, and Boundary. The approach rewards correction and robustness while penalizing sycophancy, and employs dynamic asynchronous rollout with tail-adaptive completion for efficient training. Experiments on ChemBench, TRQA, and EarthSE benchmarks demonstrate improvements in Critic accuracy from 38.15 to 51.49 for the 4B model and from 45.40 to 55.59 for the 9B model, with ablations supporting the value of transition-aware rewards over final-answer rewards.

Significance. If the results hold under more rigorous validation, this could advance methods for robust interactive reasoning in LLMs applied to scientific domains by distinguishing useful corrections from sycophantic changes. The open code release supports reproducibility. The reported gains across model sizes and the contrast with final-answer rewards provide initial evidence for the utility of transition modeling, though questions remain about whether the quadrant weighting generalizes beyond the specific benchmarks.

major comments (3)
  1. The central claim that the four-quadrant decomposition produces distinguishable training signals rests on reliable automatic labeling of transitions during rollout. The manuscript should detail the exact criteria, thresholds, or classifiers used to assign behaviors to Correction, Sycophancy, Robustness, or Boundary quadrants (likely in the reward formulation or rollout sections), as noisy or benchmark-specific labeling would undermine the ablation advantage over final-answer rewards.
  2. Ablation results are cited as showing larger net Critic-stage improvement from transition-aware rewards and quadrant weighting. To make this load-bearing evidence robust, report per-quadrant reward magnitudes, the precise weighting scheme, and statistical tests or variance across multiple seeds, rather than qualitative statements that the signals are 'more distinguishable'.
  3. The headline accuracy gains (38.15 to 51.49 on 4B; 45.40 to 55.59 on 9B) are averages across ChemBench, TRQA, and EarthSE. Per-benchmark breakdowns with standard deviations or confidence intervals should be added to confirm that improvements are not concentrated in one dataset whose user-criticism patterns happen to align with the chosen quadrant weights.
minor comments (2)
  1. Define 'tail-adaptive completion' more explicitly, perhaps with a short algorithm sketch or example of how it reduces rollout waiting time.
  2. Add a table listing all quadrant-specific reward values and any scaling hyperparameters to aid exact reproduction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have carefully addressed each major comment below. Where the comments identify areas needing greater clarity or additional reporting, we have revised the manuscript accordingly to strengthen the presentation of our methods and results.

read point-by-point responses
  1. Referee: The central claim that the four-quadrant decomposition produces distinguishable training signals rests on reliable automatic labeling of transitions during rollout. The manuscript should detail the exact criteria, thresholds, or classifiers used to assign behaviors to Correction, Sycophancy, Robustness, or Boundary quadrants (likely in the reward formulation or rollout sections), as noisy or benchmark-specific labeling would undermine the ablation advantage over final-answer rewards.

    Authors: We agree that explicit documentation of the labeling procedure is necessary to support the central claims. The original manuscript described quadrant assignment at a high level via correctness transitions but did not provide sufficient operational detail. In the revision we have expanded Section 3.2 with the precise automatic labeling rules, including the correctness evaluator (exact match plus tolerance for numerical answers; LLM-as-judge for open-ended reasoning), the semantic-difference threshold used to distinguish Correction from Robustness, and pseudocode for the full assignment logic. These additions make the source of the training signals fully reproducible and allow readers to assess potential benchmark-specific noise. revision: yes

  2. Referee: Ablation results are cited as showing larger net Critic-stage improvement from transition-aware rewards and quadrant weighting. To make this load-bearing evidence robust, report per-quadrant reward magnitudes, the precise weighting scheme, and statistical tests or variance across multiple seeds, rather than qualitative statements that the signals are 'more distinguishable'.

    Authors: We accept that quantitative reporting is required to substantiate the ablation claims. The revised manuscript now includes an explicit table of the weighting scheme (positive reward for Correction and Robustness, negative for Sycophancy, small positive for Boundary) together with observed average reward magnitudes per quadrant. All ablation runs have been repeated across five random seeds; we report means and standard deviations and include a paired statistical test comparing transition-aware versus final-answer reward variants. These changes replace the previous qualitative description and directly address the request for more rigorous evidence. revision: yes

  3. Referee: The headline accuracy gains (38.15 to 51.49 on 4B; 45.40 to 55.59 on 9B) are averages across ChemBench, TRQA, and EarthSE. Per-benchmark breakdowns with standard deviations or confidence intervals should be added to confirm that improvements are not concentrated in one dataset whose user-criticism patterns happen to align with the chosen quadrant weights.

    Authors: We agree that aggregated results alone are insufficient to demonstrate consistent gains. The revised results section now contains a disaggregated table showing Critic accuracy for each of the three benchmarks separately, accompanied by standard deviations computed over multiple seeds. The per-benchmark numbers confirm that the reported improvements appear across all datasets rather than being driven by any single benchmark, thereby supporting the broader applicability of the quadrant-based reward design. revision: yes

Circularity Check

0 steps flagged

No significant circularity in ReCrit's empirical RL framework

full rationale

The paper defines a four-quadrant decomposition of correctness transitions (Correction, Sycophancy, Robustness, Boundary) and assigns rewards accordingly within a standard RL setup for critic interaction. Central claims consist of measured accuracy gains on external benchmarks (ChemBench, TRQA, EarthSE) plus ablations comparing transition-aware rewards to final-answer baselines. No equations or results reduce by construction to fitted inputs, self-citations, or renamed priors; the derivation chain is self-contained with independent empirical validation and publicly released code.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Limited information available from abstract only; no explicit free parameters, axioms, or invented entities beyond the introduced quadrant categories are described.

invented entities (1)
  • Four quadrants (Correction, Sycophancy, Robustness, Boundary) no independent evidence
    purpose: Decompose Initial-to-Critic behavior for targeted reward signals
    Introduced to structure the RL objective around transition types

pith-pipeline@v0.9.0 · 5818 in / 1180 out tokens · 35363 ms · 2026-05-20T22:46:32.714975+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 11 internal anchors

  1. [1]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  2. [2]

    Intern-s1: A scientific multimodal foundation model.arXiv preprint arXiv:2508.15763, 2025

    Lei Bai, Zhongrui Cai, Yuhang Cao, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, et al. Intern-s1: A scientific multimodal foundation model.arXiv preprint arXiv:2508.15763, 2025

  3. [3]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022

  4. [4]

    Scimaster: Towards general-purpose scientific ai agents, part i

    Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Yuzhi Zhang, Linfeng Zhang, Siheng Chen, et al. Scimaster: Towards general-purpose scientific ai agents, part i. x-master as foundation: Can we lead on humanity’s last exam?arXiv preprint arXiv:2507.05241, 2025

  5. [5]

    Spc: Evolving self-play critic via adversarial games for llm reasoning.arXiv preprint arXiv:2504.19162, 2025

    Jiaqi Chen, Bang Zhang, Ruotian Ma, Peisong Wang, Xiaodan Liang, Zhaopeng Tu, Xiaolong Li, and Kwan-Yee K Wong. Spc: Evolving self-play critic via adversarial games for llm reasoning.arXiv preprint arXiv:2504.19162, 2025

  6. [6]

    Rm-r1: Reward modeling as reasoning.arXiv preprint arXiv:2505.02387, 2025

    Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, et al. Rm-r1: Reward modeling as reasoning.arXiv preprint arXiv:2505.02387, 2025

  7. [7]

    ELEPHANT: Measuring and understanding social sycophancy in LLMs

    Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. Elephant: Measuring and understanding social sycophancy in llms.arXiv preprint arXiv:2505.13995, 2025

  8. [8]

    Syceval: Evaluating llm sycophancy

    Aaron Fanous, Jacob Goldberg, Ank Agarwal, Joanna Lin, Anson Zhou, Sonnet Xu, Vasiliki Bikia, Roxana Daneshjou, and Sanmi Koyejo. Syceval: Evaluating llm sycophancy. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 8, pages 893–900, 2025

  9. [9]

    Rollpacker: Mitigating long-tail rollouts for fast, synchronous rl post-training.arXiv preprint arXiv:2509.21009, 2025

    Wei Gao, Yuheng Zhao, Dakai An, Tianyuan Wu, Lunxi Cao, Shaopan Xiong, Ju Huang, Weixun Wang, Siran Yang, Wenbo Su, et al. Rollpacker: Mitigating long-tail rollouts for fast, synchronous rl post-training.arXiv preprint arXiv:2509.21009, 2025

  10. [10]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  11. [11]

    Asyncflow: An asynchronous streaming rl framework for efficient llm post-training.arXiv preprint arXiv:2507.01663, 2025

    Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, et al. Asyncflow: An asynchronous streaming rl framework for efficient llm post-training.arXiv preprint arXiv:2507.01663, 2025

  12. [12]

    Opendatalab: Empowering general artificial intelligence with open datasets.arXiv preprint arXiv:2407.13773, 2024

    Conghui He, Wei Li, Zhenjiang Jin, Chao Xu, Bin Wang, and Dahua Lin. Opendatalab: Empowering general artificial intelligence with open datasets.arXiv preprint arXiv:2407.13773, 2024

  13. [13]

    Measuring sycophancy of language models in multi-turn dialogues.arXiv preprint arXiv:2505.23840, 2025

    Jiseung Hong, Grace Byun, Seungone Kim, Kai Shu, and Jinho D Choi. Measuring sycophancy of language models in multi-turn dialogues.arXiv preprint arXiv:2505.23840, 2025

  14. [14]

    A survey of scientific large language models: From data foundations to agent frontiers.arXiv preprint arXiv:2508.21148, 2025

    Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, et al. A survey of scientific large language models: From data foundations to agent frontiers.arXiv preprint arXiv:2508.21148, 2025. 12

  15. [15]

    Feedback friction: Llms struggle to fully incorporate external feedback.arXiv preprint arXiv:2506.11930, 2025

    Dongwei Jiang, Alvin Zhang, Andrew Wang, Nicholas Andrews, and Daniel Khashabi. Feedback friction: Llms struggle to fully incorporate external feedback.arXiv preprint arXiv:2506.11930, 2025

  16. [16]

    Rollout-training co-design for efficient llm-based multi-agent reinforcement learning.arXiv preprint arXiv:2602.09578, 2026

    Zhida Jiang, Zhaolong Xing, Jiawei Lu, Yipei Niu, Qingyuan Sang, Liangxu Zhang, Wenquan Dai, Junhua Shu, Jiaxing Wang, Qiangyu Pei, et al. Rollout-training co-design for efficient llm-based multi-agent reinforcement learning.arXiv preprint arXiv:2602.09578, 2026

  17. [17]

    PhD thesis, UC Berkeley, 2025

    Woosuk Kwon.vLLM: An Efficient Inference Engine for Large Language Models. PhD thesis, UC Berkeley, 2025

  18. [18]

    LLMs Get Lost In Multi-Turn Conversation

    Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. Llms get lost in multi-turn conversation.arXiv preprint arXiv:2505.06120, 2025

  19. [19]

    Does sycophancy change decisions? effect of llm sycophancy on ai-assisted decision-making

    Zejian Li, Jiaman Pan, Qi Liu, Yuning Xi, Yixiang Zhou, Yike Jin, Rongjie Mao, and Pei Chen. Does sycophancy change decisions? effect of llm sycophancy on ai-assisted decision-making. InProceedings of the 2026 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2026

  20. [20]

    Comparative analysis and parametric tuning of ppo, grpo, and dapo for llm reasoning enhancement.arXiv preprint arXiv:2512.07611, 2025

    Yongsheng Lian. Comparative analysis and parametric tuning of ppo, grpo, and dapo for llm reasoning enhancement.arXiv preprint arXiv:2512.07611, 2025

  21. [21]

    Truth decay: quantifying multi-turn sycophancy in language models.arXiv preprint arXiv:2503.11656, 2025

    Joshua Liu, Aarav Jain, Soham Takuri, Srihan Vege, Aslihan Akalin, Kevin Zhu, Sean O’Brien, and Vasu Sharma. Truth decay: quantifying multi-turn sycophancy in language models.arXiv preprint arXiv:2503.11656, 2025

  22. [22]

    Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

  23. [23]

    Climaqa: An automated evaluation framework for climate question answering models.arXiv preprint arXiv:2410.16701, 2024

    Veeramakali Vignesh Manivannan, Yasaman Jafari, Srikar Eranky, Spencer Ho, Rose Yu, Duncan Watson-Parris, Yian Ma, Leon Bergen, and Taylor Berg-Kirkpatrick. Climaqa: An automated evaluation framework for climate question answering models.arXiv preprint arXiv:2410.16701, 2024

  24. [24]

    A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists.Nature Chemistry, 17(7):1027–1034, 2025

    Adrian Mirza, Nawaf Alampara, Sreekanth Kunchapu, Martiño Ríos-García, Benedict Emoek- abu, Aswanth Krishnan, Tanya Gupta, Mara Schilling-Wilhelmi, Macjonathan Okereke, Anagha Aneesh, et al. A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists.Nature Chemistry, 17(7):1027–1034, 2025

  25. [25]

    A comprehensive overview of large language models.ACM Transactions on Intelligent Systems and Technology, 16(5):1–72, 2025

    Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models.ACM Transactions on Intelligent Systems and Technology, 16(5):1–72, 2025

  26. [26]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  27. [27]

    Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models

    Sanskar Pandey, Ruhaan Chopra, Angkul Puniya, and Sohom Pal. Beacon: Single-turn diagnosis and mitigation of latent sycophancy in large language models.arXiv preprint arXiv:2510.16727, 2025

  28. [28]

    Brokenmath: A benchmark for sycophancy in theorem proving with llms.arXiv preprint arXiv:2510.04721, 2025

    Ivo Petrov, Jasper Dekoninck, and Martin Vechev. Brokenmath: A benchmark for sycophancy in theorem proving with llms.arXiv preprint arXiv:2510.04721, 2025

  29. [29]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  30. [30]

    Towards Understanding Sycophancy in Language Models

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. Towards understanding sycophancy in language models.arXiv preprint arXiv:2310.13548, 2023. 13

  31. [31]

    Rethinking supervised fine-tuning: Em- phasizing key answer tokens for improved llm accuracy.arXiv preprint arXiv:2512.21017, 2025

    Xiaofeng Shi, Qian Kou, Yuduo Li, and Hua Zhou. Rethinking supervised fine-tuning: Em- phasizing key answer tokens for improved llm accuracy.arXiv preprint arXiv:2512.21017, 2025

  32. [32]

    Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

  33. [33]

    Llm-based multi-agent reinforcement learning: Current and future directions.arXiv preprint arXiv:2405.11106, 2024

    Chuanneng Sun, Songjun Huang, and Dario Pompili. Llm-based multi-agent reinforcement learning: Current and future directions.arXiv preprint arXiv:2405.11106, 2024

  34. [34]

    Eigen-1: Adaptive multi-agent refinement with monitor-based rag for scientific reasoning.arXiv preprint arXiv:2509.21193, 2025

    Xiangru Tang, Wanghan Xu, Yujie Wang, Zijie Guo, Daniel Shao, Jiapeng Chen, Cixuan Zhang, Ziyi Wang, Lixin Zhang, Guancheng Wan, et al. Eigen-1: Adaptive multi-agent refinement with monitor-based rag for scientific reasoning.arXiv preprint arXiv:2509.21193, 2025

  35. [35]

    Qwen Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026

  36. [36]

    Can llms correct themselves? a benchmark of self-correction in llms.arXiv preprint arXiv:2510.16062, 2025

    Guiyao Tie, Zenghui Yuan, Zeli Zhao, Chaoran Hu, Tianhe Gu, Ruihang Zhang, Sizhe Zhang, Junran Wu, Xiaoyue Tu, Ming Jin, et al. Can llms correct themselves? a benchmark of self-correction in llms.arXiv preprint arXiv:2510.16062, 2025

  37. [37]

    Reinforcement Learning for LLM Post-Training: A Survey

    Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Xiang-Bo Mao, Sitaram Asur, et al. A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more.arXiv preprint arXiv:2407.16216, 2024

  38. [38]

    Reinforcing multi-turn reasoning in llm agents via turn-level reward design.arXiv preprint arXiv:2505.11821, 2025

    Quan Wei, Siliang Zeng, Chenliang Li, William Brown, Oana Frunza, Wei Deng, Anderson Schneider, Yuriy Nevmyvaka, Yang Katie Zhao, Alfredo Garcia, et al. Reinforcing multi-turn reasoning in llm agents via turn-level reward design.arXiv preprint arXiv:2505.11821, 2025

  39. [39]

    Critique-rl: Training language models for critiquing through two-stage reinforcement learning.arXiv preprint arXiv:2510.24320, 2025

    Zhiheng Xi, Jixuan Huang, Xin Guo, Boyang Hong, Dingwen Yang, Xiaoran Fan, Shuo Li, Zehui Chen, Junjie Ye, Siyu Yuan, et al. Critique-rl: Training language models for critiquing through two-stage reinforcement learning.arXiv preprint arXiv:2510.24320, 2025

  40. [40]

    Stepwiser: Stepwise generative judges for wiser reasoning.arXiv preprint arXiv:2508.19229, 2025

    Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, and Sainbayar Sukhbaatar. Stepwiser: Stepwise generative judges for wiser reasoning.arXiv preprint arXiv:2508.19229, 2025

  41. [41]

    Earthse: A benchmark evaluating earth scientific exploration capability for large language models

    Wanghan Xu, Xiangyu Zhao, Yuhao Zhou, Xiaoyu Yue, Ben Fei, Fenghua Ling, Wenlong Zhang, and Lei Bai. Earthse: A benchmark evaluating earth scientific exploration capability for large language models. InThe Fourteenth International Conference on Learning Representations, 2025

  42. [42]

    Probing scientific general intelligence of llms with scientist- aligned workflows.arXiv preprint arXiv:2512.16969, 2025

    Wanghan Xu, Yuhao Zhou, Yifan Zhou, Qinglong Cao, Shuo Li, Jia Bu, Bo Liu, Yixin Chen, Xuming He, Xiangyu Zhao, et al. Probing scientific general intelligence of llms with scientist- aligned workflows.arXiv preprint arXiv:2512.16969, 2025

  43. [43]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  44. [44]

    A survey on multi-turn interaction capabilities of large language models.arXiv preprint arXiv:2501.09959, 2025

    Chen Zhang, Xinyi Dai, Yaxiong Wu, Qu Yang, Yasheng Wang, Ruiming Tang, and Yong Liu. A survey on multi-turn interaction capabilities of large language models.arXiv preprint arXiv:2501.09959, 2025

  45. [45]

    Prorl agent: Rollout-as-a-service for rl training of multi-turn llm agents.arXiv preprint arXiv:2603.18815, 2026

    Hao Zhang, Mingjie Liu, Shaokun Zhang, Songyang Han, Jian Hu, Zhenghui Jin, Yuchi Zhang, Shizhe Diao, Ximing Lu, Binfeng Xu, et al. Prorl agent: Rollout-as-a-service for rl training of multi-turn llm agents.arXiv preprint arXiv:2603.18815, 2026

  46. [46]

    Critique-grpo: Advancing llm reasoning with natural language and numerical feedback

    Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang, and Helen Meng. Critique-grpo: Advancing llm reasoning with natural language and numerical feedback. arXiv preprint arXiv:2506.03106, 2025

  47. [47]

    Origene: A self-evolving virtual disease biologist automating therapeutic target discovery.BioRxiv, pages 2025–06, 2025

    Zhongyue Zhang, Zijie Qiu, Yingcheng Wu, Shuya Li, Dingyan Wang, Yong Liu, Zhuomin Zhou, Yusong Hu, Yuhan Chen, Duo An, et al. Origene: A self-evolving virtual disease biologist automating therapeutic target discovery.BioRxiv, pages 2025–06, 2025. 14

  48. [48]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models.arXiv preprint arXiv:2303.18223, 1(2):1–124, 2023

  49. [49]

    Swift: a scalable lightweight infrastructure for fine-tuning

    Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, et al. Swift: a scalable lightweight infrastructure for fine-tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29733–29735, 2025

  50. [50]

    Intern-s1-pro: Scientific multimodal foundation model at trillion scale.arXiv preprint arXiv:2603.25040, 2026

    Yicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou, Peiheng Zhou, Xinyu Zhou, Dongzhan Zhou, Zhiwang Zhou, Yuhao Zhou, et al. Intern-s1-pro: Scientific multimodal foundation model at trillion scale.arXiv preprint arXiv:2603.25040, 2026. 15 A Training and Evaluation Details We use ms-swift [49] as the training framework. Tables 4–11 report the sett...

  51. [52]

    \boxed{A}

    After </think>, provide only the final answer. For multiple-choice questions, output only the option label(s) (e.g., A, B, C, D, BC, DI, ADI, etc.) without including the option content or any explanatory phrases such as "\boxed{A}" or "The answer is C." Initial Solution <think> In cationic polymerization, monomer reactivity is governed by the stability of...

  52. [54]

    \boxed{A}

    After </think>, provide only the final answer. For multiple-choice questions, output only the option label(s) (e.g., A, B, C, D, BC, DI, ADI, etc.) without including the option content or any explanatory phrases such as "\boxed{A}" or "The answer is C." Initial Solution <think> ABCA1 is a transporter involved in cholesterol efflux, and its suppression in ...

  53. [55]

    Clearly show your reasoning process enclosed within <think> and </think>

  54. [56]

    \boxed{A}

    After </think>, provide only the final answer. For multiple-choice questions, output only the option label(s) (e.g., A, B, C, D, BC, DI, ADI, etc.) without including the option content or any explanatory phrases such as "\boxed{A}" or "The answer is C." Initial Solution <think> The question asks about the primary advantage of the one-parameter wing-scalin...