ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning
Pith reviewed 2026-05-20 22:46 UTC · model grok-4.3
The pith
ReCrit improves language model critic accuracy in scientific reasoning by rewarding helpful corrections over blind agreement with user feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReCrit frames critic interaction as an inter-turn correctness-transition problem and decomposes initial-to-critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. The framework rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. Dynamic asynchronous rollout with tail-adaptive completion reduces waiting time during training. On ChemBench, TRQA, and EarthSE, this produces average Critic accuracy gains from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B. Ablations confirm that transition-aware rewards and quadrant weighting generate more distinguishable signals and larger interaction-l
What carries the argument
The four-quadrant decomposition of correctness transitions that assigns rewards for correction and robustness while penalizing sycophancy.
If this is right
- Models maintain initially correct scientific answers more reliably when users raise objections.
- Training produces clearer learning signals from interaction-level transitions than from end-of-turn accuracy alone.
- Dynamic asynchronous rollouts make multi-turn reinforcement learning feasible at larger scales.
- Sycophancy is reduced while still allowing genuine corrections to user criticism.
Where Pith is reading between the lines
- The same quadrant structure could be tested in non-scientific domains such as mathematical problem solving or code review where models must resist incorrect user suggestions.
- Applying the transition rewards to longer conversation histories might reveal whether the approach continues to separate useful feedback from harmful agreement over many turns.
- Combining the quadrant rewards with other alignment methods could further reduce the risk of models learning biased patterns from the weighting scheme.
Load-bearing premise
That the four types of correctness transitions supply clear, distinguishable training signals that generalize beyond the specific benchmarks and model sizes tested without creating unintended response biases.
What would settle it
Training the same base models with quadrant-based rewards versus standard final-answer rewards on a fresh scientific reasoning benchmark and observing no gain or a reversal in critic-stage accuracy.
Figures
read the original abstract
Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter-turn correctness-transition problem rather than a final-answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition-aware reinforcement learning framework that decomposes Initial-to-Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit further uses dynamic asynchronous rollout with tail-adaptive completion to reduce rollout waiting. On three scientific reasoning benchmarks, ChemBench, TRQA, and EarthSE, ReCrit improves average Critic accuracy from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B. Ablations show that final-answer rewards provide little interaction-level gain, while transition-aware rewards and quadrant weighting produce more distinguishable training signals and larger net Critic-stage improvement. The code is available at https://github.com/black-yt/ReCrit .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ReCrit, a reinforcement learning framework for training LLMs to handle critic interactions in scientific reasoning tasks. It models the problem as inter-turn correctness transitions and decomposes behaviors into four quadrants: Correction, Sycophancy, Robustness, and Boundary. The approach rewards correction and robustness while penalizing sycophancy, and employs dynamic asynchronous rollout with tail-adaptive completion for efficient training. Experiments on ChemBench, TRQA, and EarthSE benchmarks demonstrate improvements in Critic accuracy from 38.15 to 51.49 for the 4B model and from 45.40 to 55.59 for the 9B model, with ablations supporting the value of transition-aware rewards over final-answer rewards.
Significance. If the results hold under more rigorous validation, this could advance methods for robust interactive reasoning in LLMs applied to scientific domains by distinguishing useful corrections from sycophantic changes. The open code release supports reproducibility. The reported gains across model sizes and the contrast with final-answer rewards provide initial evidence for the utility of transition modeling, though questions remain about whether the quadrant weighting generalizes beyond the specific benchmarks.
major comments (3)
- The central claim that the four-quadrant decomposition produces distinguishable training signals rests on reliable automatic labeling of transitions during rollout. The manuscript should detail the exact criteria, thresholds, or classifiers used to assign behaviors to Correction, Sycophancy, Robustness, or Boundary quadrants (likely in the reward formulation or rollout sections), as noisy or benchmark-specific labeling would undermine the ablation advantage over final-answer rewards.
- Ablation results are cited as showing larger net Critic-stage improvement from transition-aware rewards and quadrant weighting. To make this load-bearing evidence robust, report per-quadrant reward magnitudes, the precise weighting scheme, and statistical tests or variance across multiple seeds, rather than qualitative statements that the signals are 'more distinguishable'.
- The headline accuracy gains (38.15 to 51.49 on 4B; 45.40 to 55.59 on 9B) are averages across ChemBench, TRQA, and EarthSE. Per-benchmark breakdowns with standard deviations or confidence intervals should be added to confirm that improvements are not concentrated in one dataset whose user-criticism patterns happen to align with the chosen quadrant weights.
minor comments (2)
- Define 'tail-adaptive completion' more explicitly, perhaps with a short algorithm sketch or example of how it reduces rollout waiting time.
- Add a table listing all quadrant-specific reward values and any scaling hyperparameters to aid exact reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have carefully addressed each major comment below. Where the comments identify areas needing greater clarity or additional reporting, we have revised the manuscript accordingly to strengthen the presentation of our methods and results.
read point-by-point responses
-
Referee: The central claim that the four-quadrant decomposition produces distinguishable training signals rests on reliable automatic labeling of transitions during rollout. The manuscript should detail the exact criteria, thresholds, or classifiers used to assign behaviors to Correction, Sycophancy, Robustness, or Boundary quadrants (likely in the reward formulation or rollout sections), as noisy or benchmark-specific labeling would undermine the ablation advantage over final-answer rewards.
Authors: We agree that explicit documentation of the labeling procedure is necessary to support the central claims. The original manuscript described quadrant assignment at a high level via correctness transitions but did not provide sufficient operational detail. In the revision we have expanded Section 3.2 with the precise automatic labeling rules, including the correctness evaluator (exact match plus tolerance for numerical answers; LLM-as-judge for open-ended reasoning), the semantic-difference threshold used to distinguish Correction from Robustness, and pseudocode for the full assignment logic. These additions make the source of the training signals fully reproducible and allow readers to assess potential benchmark-specific noise. revision: yes
-
Referee: Ablation results are cited as showing larger net Critic-stage improvement from transition-aware rewards and quadrant weighting. To make this load-bearing evidence robust, report per-quadrant reward magnitudes, the precise weighting scheme, and statistical tests or variance across multiple seeds, rather than qualitative statements that the signals are 'more distinguishable'.
Authors: We accept that quantitative reporting is required to substantiate the ablation claims. The revised manuscript now includes an explicit table of the weighting scheme (positive reward for Correction and Robustness, negative for Sycophancy, small positive for Boundary) together with observed average reward magnitudes per quadrant. All ablation runs have been repeated across five random seeds; we report means and standard deviations and include a paired statistical test comparing transition-aware versus final-answer reward variants. These changes replace the previous qualitative description and directly address the request for more rigorous evidence. revision: yes
-
Referee: The headline accuracy gains (38.15 to 51.49 on 4B; 45.40 to 55.59 on 9B) are averages across ChemBench, TRQA, and EarthSE. Per-benchmark breakdowns with standard deviations or confidence intervals should be added to confirm that improvements are not concentrated in one dataset whose user-criticism patterns happen to align with the chosen quadrant weights.
Authors: We agree that aggregated results alone are insufficient to demonstrate consistent gains. The revised results section now contains a disaggregated table showing Critic accuracy for each of the three benchmarks separately, accompanied by standard deviations computed over multiple seeds. The per-benchmark numbers confirm that the reported improvements appear across all datasets rather than being driven by any single benchmark, thereby supporting the broader applicability of the quadrant-based reward design. revision: yes
Circularity Check
No significant circularity in ReCrit's empirical RL framework
full rationale
The paper defines a four-quadrant decomposition of correctness transitions (Correction, Sycophancy, Robustness, Boundary) and assigns rewards accordingly within a standard RL setup for critic interaction. Central claims consist of measured accuracy gains on external benchmarks (ChemBench, TRQA, EarthSE) plus ablations comparing transition-aware rewards to final-answer baselines. No equations or results reduce by construction to fitted inputs, self-citations, or renamed priors; the derivation chain is self-contained with independent empirical validation and publicly released code.
Axiom & Free-Parameter Ledger
invented entities (1)
-
Four quadrants (Correction, Sycophancy, Robustness, Boundary)
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquationwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
ReCrit assigns distinct weights to these quadrants: R = w_corr I[S0=0∧S1=1] + w_rob I[S0=1∧S1=1] − w_syco I[S0=1∧S1=0] − w_boun I[S0=0∧S1=0] with defaults 1.0, 0.6, −1.0, −0.1.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Intern-s1: A scientific multimodal foundation model.arXiv preprint arXiv:2508.15763, 2025
Lei Bai, Zhongrui Cai, Yuhang Cao, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, et al. Intern-s1: A scientific multimodal foundation model.arXiv preprint arXiv:2508.15763, 2025
-
[3]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[4]
Scimaster: Towards general-purpose scientific ai agents, part i
Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Yuzhi Zhang, Linfeng Zhang, Siheng Chen, et al. Scimaster: Towards general-purpose scientific ai agents, part i. x-master as foundation: Can we lead on humanity’s last exam?arXiv preprint arXiv:2507.05241, 2025
-
[5]
Jiaqi Chen, Bang Zhang, Ruotian Ma, Peisong Wang, Xiaodan Liang, Zhaopeng Tu, Xiaolong Li, and Kwan-Yee K Wong. Spc: Evolving self-play critic via adversarial games for llm reasoning.arXiv preprint arXiv:2504.19162, 2025
-
[6]
Rm-r1: Reward modeling as reasoning.arXiv preprint arXiv:2505.02387, 2025
Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, et al. Rm-r1: Reward modeling as reasoning.arXiv preprint arXiv:2505.02387, 2025
-
[7]
ELEPHANT: Measuring and understanding social sycophancy in LLMs
Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. Elephant: Measuring and understanding social sycophancy in llms.arXiv preprint arXiv:2505.13995, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Syceval: Evaluating llm sycophancy
Aaron Fanous, Jacob Goldberg, Ank Agarwal, Joanna Lin, Anson Zhou, Sonnet Xu, Vasiliki Bikia, Roxana Daneshjou, and Sanmi Koyejo. Syceval: Evaluating llm sycophancy. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 8, pages 893–900, 2025
work page 2025
-
[9]
Wei Gao, Yuheng Zhao, Dakai An, Tianyuan Wu, Lunxi Cao, Shaopan Xiong, Ju Huang, Weixun Wang, Siran Yang, Wenbo Su, et al. Rollpacker: Mitigating long-tail rollouts for fast, synchronous rl post-training.arXiv preprint arXiv:2509.21009, 2025
-
[10]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, et al. Asyncflow: An asynchronous streaming rl framework for efficient llm post-training.arXiv preprint arXiv:2507.01663, 2025
-
[12]
Conghui He, Wei Li, Zhenjiang Jin, Chao Xu, Bin Wang, and Dahua Lin. Opendatalab: Empowering general artificial intelligence with open datasets.arXiv preprint arXiv:2407.13773, 2024
-
[13]
Jiseung Hong, Grace Byun, Seungone Kim, Kai Shu, and Jinho D Choi. Measuring sycophancy of language models in multi-turn dialogues.arXiv preprint arXiv:2505.23840, 2025
-
[14]
Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, et al. A survey of scientific large language models: From data foundations to agent frontiers.arXiv preprint arXiv:2508.21148, 2025. 12
-
[15]
Dongwei Jiang, Alvin Zhang, Andrew Wang, Nicholas Andrews, and Daniel Khashabi. Feedback friction: Llms struggle to fully incorporate external feedback.arXiv preprint arXiv:2506.11930, 2025
-
[16]
Zhida Jiang, Zhaolong Xing, Jiawei Lu, Yipei Niu, Qingyuan Sang, Liangxu Zhang, Wenquan Dai, Junhua Shu, Jiaxing Wang, Qiangyu Pei, et al. Rollout-training co-design for efficient llm-based multi-agent reinforcement learning.arXiv preprint arXiv:2602.09578, 2026
-
[17]
Woosuk Kwon.vLLM: An Efficient Inference Engine for Large Language Models. PhD thesis, UC Berkeley, 2025
work page 2025
-
[18]
LLMs Get Lost In Multi-Turn Conversation
Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. Llms get lost in multi-turn conversation.arXiv preprint arXiv:2505.06120, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
Does sycophancy change decisions? effect of llm sycophancy on ai-assisted decision-making
Zejian Li, Jiaman Pan, Qi Liu, Yuning Xi, Yixiang Zhou, Yike Jin, Rongjie Mao, and Pei Chen. Does sycophancy change decisions? effect of llm sycophancy on ai-assisted decision-making. InProceedings of the 2026 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2026
work page 2026
-
[20]
Yongsheng Lian. Comparative analysis and parametric tuning of ppo, grpo, and dapo for llm reasoning enhancement.arXiv preprint arXiv:2512.07611, 2025
-
[21]
Joshua Liu, Aarav Jain, Soham Takuri, Srihan Vege, Aslihan Akalin, Kevin Zhu, Sean O’Brien, and Vasu Sharma. Truth decay: quantifying multi-turn sycophancy in language models.arXiv preprint arXiv:2503.11656, 2025
-
[22]
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023
work page 2023
-
[23]
Veeramakali Vignesh Manivannan, Yasaman Jafari, Srikar Eranky, Spencer Ho, Rose Yu, Duncan Watson-Parris, Yian Ma, Leon Bergen, and Taylor Berg-Kirkpatrick. Climaqa: An automated evaluation framework for climate question answering models.arXiv preprint arXiv:2410.16701, 2024
-
[24]
Adrian Mirza, Nawaf Alampara, Sreekanth Kunchapu, Martiño Ríos-García, Benedict Emoek- abu, Aswanth Krishnan, Tanya Gupta, Mara Schilling-Wilhelmi, Macjonathan Okereke, Anagha Aneesh, et al. A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists.Nature Chemistry, 17(7):1027–1034, 2025
work page 2025
-
[25]
Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models.ACM Transactions on Intelligent Systems and Technology, 16(5):1–72, 2025
work page 2025
-
[26]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022
work page 2022
-
[27]
Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models
Sanskar Pandey, Ruhaan Chopra, Angkul Puniya, and Sohom Pal. Beacon: Single-turn diagnosis and mitigation of latent sycophancy in large language models.arXiv preprint arXiv:2510.16727, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
Ivo Petrov, Jasper Dekoninck, and Martin Vechev. Brokenmath: A benchmark for sycophancy in theorem proving with llms.arXiv preprint arXiv:2510.04721, 2025
-
[29]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023
work page 2023
-
[30]
Towards Understanding Sycophancy in Language Models
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. Towards understanding sycophancy in language models.arXiv preprint arXiv:2310.13548, 2023. 13
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[31]
Xiaofeng Shi, Qian Kou, Yuduo Li, and Hua Zhou. Rethinking supervised fine-tuning: Em- phasizing key answer tokens for improved llm accuracy.arXiv preprint arXiv:2512.21017, 2025
-
[32]
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023
work page 2023
-
[33]
Chuanneng Sun, Songjun Huang, and Dario Pompili. Llm-based multi-agent reinforcement learning: Current and future directions.arXiv preprint arXiv:2405.11106, 2024
-
[34]
Xiangru Tang, Wanghan Xu, Yujie Wang, Zijie Guo, Daniel Shao, Jiapeng Chen, Cixuan Zhang, Ziyi Wang, Lixin Zhang, Guancheng Wan, et al. Eigen-1: Adaptive multi-agent refinement with monitor-based rag for scientific reasoning.arXiv preprint arXiv:2509.21193, 2025
-
[35]
Qwen Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[36]
Guiyao Tie, Zenghui Yuan, Zeli Zhao, Chaoran Hu, Tianhe Gu, Ruihang Zhang, Sizhe Zhang, Junran Wu, Xiaoyue Tu, Ming Jin, et al. Can llms correct themselves? a benchmark of self-correction in llms.arXiv preprint arXiv:2510.16062, 2025
-
[37]
Reinforcement Learning for LLM Post-Training: A Survey
Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Xiang-Bo Mao, Sitaram Asur, et al. A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more.arXiv preprint arXiv:2407.16216, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
Quan Wei, Siliang Zeng, Chenliang Li, William Brown, Oana Frunza, Wei Deng, Anderson Schneider, Yuriy Nevmyvaka, Yang Katie Zhao, Alfredo Garcia, et al. Reinforcing multi-turn reasoning in llm agents via turn-level reward design.arXiv preprint arXiv:2505.11821, 2025
-
[39]
Zhiheng Xi, Jixuan Huang, Xin Guo, Boyang Hong, Dingwen Yang, Xiaoran Fan, Shuo Li, Zehui Chen, Junjie Ye, Siyu Yuan, et al. Critique-rl: Training language models for critiquing through two-stage reinforcement learning.arXiv preprint arXiv:2510.24320, 2025
-
[40]
Stepwiser: Stepwise generative judges for wiser reasoning.arXiv preprint arXiv:2508.19229, 2025
Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, and Sainbayar Sukhbaatar. Stepwiser: Stepwise generative judges for wiser reasoning.arXiv preprint arXiv:2508.19229, 2025
-
[41]
Earthse: A benchmark evaluating earth scientific exploration capability for large language models
Wanghan Xu, Xiangyu Zhao, Yuhao Zhou, Xiaoyu Yue, Ben Fei, Fenghua Ling, Wenlong Zhang, and Lei Bai. Earthse: A benchmark evaluating earth scientific exploration capability for large language models. InThe Fourteenth International Conference on Learning Representations, 2025
work page 2025
-
[42]
Wanghan Xu, Yuhao Zhou, Yifan Zhou, Qinglong Cao, Shuo Li, Jia Bu, Bo Liu, Yixin Chen, Xuming He, Xiangyu Zhao, et al. Probing scientific general intelligence of llms with scientist- aligned workflows.arXiv preprint arXiv:2512.16969, 2025
-
[43]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
Chen Zhang, Xinyi Dai, Yaxiong Wu, Qu Yang, Yasheng Wang, Ruiming Tang, and Yong Liu. A survey on multi-turn interaction capabilities of large language models.arXiv preprint arXiv:2501.09959, 2025
-
[45]
Hao Zhang, Mingjie Liu, Shaokun Zhang, Songyang Han, Jian Hu, Zhenghui Jin, Yuchi Zhang, Shizhe Diao, Ximing Lu, Binfeng Xu, et al. Prorl agent: Rollout-as-a-service for rl training of multi-turn llm agents.arXiv preprint arXiv:2603.18815, 2026
-
[46]
Critique-grpo: Advancing llm reasoning with natural language and numerical feedback
Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang, and Helen Meng. Critique-grpo: Advancing llm reasoning with natural language and numerical feedback. arXiv preprint arXiv:2506.03106, 2025
-
[47]
Zhongyue Zhang, Zijie Qiu, Yingcheng Wu, Shuya Li, Dingyan Wang, Yong Liu, Zhuomin Zhou, Yusong Hu, Yuhan Chen, Duo An, et al. Origene: A self-evolving virtual disease biologist automating therapeutic target discovery.BioRxiv, pages 2025–06, 2025. 14
work page 2025
-
[48]
A Survey of Large Language Models
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models.arXiv preprint arXiv:2303.18223, 1(2):1–124, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[49]
Swift: a scalable lightweight infrastructure for fine-tuning
Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, et al. Swift: a scalable lightweight infrastructure for fine-tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29733–29735, 2025
work page 2025
-
[50]
Yicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou, Peiheng Zhou, Xinyu Zhou, Dongzhan Zhou, Zhiwang Zhou, Yuhao Zhou, et al. Intern-s1-pro: Scientific multimodal foundation model at trillion scale.arXiv preprint arXiv:2603.25040, 2026. 15 A Training and Evaluation Details We use ms-swift [49] as the training framework. Tables 4–11 report the sett...
-
[52]
After </think>, provide only the final answer. For multiple-choice questions, output only the option label(s) (e.g., A, B, C, D, BC, DI, ADI, etc.) without including the option content or any explanatory phrases such as "\boxed{A}" or "The answer is C." Initial Solution <think> In cationic polymerization, monomer reactivity is governed by the stability of...
work page 2022
-
[54]
After </think>, provide only the final answer. For multiple-choice questions, output only the option label(s) (e.g., A, B, C, D, BC, DI, ADI, etc.) without including the option content or any explanatory phrases such as "\boxed{A}" or "The answer is C." Initial Solution <think> ABCA1 is a transporter involved in cholesterol efflux, and its suppression in ...
work page 2022
-
[55]
Clearly show your reasoning process enclosed within <think> and </think>
-
[56]
After </think>, provide only the final answer. For multiple-choice questions, output only the option label(s) (e.g., A, B, C, D, BC, DI, ADI, etc.) without including the option content or any explanatory phrases such as "\boxed{A}" or "The answer is C." Initial Solution <think> The question asks about the primary advantage of the one-parameter wing-scalin...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.