Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
Pith reviewed 2026-05-13 01:15 UTC · model grok-4.3
The pith
Seirênes trains one LLM to generate its own distracting contexts and then solve the underlying problems despite them, producing gains of 7 to 10 points on mathematical reasoning benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Seirênes places a single model inside a parameter-shared adversarial self-play loop: the model must simultaneously construct distracting contexts that expose its own reasoning gaps and solve the original problems by isolating the essential logic from those perturbations. The two opposing objectives are trained together with verifiable rewards, producing a co-evolutionary process that continues as the model improves.
What carries the argument
The parameter-shared adversarial self-play loop in which the model generates evolving distracting contexts while also learning to extract and solve the core problem from those contexts.
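In code, the loop reduces to a small pattern. The sketch below is a toy illustration (DummyModel, verify, and self_play_step are hypothetical stand-ins, not the paper's implementation) of how one set of weights earns two opposing verifiable rewards per problem; a real system would also check that the distraction preserves the gold answer.

```python
# Hypothetical sketch of one parameter-shared self-play step.
# DummyModel stands in for the single LLM that plays both roles;
# nothing here is the paper's actual training code.
import random

class DummyModel:
    def generate_distraction(self, problem: str) -> str:
        # Role 1 (adversary): wrap the clean problem in plausible noise.
        return f"{problem} Incidentally, a similar-looking puzzle has answer 13."

    def solve(self, context: str) -> str:
        # Role 2 (solver): placeholder for decoding an answer.
        return random.choice(["4", "13"])

def verify(answer: str, gold: str) -> bool:
    """Verifiable 0/1 reward: exact match against the known answer."""
    return answer.strip() == gold.strip()

def self_play_step(model: DummyModel, problem: str, gold: str):
    distracted = model.generate_distraction(problem)
    answer = model.solve(distracted)
    solved = verify(answer, gold)
    # Opposing objectives on one set of weights: the solver is rewarded
    # for answering correctly, the generator for inducing failure.
    solver_reward = 1.0 if solved else 0.0
    generator_reward = 1.0 - solver_reward
    return solver_reward, generator_reward

print(self_play_step(DummyModel(), "What is 2 + 2?", "4"))
```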
If this is right
- The same model scale achieves higher accuracy on standard clean benchmarks after exposure to self-generated distractions.
- Distractions created by a 4B-parameter model lower the accuracy of much larger closed-source models by 4 to 5 points.
- The adversarial loop maintains an informative curriculum because each improvement in one objective immediately challenges the other.
- The method works across model sizes from 4B to 30B parameters without requiring separate generator and solver networks.
Where Pith is reading between the lines
- The approach could be tested on non-mathematical tasks such as code generation or multi-step planning where context noise is common.
- Distraction generators trained this way might serve as diagnostic tools to map blind spots in other reasoning systems.
- Long-term stability of the loop may require additional controls on distraction complexity to prevent collapse into trivial or repetitive noise.
Load-bearing premise
The measured gains come from the model learning genuinely more robust reasoning rather than from memorizing or adapting to the particular style of distractions that appear during its own training.
What would settle it
Test the trained models on math problems that contain entirely new categories of irrelevant or misleading text never produced by the self-play generator, such as human-written tangential instructions or novel incidental correlations.
Original abstract
We present Seirênes, a self-play RL framework that transforms contextual interference from a failure mode of LLM reasoning into an internal training signal for co-evolving more resilient reasoners. While RL with verifiable rewards has significantly advanced reasoning capabilities, models can still exhibit fragility when encountering non-idealized contexts: scenarios characterized by superfluous information, tangential instructions, or incidental correlations that differ from the clean distributions typical of standard benchmarks. Seirênes harnesses this vulnerability through a parameter-shared and adversarial self-play loop. Within this framework, a single model is trained to both construct plausible yet distracting contexts that expose its own reasoning blind spots, and solve problems by discerning the essential task from these perturbations to recover the core underlying logic. By pitting these competing objectives against each other, Seirênes compels the model to move beyond superficial pattern matching and anchors its capabilities in robust underlying reasoning. This continuous interaction sustains an informative co-evolutionary curriculum as the model improves. Across seven mathematical reasoning benchmarks and model scales from 4B to 30B, Seirênes achieves average gains of +10.2, +9.1, and +7.2 points. Besides, distracting contexts produced by the 4B Seirênes model reduce the accuracy of top-tier closed-source models (GPT and Gemini) by roughly 4–5 points, revealing Seirênes' general ability to uncover reasoning models' blind spots.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Seirênes, a parameter-shared adversarial self-play RL framework in which a single LLM is trained to both generate plausible distracting contexts for mathematical problems and to solve those problems by recovering the core logic from the perturbations. The central claim is that this co-evolutionary loop produces an evolving curriculum of non-trivial distractions that yields more robust reasoning, evidenced by average gains of +10.2, +9.1, and +7.2 points across seven math benchmarks for 4B–30B models and by the ability of 4B-generated distractions to reduce accuracy of GPT and Gemini models by 4–5 points.
Significance. If the gains are shown to arise from genuinely improved disambiguation rather than overfitting to the generator’s output distribution, the method would offer an efficient, single-model route to stress-testing and hardening LLM reasoning against realistic contextual noise. The reported transfer of distractions to closed-source models is a notable strength, as it suggests the generated perturbations expose general blind spots rather than model-specific artifacts.
Major comments (3)
- [Results / Experimental Setup] The abstract and results sections report large average gains but supply no information on the baselines (standard SFT, non-adversarial RL, or data-augmentation controls), statistical tests, or variance across runs. Without these, it is impossible to determine whether the +7–10 point improvements are attributable to the adversarial loop or simply to additional gradient steps on the same math data.
- [Method / Ablations] No ablation isolates the adversarial self-play component (e.g., generator vs. fixed distraction policy, or RL with verifiable rewards alone). The central claim that the co-evolutionary curriculum forces generalizable reasoning therefore rests on an untested assumption; the observed gains could equally result from the solver simply learning to filter the particular style of perturbations produced by its own generator.
- [Training Dynamics / Evaluation] The manuscript provides no quantitative tracking of distraction quality over training (semantic distance from the core problem, diversity metrics, or per-epoch solver failure rates). Consequently there is no evidence that the curriculum actually evolves to become harder rather than collapsing to repetitive or easily ignored patterns, which directly undermines the claim of a “continuous interaction [that] sustains an informative co-evolutionary curriculum.”
Minor comments (2)
- [Abstract] The seven mathematical reasoning benchmarks are not named in the abstract or early sections; listing them explicitly (with citations) would improve reproducibility.
- [Method] The notation for the two roles of the shared model (generator vs. solver) is introduced informally; a clear definition of the joint objective and reward signals would clarify the adversarial loop.
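For concreteness, here is one plausible formalization of that joint objective, a reviewer's sketch rather than the paper's notation: let π_θ be the single shared policy prompted in a generator role and a solver role, x a clean problem with verified answer y*, x̃ the generated distracting context, and v(x̃, x) ∈ {0, 1} a validity check that the perturbation preserves the gold answer.

```latex
% Sketch of a plausible joint objective (reviewer's reconstruction,
% not the paper's notation).
\begin{align*}
  r_{\mathrm{sol}}(y) &= \mathbf{1}[y = y^{*}]
    && \text{verifiable solver reward} \\
  r_{\mathrm{gen}}(\tilde{x}, y) &= v(\tilde{x}, x)\,\bigl(1 - \mathbf{1}[y = y^{*}]\bigr)
    && \text{generator paid only for valid, failure-inducing contexts} \\
  J(\theta) &= \mathbb{E}_{x}\,
    \mathbb{E}_{\tilde{x} \sim \pi_{\theta}^{\mathrm{gen}}(\cdot \mid x)}\,
    \mathbb{E}_{y \sim \pi_{\theta}^{\mathrm{sol}}(\cdot \mid \tilde{x})}
    \bigl[\lambda\, r_{\mathrm{sol}}(y) + (1 - \lambda)\, r_{\mathrm{gen}}(\tilde{x}, y)\bigr]
\end{align*}
```

With λ balancing the two roles, r_gen rises exactly when r_sol falls on valid contexts, so the objectives are adversarial while sharing one set of parameters θ.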
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the presentation of our work. We address each major comment below, providing clarifications and committing to revisions where appropriate to enhance the experimental evidence for the co-evolutionary benefits of Seirênes.
Point-by-point responses
- Referee: [Results / Experimental Setup] The abstract and results sections report large average gains but supply no information on the baselines (standard SFT, non-adversarial RL, or data-augmentation controls), statistical tests, or variance across runs. Without these, it is impossible to determine whether the +7–10 point improvements are attributable to the adversarial loop or simply to additional gradient steps on the same math data.
  Authors: We acknowledge the importance of these controls for isolating the effect of the adversarial self-play. The manuscript emphasizes the novel framework but does not include direct comparisons to non-adversarial RL or SFT with equivalent compute. In the revised manuscript, we will incorporate baselines including standard supervised fine-tuning, RL with verifiable rewards but without the generator, and data augmentation using static distractions. We will also report standard deviations across 3–5 random seeds and perform statistical significance tests (e.g., paired t-tests) to substantiate the gains (a concrete sketch of this analysis appears after this list). Revision: yes.
- Referee: [Method / Ablations] No ablation isolates the adversarial self-play component (e.g., generator vs. fixed distraction policy, or RL with verifiable rewards alone). The central claim that the co-evolutionary curriculum forces generalizable reasoning therefore rests on an untested assumption; the observed gains could equally result from the solver simply learning to filter the particular style of perturbations produced by its own generator.
  Authors: This is a valid concern regarding the source of the improvements. To address it, we will add an ablation study in the revision where we compare the full Seirênes setup against a variant with a fixed generator policy (trained separately and frozen) and against standard RL without adversarial generation. These experiments will demonstrate whether the dynamic co-evolution is necessary for the observed robustness, particularly in the transfer to closed-source models. Revision: yes.
- Referee: [Training Dynamics / Evaluation] The manuscript provides no quantitative tracking of distraction quality over training (semantic distance from the core problem, diversity metrics, or per-epoch solver failure rates). Consequently there is no evidence that the curriculum actually evolves to become harder rather than collapsing to repetitive or easily ignored patterns, which directly undermines the claim of a "continuous interaction [that] sustains an informative co-evolutionary curriculum."
  Authors: We agree that quantitative evidence of curriculum evolution would bolster the claims. Although the paper includes qualitative examples of distraction progression, we will add quantitative analyses in the revision, such as tracking the average embedding distance between generated distractions and core problems, lexical diversity metrics, and solver failure rates on a held-out set of problems over training epochs. This will provide direct support for the evolving nature of the curriculum (a sketch of the tracking appears after this list). Revision: yes.
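As a concrete sketch of the analysis committed to in the first response (illustrative per-seed accuracies and a scipy paired t-test; not the authors' numbers or code):

```python
# Sketch of the promised variance reporting and significance testing.
# The per-seed accuracies below are illustrative placeholders.
import numpy as np
from scipy import stats

seirenes  = np.array([71.2, 70.5, 72.0, 71.6, 70.9])  # 5 seeds, full method
rlvr_only = np.array([63.8, 64.4, 63.1, 64.0, 63.5])  # RL w/o the generator

print(f"Seirênes: {seirenes.mean():.1f} ± {seirenes.std(ddof=1):.1f}")
print(f"Baseline: {rlvr_only.mean():.1f} ± {rlvr_only.std(ddof=1):.1f}")

# Seeds are matched across conditions, so a paired test is appropriate.
t, p = stats.ttest_rel(seirenes, rlvr_only)
print(f"paired t = {t:.2f}, p = {p:.4f}")
```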
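The tracking proposed in the third response maps onto off-the-shelf pieces. A sketch follows, assuming sentence-transformers embeddings and a distinct-2 diversity proxy; the encoder choice and metric definitions are this review's assumptions, not the paper's.

```python
# Sketch of per-epoch curriculum tracking: embedding distance between
# core problems and their distractions, plus lexical diversity.
# Illustrative; the encoder and metrics are this sketch's assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def mean_embedding_distance(problems, distractions):
    p = encoder.encode(problems, normalize_embeddings=True)
    d = encoder.encode(distractions, normalize_embeddings=True)
    # Cosine distance between each problem and its paired distraction;
    # a collapsing generator would drive this toward a constant.
    return float(np.mean(1.0 - np.sum(p * d, axis=1)))

def distinct_n(texts, n=2):
    grams, total = set(), 0
    for t in texts:
        toks = t.split()
        ngrams = list(zip(*(toks[i:] for i in range(n))))
        grams.update(ngrams)
        total += len(ngrams)
    # Share of unique n-grams: lower values flag repetitive distractions.
    return len(grams) / max(total, 1)
```

Logged each epoch alongside the solver's failure rate on freshly generated distractions, these curves would show directly whether the curriculum hardens or collapses.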
Circularity Check
Empirical self-play RL framework with external benchmark evaluation; no circular derivation
Full rationale
The paper presents Seirênes as an adversarial self-play RL training procedure in which a parameter-shared model alternately generates distracting contexts and solves the underlying math problems. Reported results consist of average accuracy gains (+10.2, +9.1, +7.2 points) measured on seven independent mathematical reasoning benchmarks across model scales 4B–30B, plus transfer attacks on closed-source models. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes are invoked to derive the gains; the method is a standard RL loop whose outputs are evaluated against fixed external test sets. This is a conventional empirical contribution whose central claims rest on observable benchmark deltas rather than any self-referential reduction.