Weak-Driven Learning: How Weak Agents make Strong Agents Stronger

Deqing Wang; Fuzhen Zhuang; Gongxun Li; Jianxin Li; Tianxiang Ai; Wang Zhou; Xianglong Liu; XiaoDong Liu; Yifei Li; Yikun Ban

arxiv: 2602.08222 · v2 · pith:XDVNIDWInew · submitted 2026-02-09 · 💻 cs.AI

Zehao Chen, Gongxun Li, Tianxiang Ai, Zixuan Huang, Xiaodong Liu, Yifei Li, Wang Zhou, Fuzhen Zhuang, Xianglong Liu, Jianxin Li, Deqing Wang, Yikun Ban

classification 💻 cs.AI

keywords agents, weak, learning, models, post-training, strong, make, optimization

As post-training optimization becomes central to improving large language models, we observe a persistent saturation bottleneck: once models grow highly confident, further training yields diminishing returns. While existing methods continue to reinforce target predictions, we find that informative supervision signals remain latent in models' own historical weak states. Motivated by this observation, we propose WMSS (Weak Agents Can Make Strong Agents Stronger), a post-training paradigm that leverages weak checkpoints to guide continued optimization. By identifying recoverable learning gaps via entropy dynamics and reinforcing them through compensatory learning, WMSS enables strong agents to improve beyond conventional post-training saturation. Experiments on mathematical reasoning and code generation datasets show that agents trained with our approach achieve effective performance improvements, while incurring zero additional inference cost.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Policy Improvement Reinforcement Learning
cs.LG · 2026-04 · unverdicted · novelty 6.0

PIRL maximizes cumulative policy improvement across iterations instead of surrogate rewards and is proven aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization.