Self-play only evolves when self-synthetic pipeline ensures learnable information gain. arXiv preprint arXiv:2603.02218

Wei Liu, Siya Qi, Yali Du, Yulan He · 2026 · cs.LG · arXiv 2603.02218

5 Pith papers cite this work. Polarity classification is still indexing.

abstract

Large language models (LLMs) make it plausible to build systems that improve through self-evolving loops, but many existing proposals are better understood as self-play and often plateau quickly. A central failure mode is that the loop synthesises more data without increasing learnable information for the next iteration. Through experiments on a self-play coding task, we reveal that sustainable self-evolution requires a self-synthesised data pipeline with learnable information that increases across iterations. We identify triadic roles that self-evolving LLMs play: the Proposer, which generates tasks; the Solver, which attempts solutions; and the Verifier, which provides training signals, and we identify three system designs that jointly target learnable information gain from this triadic roles perspective. Asymmetric co-evolution closes a weak-to-strong-to-weak loop across roles. Capacity growth expands parameter and inference-time budgets to match rising learnable information. Proactive information seeking introduces external context and new task sources that prevent saturation. Together, these modules provide a measurable, system-level path from brittle self-play dynamics to sustained self-evolution.
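
As a rough illustration of the triadic roles named in the abstract, the sketch below wires a Proposer, Solver, and Verifier into a toy loop. Everything in it is an assumption made for illustration and does not come from the paper: the integer "difficulty" and "skill" levels, the function names, and the proxy for learnable information (the number of verified-failed tasks exactly one step beyond the solver's current skill).

```python
from dataclasses import dataclass


@dataclass
class Solver:
    # Hypothetical: highest task difficulty the solver can currently solve.
    skill: int

    def attempt(self, difficulty):
        return difficulty <= self.skill


def propose(max_difficulty, n_tasks):
    """Toy Proposer: cycles through difficulties 1..max_difficulty."""
    return [1 + (i % max_difficulty) for i in range(n_tasks)]


def verify(solver, difficulty):
    """Toy Verifier: reports whether the solver's attempt succeeds."""
    return solver.attempt(difficulty)


def learnable_tasks(solver, tasks):
    """Assumed proxy for learnable information: failed tasks that sit
    exactly one step above the solver's current skill."""
    return [t for t in tasks if not verify(solver, t) and t == solver.skill + 1]


def run_loop(iterations, proposer_grows, n_tasks=8):
    """Return the per-iteration count of learnable tasks."""
    solver = Solver(skill=1)
    max_difficulty = 2
    history = []
    for _ in range(iterations):
        tasks = propose(max_difficulty, n_tasks)
        signal = len(learnable_tasks(solver, tasks))
        history.append(signal)
        if signal > 0:
            solver.skill += 1      # the solver learns only from a nonzero signal
        if proposer_grows:
            max_difficulty += 1    # the proposer widens its task range
    return history


print(run_loop(4, proposer_grows=False))  # [4, 0, 0, 0]
print(run_loop(4, proposer_grows=True))   # [4, 2, 2, 1]
```

With a fixed proposer the loop keeps synthesising tasks but the assumed signal drops to zero after one iteration, which mirrors the plateau the abstract describes. In this toy the signal stays nonzero only when the proposer's range grows. It does not model the paper's actual experiments or its measure of learnable information.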

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

What Do Evolutionary Coding Agents Evolve?

cs.NE · 2026-05-19 · unverdicted · novelty 7.0

Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.

Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Experiments on coding and deterministic tasks demonstrate that data gating is sufficient for self-play stability while reward variants are not, revealing the Grounded Proposer Paradox and a two-stage phase transition under continuous gate strictness.

Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

cs.CL · 2026-04-03 · unverdicted · novelty 6.0

Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.

Towards Self-Improving Error Diagnosis in Multi-Agent Systems

cs.MA · 2026-04-19 · unverdicted · novelty 5.0

ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.

citing papers explorer

Showing 5 of 5 citing papers.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · none · ref 27

What Do Evolutionary Coding Agents Evolve?

cs.NE · 2026-05-19 · unverdicted · none · ref 31

Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL

cs.LG · 2026-05-21 · unverdicted · none · ref 28

Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

cs.CL · 2026-04-03 · unverdicted · none · ref 22

Towards Self-Improving Error Diagnosis in Multi-Agent Systems

cs.MA · 2026-04-19 · unverdicted · none · ref 63
