Spell: Self-play reinforcement learning for evolving long-context language models

Ziyi Yang, Weizhou Shen, Chenliang Li, Ruijun Chen, Fanqi Wan, Ming Yan, Xiaojun Quan, Fei Huang · 2025 · arXiv 2509.23863

3 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

SELF-EMO: Emotional Self-Evolution from Recognition to Consistent Expression

cs.AI · 2026-04-20 · unverdicted · novelty 7.0

SELF-EMO lets LLMs bootstrap better emotion recognition and expression via self-play, data flywheel filtering with smoothed IoU rewards, and SELF-GRPO reinforcement learning, yielding SOTA gains on IEMOCAP, MELD, and EmoryNLP.

EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

EvoVid proposes a temporal-centric self-evolution framework for Video-LLMs that uses temporal-aware Questioner and temporal-grounded Solver rewards to improve performance directly from unannotated videos.

D²Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning

cs.LG · 2026-05-16 · unverdicted · novelty 5.0

D²Evo mines medium-difficulty anchors from the current model, trains a Questioner to generate matching questions, and jointly optimizes Solver and Questioner for progressive gains, outperforming baselines on math reasoning with under 2K real samples.

citing papers explorer

Showing 3 of 3 citing papers.

SELF-EMO: Emotional Self-Evolution from Recognition to Consistent Expression

cs.AI · 2026-04-20 · unverdicted · none · ref 20

EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

cs.CV · 2026-05-21 · unverdicted · none · ref 9

D²Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning

cs.LG · 2026-05-16 · unverdicted · none · ref 44
