Jointly reinforcing diversity and quality in language model generations
2 Pith papers cite this work. Polarity classification is still indexing.

Citation summary
Fields: cs.CL (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Roles: background (1)
Polarities: background (1)

Citing papers
- Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs
  Diversity collapse in LLMs arises from order and shape miscalibration in token probability distributions at inference time, not from sampling methods.
- Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
  CIPO jointly optimizes standard RLVR (reinforcement learning with verifiable rewards) rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.