UniR is a composable reasoning module trained with verifiable rewards and added to frozen LLMs via logit summation, enabling modular composition and weak-to-strong generalization across tasks and model sizes.
GenARM: Reward guided generation with autoregressive reward model for test-time alignment
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
background 1
citation-polarity summary
roles: background 1
polarities: background 1
representative citing papers
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.
Reward-weighted classifier-free guidance approximates Q-function policy improvement in autoregressive models, enabling test-time reward optimization and faster RL convergence via distillation.