MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning, 2024

Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, Greg Durrett · 2024

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

cs.LG · 2025-10-27 · unverdicted · novelty 5.0

GIFT matches the optimal policy of GRPO using an endogenous prompt-dependent KL coefficient derived via z-score standardization of implicit rewards.

GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA · cs.LG · 2025-10-27 · unverdicted · none · ref 23