RL with KL penalties is better viewed as Bayesian inference

URL https://arxiv · 2022 · arXiv 2205.11275

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

The tractability landscape of diffusion alignment: regularization, rewards, and computational primitives

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

The choice of closeness measure in diffusion reward alignment determines the computational primitives and tractable reward classes, with linear exponential tilts sufficing for KL with convex rewards and proximal oracles for Wasserstein with concave or low-dimensional Lipschitz rewards.

Reinforcement Learning via Value Gradient Flow

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

A Unifying Lens on Reward Uncertainty in RLHF

cs.LG · 2026-06-08 · unverdicted · novelty 6.0

A distributional reward model $p(r \mid x, y)$ yields the closed-form effective reward $\tilde r(x,y) = \beta \log \mathbb{E}_p[e^{r/\beta}]$ (pessimistic branch) that unifies prior RLHF aggregation heuristics under Bayesian or KL-DRO views.

Binary Rewards and Reinforcement Learning: Fundamental Challenges

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Binary rewards make the set of reward-maximizing policies infinite in policy gradients; KL control selects the filtered base model but misspecification drives collapse to concentrated valid outputs instead.

Scaling Laws for Reward Model Overoptimization

cs.LG · 2022-10-19 · unverdicted · novelty 6.0

Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when optimizing proxy reward models via RL or best-of-n, with coefficients scaling smoothly by reward model parameter count.

On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective

cs.AI · 2026-05-08 · unverdicted · novelty 5.0

Post-training reweights a pretrained model's behavior distribution either within its existing accessible support (elicitation) or by expanding that support (creation), with both SFT and RL acting as free-energy minimization under different signals.

Exponential families from a single KL identity

cs.LG · 2026-04-30 · accept · novelty 5.0

One KL-difference identity plus non-negativity of KL derives convexity of the log-partition function, Gibbs variational principle, Pythagorean theorems, and tilting formulas for exponential families.

citing papers explorer

Showing 1 of 1 citing paper after filters.

On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective cs.AI · 2026-05-08 · unverdicted · none · ref 30