HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.
International Conference on Machine Learning
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
roles: background (1)
polarities: background (1)
citing papers explorer
- Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients
  HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.
- Value-Gradient Hypothesis of RL for LLMs
  Shows that under differentiable rollouts with additive noise, actor updates in critic-free RL for LLMs are value-gradient-like in expectation, motivating a decomposition into value signal and reward headroom for when RL is most effective.
- What Makes a Representation Good for Single-Cell Perturbation Prediction?
  PerturbedVAE disentangles perturbation-specific signals from invariant gene expression structure to recover causal representations and improve out-of-distribution prediction in single-cell perturbation modeling.