International Conference on Machine Learning
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
citing papers explorer
- Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients
  HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.
- Value-Gradient Hypothesis of RL for LLMs
  Shows that under differentiable rollouts with additive noise, actor updates in critic-free RL for LLMs are value-gradient-like in expectation, motivating a decomposition into value signal and reward headroom for when RL is most effective.
- What Makes a Representation Good for Single-Cell Perturbation Prediction?
  PerturbedVAE disentangles perturbation-specific signals from invariant gene expression structure to recover causal representations and improve out-of-distribution prediction in single-cell perturbation modeling.