TRFP combines rectified flow models with truncation to support multimodal policies in MaxEnt RL while allowing fast one-step sampling and stable training.
Diffusion actor-critic with entropy regulator.Advances in Neural Information Processing Systems, 37:54183–54204, 2024
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2representative citing papers
citing papers explorer
-
Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling
TRFP combines rectified flow models with truncation to support multimodal policies in MaxEnt RL while allowing fast one-step sampling and stable training.
- STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens