A new Gym environment for medical AI agents reveals that multi-turn RL collapses under sparse rewards; Turn-level Truncated On-Policy Distillation (TT-OPD) addresses this, yielding gains of +3.9 pp on clinical benchmarks.
On PathVQA (He et al., 2020), TT-OPD reaches 45.3%, outperforming both base text (40.5%) and GRPO (41.5%).
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
- Healthcare AI GYM for Medical Agents