A trajectory-aware process reward using DTW on sentence embeddings, combined with exact-match in GRPO after SFT, raises mean medical VQA accuracy from 0.598 to 0.689 across six benchmarks.
Advances in neural information processing systems35, 26337–26349 (2022)
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
Improving Medical VQA through Trajectory-Aware Process Supervision
A trajectory-aware process reward using DTW on sentence embeddings, combined with exact-match in GRPO after SFT, raises mean medical VQA accuracy from 0.598 to 0.689 across six benchmarks.