Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
Pith papers citing it: 2
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
Fixed-interface state transfer provides stronger evidence of internal reuse in controlled routing than prompt retraining success alone.
citing papers explorer
- Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization
  Introduces TA-MDP and proves GRPO convergence at O(1/sqrt(T)), a reward decomposition bound, and PAC-Bayes generalization for tool-augmented LVLM policies.
- State Transfer Reveals Reuse in Controlled Routing
  Fixed-interface state transfer provides stronger evidence of internal reuse in controlled routing than prompt retraining success alone.