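$$\mathrm{Var}_r(\hat{g}_{\text{out}} \mid \tau) = \|\nabla_\theta \log \pi(\tau)\|^2 \cdot \mathrm{Var}(r_{\text{out}}) = C_\tau \cdot \sigma^2_{\text{out}} \tag{21}$$

$$\mathrm{Var}_r(\hat{g}_{\text{sem}} \mid \tau) = \|\nabla_\theta \log \pi(\tau)\|^2 \cdot \mathrm{Var}(r_{\text{sem}}) = C_\tau \cdot \sigma^2_{\text{sem}} \tag{22}$$

where $C_\tau = \|\nabla_\theta \log \pi(\tau)\|^2$.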
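The step behind the factor $C_\tau$ can be sketched as follows. This assumes the estimator has the REINFORCE form $\hat{g} = \nabla_\theta \log \pi(\tau)\, r$, which the excerpt does not show, and reads the variance of a vector as the trace of its covariance. Once $\tau$ is fixed, the score is a constant and only the reward $r$ is random:

```latex
\begin{align*}
\mathrm{Var}_r(\hat{g} \mid \tau)
  &= \mathbb{E}_r\!\left[\,\bigl\|\hat{g} - \mathbb{E}_r[\hat{g} \mid \tau]\bigr\|^2 \,\middle|\, \tau\right] \\
  &= \|\nabla_\theta \log \pi(\tau)\|^2 \;
     \mathbb{E}_r\!\left[\,\bigl(r - \mathbb{E}_r[r \mid \tau]\bigr)^2 \,\middle|\, \tau\right] \\
  &= C_\tau \cdot \mathrm{Var}(r \mid \tau)
\end{align*}
```

Substituting $r_{\text{out}}$ or $r_{\text{sem}}$ for $r$ gives the two expressions above under these assumptions.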

Variance due to Reward Noise (Second Term): The second term, $\mathbb{E}_\tau[\mathrm{Var}_r(\hat{g} \mid \tau)]$, represents the variance due to reward signal noise for a fixed trajectory
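The conditional variance for a fixed trajectory can be checked numerically with a minimal sketch. It assumes the estimator is the score vector multiplied by a scalar noisy reward, and that the variance of the vector estimator means the sum of per-coordinate variances. The score vector, the mean reward, and the two noise levels are hypothetical values chosen for illustration, not figures from the paper.

```python
import numpy as np

def conditional_gradient_variance(score, reward_std, n_samples=200_000, seed=0):
    """Monte Carlo estimate of Var_r(g_hat | tau) for one fixed trajectory.

    Assumes g_hat = score * r, where `score` stands in for grad log pi(tau)
    and r is a noisy scalar reward (hypothetical Gaussian noise, mean 1.0).
    """
    rng = np.random.default_rng(seed)
    rewards = 1.0 + reward_std * rng.standard_normal(n_samples)
    grads = rewards[:, None] * score[None, :]  # shape (n_samples, dim)
    # Total variance = sum of per-coordinate variances (trace of covariance).
    return float(grads.var(axis=0).sum())

# Hypothetical score vector for the fixed trajectory tau.
score = np.array([0.5, -1.0, 2.0])
c_tau = float(score @ score)  # C_tau = ||grad log pi(tau)||^2

# Hypothetical reward noise levels for the two reward signals.
sigma_out, sigma_sem = 0.3, 0.1

var_out = conditional_gradient_variance(score, sigma_out)
var_sem = conditional_gradient_variance(score, sigma_sem)

print("empirical vs C_tau * sigma_out^2:", var_out, c_tau * sigma_out**2)
print("empirical vs C_tau * sigma_sem^2:", var_sem, c_tau * sigma_sem**2)
```

With enough samples, each empirical value should land close to $C_\tau \cdot \sigma^2$ for its noise level, so the two cases differ only through the reward variance.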

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.


