Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

Beier Zhu; Chaojun Xiao; Chen Chen; Chengwei Qin; Hehai Lin; Keming Wu; Sudong Wang; Weiquan Huang; Wenxuan Wang; Xiaomin Yu; Yunjian Zhang; Zuhao Yang

arXiv: 2604.28123 · v2 · submitted 2026-04-30 · 💻 cs.CV · cs.AI · cs.CL

Sudong Wang, Weiquan Huang, Xiaomin Yu, Zuhao Yang, Hehai Lin, Keming Wu, Chaojun Xiao, Chen Chen, Wenxuan Wang, Beier Zhu, Yunjian Zhang, Chengwei Qin

Pith reviewed 2026-05-07 07:38 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords multimodal reasoning · reinforcement learning · on-policy distillation · distributional drift · supervised fine-tuning · mixture of experts · PRISM pipeline · Qwen3-VL

The pith

Inserting a black-box on-policy distillation stage after SFT corrects distributional drift and raises final multimodal RL accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard post-training for large multimodal models uses supervised fine-tuning on demonstrations followed by reinforcement learning with verifiable rewards. The SFT step introduces distributional drift that neither preserves original capabilities nor matches the target supervision, and this drift is worse in multimodal reasoning because perception errors and reasoning errors follow separate patterns that compound later. PRISM inserts an explicit alignment phase that frames the problem as a black-box adversarial game: the policy generates responses while a mixture-of-experts discriminator with dedicated perception and reasoning experts supplies response-level corrective signals. The alignment uses 1.26 million public demonstrations plus 113 thousand newly curated high-fidelity examples from Gemini 3 Flash. Experiments on Qwen3-VL models show that the aligned policy then produces higher accuracy after RLVR across three different RL algorithms and multiple benchmarks.

Core claim

PRISM is a three-stage pipeline that places an on-policy distillation alignment stage between SFT and RLVR. In this stage the policy plays a black-box adversarial game against a perception-reasoning MoE discriminator that returns disentangled corrective signals at the response level, steering the policy toward the supervision distribution without any access to teacher logits or internal states. The resulting policy, when passed to RLVR, yields higher final accuracy on multimodal reasoning tasks.

What carries the argument

Black-box on-policy distillation game against a Mixture-of-Experts discriminator whose perception expert and reasoning expert separately score responses and return disentangled corrective signals.
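
To make the idea of separately scored, response-level signals concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the `MoEDiscriminator` class, the fixed `perception_weight` gate, and the two toy experts are all hypothetical. The toy experts only check for the `<caption>` and `<think>` sections of the structured output format; real experts would be trained models.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical interface: an expert maps a full response string to a raw logit.
Expert = Callable[[str], float]


def sigmoid(x: float) -> float:
    """Squash a raw logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class MoEDiscriminator:
    """Toy two-expert discriminator; not the paper's architecture."""
    perception_expert: Expert
    reasoning_expert: Expert
    perception_weight: float = 0.5  # assumed fixed gate; a learned gate is equally plausible

    def score(self, response: str) -> Dict[str, float]:
        # Each expert scores the response independently, so the two
        # signals stay separate until the final weighted mix.
        p = sigmoid(self.perception_expert(response))
        r = sigmoid(self.reasoning_expert(response))
        w = self.perception_weight
        return {"perception": p, "reasoning": r, "combined": w * p + (1.0 - w) * r}


# Placeholder experts that only look for the expected output sections.
def toy_perception(response: str) -> float:
    return 1.0 if "<caption>" in response else -1.0


def toy_reasoning(response: str) -> float:
    return 1.0 if "<think>" in response else -1.0


disc = MoEDiscriminator(toy_perception, toy_reasoning)
full = disc.score("<caption>a bar chart</caption><think>compare bars</think><answer>B</answer>")
no_caption = disc.score("<think>compare bars</think><answer>B</answer>")
print(full["combined"] > no_caption["combined"])  # True
```

Only the sampled response text is passed in, which is the sense in which such a setup is black-box: nothing here needs teacher logits or internal states.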

If this is right

The aligned policy improves average accuracy after RLVR by 4.4 points on 4B models and 6.0 points on 8B models across GRPO, DAPO, and GSPO.
The gains appear consistently across diverse multimodal benchmarks when the same RL algorithms are used.
Only 113K additional high-fidelity demonstrations are required for the alignment stage beyond the 1.26M public demonstrations used for SFT.
Disentangled perception and reasoning signals allow the alignment step to target distinct error types that compound during later RL.
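
For a sense of scale, the data budget quoted above can be checked with a few lines of arithmetic. The variable names are only illustrative; the two counts are the rounded figures given in the text.

```python
sft_demos = 1_260_000    # public demonstrations used for SFT (rounded, as quoted)
align_demos = 113_000    # curated demonstrations for the alignment stage (rounded)

total = sft_demos + align_demos
ratio = align_demos / sft_demos

print(total)             # 1373000
print(f"{ratio:.1%}")    # 9.0%
```

So the curated set is roughly one eleventh the size of the SFT corpus.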

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pre-alignment idea could be inserted into training pipelines for non-multimodal or non-reasoning tasks where SFT drift is known to occur.
The curation of dense visual-grounding and step-by-step demonstrations may be more important than raw quantity at the alignment stage, which would change how future supervision data is collected.
If the MoE discriminator can reliably separate perception from reasoning errors, it could also be repurposed as a diagnostic tool to filter or label existing multimodal datasets.

Load-bearing premise

The perception-reasoning MoE discriminator can generate useful corrective signals that steer the policy toward the supervision distribution using only response-level outputs and no access to teacher logits or internal model states.

What would settle it

An ablation that replaces the two-expert MoE discriminator with a single undifferentiated expert and then measures whether the reported accuracy gains after RLVR disappear on the same benchmarks and model sizes.

Figures

Figures reproduced from arXiv: 2604.28123 by Beier Zhu, Chaojun Xiao, Chen Chen, Chengwei Qin, Hehai Lin, Keming Wu, Sudong Wang, Weiquan Huang, Wenxuan Wang, Xiaomin Yu, Yunjian Zhang, Zuhao Yang.

**Figure 1.** Overview of the PRISM pipeline. (a) SFT introduces distributional drift between the policy …

**Figure 2.** Architecture of the distribution-alignment stage. An MoE discriminator with perception …

**Figure 3.** Training dynamics: reward gap (supervision …

**Figure 4.** Structural proxies of distribution alignment: rea…

**Figure 5.** Token efficiency comparison on MathVision, MathVerse, and MMMU-Pro (Qwen3-VL-4B).

**Figure 6.** System Prompt. This system prompt is shared across SFT, RL training, and benchmark evaluation to enforce the structured three-part output format (<caption>, <think>, <answer>).

**Figure 7.** Data Distillation Prompt. The full prompt used to query Gemini 3 Flash for generating high-quality multimodal reasoning demonstrations with explicit rules for visual extraction, reasoning traces, and concise answers.

**Figure 8.** Judge Model Prompt. The prompt used for LLM-as-judge evaluation, where Qwen3-30B-A3B-Instruct compares the model's extracted answer against the ground truth.

**Figure 9.** An example of a cold-start data sample.

**Figure 10.** An example of a cold-start data sample.

**Figure 11.** An example of our model inference result.

**Figure 12.** An example of our model inference result.

Original abstract

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts, providing disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits. While 1.26M public demonstrations suffice for broad SFT initialization, distribution alignment demands higher-fidelity supervision; we therefore curate 113K additional demonstrations from Gemini 3 Flash, featuring dense visual grounding and step-by-step reasoning on the hardest unsolved problems. Experiments on Qwen3-VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, improving average accuracy by +4.4 and +6.0 points over the SFT-to-RLVR baseline on 4B and 8B, respectively. Our code, data, and model checkpoints are publicly available at https://github.com/XIAO4579/PRISM.
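
The abstract names the adversarial game but does not state its objective. Purely as orientation, one textbook way to write a response-level game of this kind is the GAN-style minimax below. Everything in it is an assumption made for illustration: the symbols $\pi_\theta$ (policy), $D_\phi$ (discriminator), $p_{\mathrm{sup}}$ (supervision distribution), the expert scores $s_{\mathrm{per}}$ and $s_{\mathrm{rea}}$, and the gate $g$ are not the paper's notation, and the paper's actual losses may differ.

```latex
\min_{\theta}\,\max_{\phi}\;
  \mathbb{E}_{(x,\,y)\sim p_{\mathrm{sup}}}\bigl[\log D_{\phi}(x,y)\bigr]
  + \mathbb{E}_{x\sim p_{\mathrm{sup}}(x),\;\hat{y}\sim\pi_{\theta}(\cdot\mid x)}
    \bigl[\log\bigl(1 - D_{\phi}(x,\hat{y})\bigr)\bigr],
\qquad
D_{\phi}(x,y) = \sigma\bigl(g\,s_{\mathrm{per}}(x,y) + (1-g)\,s_{\mathrm{rea}}(x,y)\bigr),
\quad g\in[0,1].
```

Because $\hat{y}$ is sampled, the policy side of a game like this is usually optimized with a policy-gradient method that treats a function of $D_\phi(x,\hat{y})$ as a scalar reward. That is what keeps the setup black-box: only sampled responses are needed, not teacher logits.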

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PRISM adds a plausible alignment stage with a perception-reasoning MoE discriminator, but the reported gains are hard to credit to the method without an ablation on the extra 113K Gemini data.

The main thing to know is that this paper inserts an explicit on-policy distillation stage called PRISM between SFT and RLVR for large multimodal models. It uses a black-box adversarial game with a Mixture-of-Experts discriminator that has separate experts for perception and reasoning, and it reports average accuracy lifts of 4.4 and 6.0 points on 4B and 8B Qwen3-VL models across GRPO, DAPO, and GSPO. The public release of code, data, and checkpoints is a clear positive.

Referee Report

3 major / 1 minor

Summary. The paper proposes PRISM, a three-stage pipeline for large multimodal models that inserts an explicit black-box on-policy distillation alignment stage—using a response-level adversarial game against a Mixture-of-Experts discriminator with separate perception and reasoning experts—between standard SFT on 1.26M public demonstrations and subsequent RLVR. An additional 113K high-fidelity demonstrations are curated from Gemini 3 Flash (dense visual grounding and step-by-step reasoning on hard problems) specifically for the alignment stage. Experiments on Qwen3-VL 4B and 8B models report consistent gains of +4.4 and +6.0 average accuracy points over the SFT-to-RLVR baseline across GRPO, DAPO, and GSPO on diverse multimodal benchmarks, with code, data, and checkpoints released publicly.

Significance. If the reported gains can be isolated to the proposed alignment mechanism, PRISM would offer a practical, logit-free method for mitigating distributional drift and perception-reasoning compounding errors in multimodal RLVR pipelines. The public release of code, data, and model checkpoints is a clear strength that supports reproducibility and follow-up work.

major comments (3)

[Experiments] Experiments section: The central empirical claim attributes the +4.4 / +6.0 point lifts to the PRISM alignment stage. However, no ablation is presented that applies the 113K Gemini-curated demonstrations directly to the SFT initialization (or as additional SFT data before RLVR) without the black-box on-policy distillation game. This control is required to determine whether the gains arise from the adversarial MoE mechanism and its claimed disentangled signals or simply from the higher-fidelity supervision data.
[Method] Method section (PRISM pipeline description): The black-box on-policy distillation game is described at a high level, but the manuscript supplies no concrete details on (i) the loss used to train the perception and reasoning experts of the MoE discriminator, (ii) how the response-level adversarial objective is optimized, or (iii) the precise form of the corrective signals passed back to the policy. These omissions make it difficult to assess whether the claimed disentanglement is achieved in practice.
[Experiments] Experiments section: The reported average accuracy improvements are presented without stating the number of evaluation runs, random seeds, statistical significance tests, or per-benchmark breakdowns. In the absence of these details it is impossible to judge the reliability or robustness of the cross-algorithm and cross-model claims.
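
The controls requested in the major comments amount to a small experimental grid. The sketch below enumerates one possible version of it. The arm labels and the seed list are hypothetical (the manuscript states neither), while the model sizes and RL algorithms are the ones named in the report.

```python
from itertools import product

# Hypothetical arm labels: baseline, the data-only control the referee asks for,
# and the full method.
ARMS = ["sft_then_rlvr", "sft_plus_113k_then_rlvr", "prism"]
MODEL_SIZES = ["4B", "8B"]
RL_ALGOS = ["GRPO", "DAPO", "GSPO"]
SEEDS = [0, 1, 2]  # assumed; the paper does not state its seeds


def build_grid():
    """Return one config dict per (arm, model size, RL algorithm, seed) run."""
    return [
        {"arm": arm, "model": model, "algo": algo, "seed": seed}
        for arm, model, algo, seed in product(ARMS, MODEL_SIZES, RL_ALGOS, SEEDS)
    ]


grid = build_grid()
print(len(grid))  # 54 runs: 3 arms x 2 sizes x 3 algorithms x 3 seeds
```

Comparing the `sft_plus_113k_then_rlvr` arm against `prism` at matched model size, algorithm, and seed is what would separate the effect of the extra data from the effect of the alignment game.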

minor comments (1)

[Abstract] The abstract and method sections refer to 'diverse multimodal benchmarks' without enumerating them; a short list or table reference would improve immediate clarity.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We appreciate the positive assessment of the public release of code, data, and checkpoints. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: Experiments section: The central empirical claim attributes the +4.4 / +6.0 point lifts to the PRISM alignment stage. However, no ablation is presented that applies the 113K Gemini-curated demonstrations directly to the SFT initialization (or as additional SFT data before RLVR) without the black-box on-policy distillation game. This control is required to determine whether the gains arise from the adversarial MoE mechanism and its claimed disentangled signals or simply from the higher-fidelity supervision data.

Authors: We agree that this control experiment is necessary to isolate the contribution of the black-box on-policy distillation game from the effect of higher-fidelity data. In the revised manuscript we will add an ablation that incorporates the 113K Gemini-curated demonstrations directly into the SFT stage (or as additional SFT data) followed by standard RLVR, without the adversarial MoE alignment stage. This will be reported alongside the existing results to clarify the source of the observed gains. revision: yes

Referee: Method section (PRISM pipeline description): The black-box on-policy distillation game is described at a high level, but the manuscript supplies no concrete details on (i) the loss used to train the perception and reasoning experts of the MoE discriminator, (ii) how the response-level adversarial objective is optimized, or (iii) the precise form of the corrective signals passed back to the policy. These omissions make it difficult to assess whether the claimed disentanglement is achieved in practice.

Authors: We acknowledge that the current description is high-level. The concrete implementation details for the MoE discriminator losses, the optimization of the response-level adversarial objective, and the form of the corrective signals are present in the publicly released code. In the revised manuscript we will expand the Methods section to document these elements explicitly, including the training objectives for the perception and reasoning experts, the minimax optimization schedule, and the exact form of the disentangled signals passed to the policy. revision: yes

Referee: Experiments section: The reported average accuracy improvements are presented without stating the number of evaluation runs, random seeds, statistical significance tests, or per-benchmark breakdowns. In the absence of these details it is impossible to judge the reliability or robustness of the cross-algorithm and cross-model claims.

Authors: We agree that these experimental details are required to assess reliability. In the revised manuscript we will report the number of evaluation runs performed, the random seeds used, the outcomes of statistical significance tests (e.g., paired t-tests across seeds), and full per-benchmark tables that include means and standard deviations for all models and RL algorithms. revision: yes
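
The paired t-test mentioned in the last response can be sketched with the standard library alone. This is an illustration, not the authors' evaluation code: the function name is hypothetical, and the accuracy lists in the usage lines are placeholder values chosen for easy arithmetic, not results from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(baseline, treated):
    """Return (t statistic, degrees of freedom) for per-seed paired scores."""
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    # Pairing by seed removes seed-to-seed variation shared by both arms.
    diffs = [t - b for b, t in zip(baseline, treated)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Placeholder accuracies for three seeds (NOT numbers from the paper).
baseline_acc = [50.0, 51.0, 49.0]
treated_acc = [54.0, 56.0, 52.0]
t_stat, dof = paired_t_statistic(baseline_acc, treated_acc)
print(round(t_stat, 3), dof)  # 6.928 2
```

Turning the statistic into a p-value requires the Student t distribution; a library routine such as `scipy.stats.ttest_rel` returns both in one call.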

Circularity Check

0 steps flagged

No circularity: purely empirical pipeline with no derivation chain

The paper describes a three-stage empirical training pipeline (SFT initialization on 1.26M demos, followed by black-box on-policy distillation alignment using a perception-reasoning MoE discriminator on 113K curated Gemini examples, then RLVR) and reports measured accuracy gains on external multimodal benchmarks. No mathematical equations, first-principles derivations, or analytic identities are presented that could reduce the claimed +4.4/+6.0 point improvements to quantities defined by fitted parameters, self-referential definitions, or self-citation chains. The additional high-fidelity data is explicitly part of the PRISM method rather than a hidden input; results are concrete training outcomes rather than predictions forced by construction. No self-definitional steps, fitted-input predictions, or ansatzes smuggled via citation appear in the abstract or method description. The work is therefore self-contained against its reported benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on the effectiveness of a newly introduced MoE discriminator architecture and the assumption that 113K curated demonstrations from Gemini 3 Flash provide sufficiently higher-fidelity supervision for alignment. No explicit free parameters are fitted inside a derivation; the main dependencies are architectural choices and data curation decisions.

free parameters (1)

Curated alignment dataset size
113K additional demonstrations were selected and curated from Gemini 3 Flash specifically for the alignment stage; the exact count and selection criteria are chosen by hand rather than derived.

axioms (2)

domain assumption: A response-level adversarial game between policy and MoE discriminator can produce disentangled corrective signals for perception and reasoning without access to teacher logits.
Invoked to justify the black-box on-policy distillation stage.
domain assumption: The additional 113K demonstrations constitute higher-fidelity supervision than the 1.26M public demonstrations used for SFT.
Stated as the reason the alignment stage requires separate curation.

invented entities (1)

Mixture-of-Experts discriminator with dedicated perception and reasoning experts (no independent evidence)
purpose: To supply disentangled corrective signals during the alignment stage
New architectural component introduced in the PRISM pipeline; no independent external evidence for its effectiveness is provided beyond the reported experiments.

pith-pipeline@v0.9.0 · 5646 in / 1876 out tokens · 82053 ms · 2026-05-07T07:38:02.491032+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 7.0

ParaVT is a parallel video tool-calling RL framework that resolves the Tool Prior Paradox via PARA-GRPO, delivering +7.9% average gains on six long-video benchmarks and raising format compliance from 0.13 to 0.64.

ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 6.0

ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve +7.9% average improvement over Qwen3-VL baseline across six b...

Reference graph

Works this paper leans on

4 extracted references · 1 canonical work page · cited by 1 Pith paper

[4] MMR1: Enhancing multimodal reasoning with variance-aware sampling and open resources. arXiv preprint arXiv:2509.21268, 2025.
