Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL
Pith reviewed 2026-05-07 07:38 UTC · model grok-4.3
The pith
Inserting a black-box on-policy distillation stage after SFT corrects distributional drift and raises final multimodal RL accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PRISM is a three-stage pipeline that places an on-policy distillation alignment stage between SFT and RLVR. In this stage the policy plays a black-box adversarial game against a perception-reasoning MoE discriminator that returns disentangled corrective signals at the response level, steering the policy toward the supervision distribution without any access to teacher logits or internal states. The resulting policy, when passed to RLVR, yields higher final accuracy on multimodal reasoning tasks.
What carries the argument
Black-box on-policy distillation game against a Mixture-of-Experts discriminator whose perception expert and reasoning expert separately score responses and return disentangled corrective signals.
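One way to picture response-level scoring with two experts is the minimal sketch below. It is an illustration under stated assumptions, not the paper's implementation: the names `perception_score`, `reasoning_score`, and `combine`, the fixed gate weights, and the keyword heuristics standing in for learned experts are all hypothetical. The only properties it mirrors from the description above are that each expert sees just the response text (no teacher logits) and that the two scores remain separately inspectable.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    perception: float  # score from the (hypothetical) perception expert, in [0, 1]
    reasoning: float   # score from the (hypothetical) reasoning expert, in [0, 1]
    combined: float    # gate-weighted mix usable as one response-level reward


def perception_score(response: str) -> float:
    # Toy stand-in for a learned expert: counts explicit visual-grounding cues.
    cues = ("in the image", "the figure shows", "the chart")
    text = response.lower()
    return min(1.0, sum(cue in text for cue in cues) / 2)


def reasoning_score(response: str) -> float:
    # Toy stand-in for a learned expert: counts step-by-step markers.
    return min(1.0, response.lower().count("step") / 3)


def combine(response: str, gate=(0.5, 0.5)) -> Signal:
    # Assumed combination rule: a fixed convex mix of the two expert scores.
    p = perception_score(response)
    r = reasoning_score(response)
    return Signal(p, r, gate[0] * p + gate[1] * r)
```

Keeping `perception` and `reasoning` as separate fields, rather than returning only `combined`, is what would let a training loop react differently to a grounding failure than to a reasoning failure.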
If this is right
- After RLVR, the aligned policy improves average accuracy over the SFT-to-RLVR baseline by 4.4 points on the 4B model and 6.0 points on the 8B model, across GRPO, DAPO, and GSPO.
- The gains appear consistently across diverse multimodal benchmarks when the same RL algorithms are used.
- Only 113K additional high-fidelity demonstrations are required for the alignment stage beyond the 1.26M public demonstrations used for SFT.
- Disentangled perception and reasoning signals allow the alignment step to target distinct error types that compound during later RL.
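To put the data requirement in proportion, the short calculation below compares the 113K alignment demonstrations with the 1.26M SFT demonstrations. Both counts come from the abstract; the variable names are only illustrative.

```python
SFT_DEMOS = 1_260_000    # public demonstrations used for SFT
ALIGN_DEMOS = 113_000    # curated demonstrations for the alignment stage

ratio = ALIGN_DEMOS / SFT_DEMOS
share_of_total = ALIGN_DEMOS / (SFT_DEMOS + ALIGN_DEMOS)

print(f"alignment data relative to SFT data: {ratio:.1%}")              # 9.0%
print(f"alignment data as share of all demonstrations: {share_of_total:.1%}")  # 8.2%
```

The alignment stage therefore adds roughly 9% more demonstrations on top of the SFT corpus.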
Where Pith is reading between the lines
- The same pre-alignment idea could be inserted into training pipelines for non-multimodal or non-reasoning tasks where SFT drift is known to occur.
- The curation of dense visual-grounding and step-by-step demonstrations may be more important than raw quantity at the alignment stage, which would change how future supervision data is collected.
- If the MoE discriminator can reliably separate perception from reasoning errors, it could also be repurposed as a diagnostic tool to filter or label existing multimodal datasets.
Load-bearing premise
The perception-reasoning MoE discriminator can generate useful corrective signals that steer the policy toward the supervision distribution using only response-level outputs and no access to teacher logits or internal model states.
What would settle it
An ablation that replaces the two-expert MoE discriminator with a single undifferentiated expert and then measures whether the reported accuracy gains after RLVR disappear on the same benchmarks and model sizes.
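If such an ablation were run, its outcome could be read with a decomposition like the sketch below. The function `split_gain` and its argument names are hypothetical, and the paper reports no single-expert numbers, so every input would have to come from new experiments.

```python
def split_gain(baseline: float, single_expert: float, two_expert: float) -> dict:
    """Split the gain over the SFT-to-RLVR baseline into two parts (accuracy points).

    baseline      -- average accuracy of SFT-to-RLVR without an alignment stage
    single_expert -- average accuracy with one undifferentiated discriminator
    two_expert    -- average accuracy with the perception/reasoning MoE discriminator
    """
    return {
        "total": two_expert - baseline,
        # Gain that survives without disentangled experts.
        "alignment_only": single_expert - baseline,
        # Gain attributable to separating perception from reasoning.
        "disentangling": two_expert - single_expert,
    }
```

A `disentangling` value near zero would suggest that the alignment stage helps but the two-expert design is not what carries the improvement.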
Original abstract
The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts, providing disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits. While 1.26M public demonstrations suffice for broad SFT initialization, distribution alignment demands higher-fidelity supervision; we therefore curate 113K additional demonstrations from Gemini 3 Flash, featuring dense visual grounding and step-by-step reasoning on the hardest unsolved problems. Experiments on Qwen3-VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, improving average accuracy by +4.4 and +6.0 points over the SFT-to-RLVR baseline on 4B and 8B, respectively. Our code, data, and model checkpoints are publicly available at https://github.com/XIAO4579/PRISM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PRISM, a three-stage pipeline for large multimodal models that inserts an explicit black-box on-policy distillation alignment stage—using a response-level adversarial game against a Mixture-of-Experts discriminator with separate perception and reasoning experts—between standard SFT on 1.26M public demonstrations and subsequent RLVR. An additional 113K high-fidelity demonstrations are curated from Gemini 3 Flash (dense visual grounding and step-by-step reasoning on hard problems) specifically for the alignment stage. Experiments on Qwen3-VL 4B and 8B models report consistent gains of +4.4 and +6.0 average accuracy points over the SFT-to-RLVR baseline across GRPO, DAPO, and GSPO on diverse multimodal benchmarks, with code, data, and checkpoints released publicly.
Significance. If the reported gains can be isolated to the proposed alignment mechanism, PRISM would offer a practical, logit-free method for mitigating distributional drift and perception-reasoning compounding errors in multimodal RLVR pipelines. The public release of code, data, and model checkpoints is a clear strength that supports reproducibility and follow-up work.
major comments (3)
- [Experiments] Experiments section: The central empirical claim attributes the +4.4 / +6.0 point lifts to the PRISM alignment stage. However, no ablation is presented that applies the 113K Gemini-curated demonstrations directly to the SFT initialization (or as additional SFT data before RLVR) without the black-box on-policy distillation game. This control is required to determine whether the gains arise from the adversarial MoE mechanism and its claimed disentangled signals or simply from the higher-fidelity supervision data.
- [Method] Method section (PRISM pipeline description): The black-box on-policy distillation game is described at a high level, but the manuscript supplies no concrete details on (i) the loss used to train the perception and reasoning experts of the MoE discriminator, (ii) how the response-level adversarial objective is optimized, or (iii) the precise form of the corrective signals passed back to the policy. These omissions make it difficult to assess whether the claimed disentanglement is achieved in practice.
- [Experiments] Experiments section: The reported average accuracy improvements are presented without stating the number of evaluation runs, random seeds, statistical significance tests, or per-benchmark breakdowns. In the absence of these details it is impossible to judge the reliability or robustness of the cross-algorithm and cross-model claims.
minor comments (1)
- [Abstract] The abstract and method sections refer to 'diverse multimodal benchmarks' without enumerating them; a short list or table reference would improve immediate clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We appreciate the positive assessment of the public release of code, data, and checkpoints. We address each major comment below and commit to revisions that directly respond to the concerns raised.
- Referee: Experiments section: The central empirical claim attributes the +4.4 / +6.0 point lifts to the PRISM alignment stage. However, no ablation is presented that applies the 113K Gemini-curated demonstrations directly to the SFT initialization (or as additional SFT data before RLVR) without the black-box on-policy distillation game. This control is required to determine whether the gains arise from the adversarial MoE mechanism and its claimed disentangled signals or simply from the higher-fidelity supervision data.
  Authors: We agree that this control experiment is necessary to isolate the contribution of the black-box on-policy distillation game from the effect of higher-fidelity data. In the revised manuscript we will add an ablation that incorporates the 113K Gemini-curated demonstrations directly into the SFT stage (or as additional SFT data) followed by standard RLVR, without the adversarial MoE alignment stage. This will be reported alongside the existing results to clarify the source of the observed gains.
  Revision: yes
- Referee: Method section (PRISM pipeline description): The black-box on-policy distillation game is described at a high level, but the manuscript supplies no concrete details on (i) the loss used to train the perception and reasoning experts of the MoE discriminator, (ii) how the response-level adversarial objective is optimized, or (iii) the precise form of the corrective signals passed back to the policy. These omissions make it difficult to assess whether the claimed disentanglement is achieved in practice.
  Authors: We acknowledge that the current description is high-level. The concrete implementation details for the MoE discriminator losses, the optimization of the response-level adversarial objective, and the form of the corrective signals are present in the publicly released code. In the revised manuscript we will expand the Methods section to document these elements explicitly, including the training objectives for the perception and reasoning experts, the minimax optimization schedule, and the exact form of the disentangled signals passed to the policy.
  Revision: yes
- Referee: Experiments section: The reported average accuracy improvements are presented without stating the number of evaluation runs, random seeds, statistical significance tests, or per-benchmark breakdowns. In the absence of these details it is impossible to judge the reliability or robustness of the cross-algorithm and cross-model claims.
  Authors: We agree that these experimental details are required to assess reliability. In the revised manuscript we will report the number of evaluation runs performed, the random seeds used, the outcomes of statistical significance tests (e.g., paired t-tests across seeds), and full per-benchmark tables that include means and standard deviations for all models and RL algorithms.
  Revision: yes
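For readers unfamiliar with the paired t-test mentioned in the last response, here is a minimal standard-library sketch of the statistic. The function name `paired_t` and the assumption that each seed yields one baseline accuracy and one PRISM accuracy are illustrative; the authors have not yet specified their evaluation protocol.

```python
import math
import statistics


def paired_t(baseline: list[float], treated: list[float]) -> tuple[float, int]:
    """Return (t statistic, degrees of freedom) for a paired t-test.

    Each index is assumed to be one matched pair, e.g. the same random seed
    evaluated with and without the alignment stage.
    """
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [t - b for b, t in zip(baseline, treated)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_d == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return mean_d / (sd_d / math.sqrt(n)), n - 1
```

Turning the statistic into a p-value needs the t-distribution with the returned degrees of freedom, which a statistics library would normally supply. With only a handful of seeds the test has little power, so per-benchmark means and standard deviations remain worth reporting alongside it.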
Circularity Check
No circularity: purely empirical pipeline with no derivation chain
The paper describes a three-stage empirical training pipeline (SFT initialization on 1.26M demos, followed by black-box on-policy distillation alignment using a perception-reasoning MoE discriminator on 113K curated Gemini examples, then RLVR) and reports measured accuracy gains on external multimodal benchmarks. No mathematical equations, first-principles derivations, or analytic identities are presented that could reduce the claimed +4.4/+6.0 point improvements to quantities defined by fitted parameters, self-referential definitions, or self-citation chains. The additional high-fidelity data is explicitly part of the PRISM method rather than a hidden input; results are concrete training outcomes rather than predictions forced by construction. No self-definitional steps, fitted-input predictions, or ansatzes smuggled via citation appear in the abstract or method description. The work is therefore self-contained against its reported benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Curated alignment dataset size
axioms (2)
- Domain assumption: A response-level adversarial game between policy and MoE discriminator can produce disentangled corrective signals for perception and reasoning without access to teacher logits.
- Domain assumption: The additional 113K demonstrations constitute higher-fidelity supervision than the 1.26M public demonstrations used for SFT.
invented entities (1)
- Mixture-of-Experts discriminator with dedicated perception and reasoning experts (no independent evidence)
Forward citations
Cited by 2 Pith papers
- ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
  ParaVT is a parallel video tool-calling RL framework that resolves the Tool Prior Paradox via PARA-GRPO, delivering +7.9% average gains on six long-video benchmarks and raising format compliance from 0.13 to 0.64.
- ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
  ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve +7.9% average improvement over Qwen3-VL baseline across six b...