Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration

Bibo Cai; Bing Qin; Hepeng Wang; Jinglong Gao; Kai Xiong; Li Du; Ting Liu; Xiao Ding; Yangou Ouyang; Yang Zhao

arxiv: 2601.07224 · v2 · submitted 2026-01-12 · 💻 cs.AI · cs.LG

Yang Zhao, Yangou Ouyang, Xiao Ding, Hepeng Wang, Bibo Cai, Kai Xiong, Jinglong Gao, Zhouhao Sun, Li Du, Bing Qin, Ting Liu

Pith reviewed 2026-05-16 15:35 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords PRISM · gradient concentration · SFT · RL · LLM agents · data allocation · Schema Theory · hybrid training

The pith

PRISM routes each data point to SFT or RL by measuring whether its gradients concentrate spatially, matching data to the right optimization regime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PRISM as a way to decide data allocation between supervised fine-tuning and reinforcement learning when training LLM agents. It treats SFT as consolidation of existing patterns and RL as structural adaptation, and uses the spatial geometry of gradients to detect how much a sample conflicts with what the model already knows. Samples whose gradients concentrate in a narrow region are routed to RL because they demand exploration and change; samples that produce diffuse updates go to SFT for efficient imitation. On WebShop and ALFWorld this routing is reported to outperform prior hybrid schedules while reducing total compute by a factor of up to 3.22.

Core claim

PRISM is a dynamics-aware framework grounded in Schema Theory that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge. By analyzing the spatial geometric structure of gradients, PRISM identifies data triggering high spatial concentration as high-conflict signals that require RL for structural restructuring. In contrast, data yielding diffuse updates is routed to SFT for efficient consolidation.

What carries the argument

Spatial geometric structure of gradients, where high concentration signals cognitive conflict requiring RL adaptation and diffuse spread signals suitability for SFT consolidation.
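
The text reviewed here does not state the paper's concentration metric, so the following is only an illustrative sketch of how such a routing rule could look. It assumes a hypothetical score (the inverse participation ratio of gradient energy across parameter blocks) and a hypothetical threshold of 0.5; the function names, the score, and the threshold are all assumptions, not the authors' definitions.

```python
from typing import Sequence


def concentration(block_norms: Sequence[float]) -> float:
    """Inverse participation ratio of gradient energy across parameter blocks.

    Hypothetical stand-in for the paper's metric. It equals 1/n when the
    energy is spread evenly over n blocks and 1.0 when it sits in one block.
    """
    energy = [g * g for g in block_norms]
    total = sum(energy)
    if total == 0.0:
        return 0.0  # no gradient signal; treated here as non-concentrated
    shares = [e / total for e in energy]
    return sum(p * p for p in shares)


def route(block_norms: Sequence[float], threshold: float = 0.5) -> str:
    """Assumed rule: concentrated samples go to RL, diffuse ones to SFT."""
    return "RL" if concentration(block_norms) >= threshold else "SFT"


diffuse = [1.0, 1.0, 1.0, 1.0]   # energy spread over four blocks
peaked = [10.0, 0.1, 0.1, 0.1]   # energy almost entirely in one block
print(route(diffuse), route(peaked))
```

Any other sparsity measure (a Gini coefficient, top-k energy share) would fit the same interface; which one, if any, matches the paper is exactly what the referee's minor comment on notation asks the authors to pin down.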

If this is right

Outperforms state-of-the-art hybrid SFT-then-RL methods on WebShop and ALFWorld.
Reduces computational cost by a factor of up to 3.22 through targeted data routing.
Prevents optimization interference by aligning each sample with the functional role of SFT or RL.
Supports more scalable agent alignment by disentangling data according to internal optimization regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gradient-concentration test could be used to decide data order or phase in other multi-stage training pipelines beyond SFT and RL.
If concentration truly tracks conflict, the metric might help detect when a model is ready for new tasks in continual-learning settings.
The approach suggests that future data-selection methods should inspect optimization geometry rather than rely only on surface statistics.

Load-bearing premise

The spatial geometric structure of gradients reliably indicates the degree of cognitive conflict with the model's existing knowledge.

What would settle it

An experiment that shows data labeled high-conflict by gradient concentration performs better or equally well under SFT alone, or that low-concentration data improves more under RL, would falsify the routing rule.
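
The falsification test described above amounts to a 2×2 comparison: concentration group (high or low) against training regime (SFT or RL). A minimal sketch of the decision logic follows, assuming per-run task scores have already been collected for all four cells; the function name, the cell keys, and the numbers are hypothetical placeholders, not results from the paper.

```python
from statistics import mean
from typing import Dict, List, Tuple

Cell = Tuple[str, str]  # (concentration group, training regime)


def routing_rule_falsified(scores: Dict[Cell, List[float]]) -> bool:
    """True if a crossed condition does as well as, or better than, the routed one.

    `scores` maps (group, regime) to task scores from separate runs, with
    group in {"high", "low"} and regime in {"SFT", "RL"}.
    """
    avg = {cell: mean(vals) for cell, vals in scores.items()}
    # High-concentration data doing at least as well under SFT as under RL.
    high_ok_under_sft = avg[("high", "SFT")] >= avg[("high", "RL")]
    # Low-concentration data improving more under RL than under SFT.
    low_better_under_rl = avg[("low", "RL")] > avg[("low", "SFT")]
    return high_ok_under_sft or low_better_under_rl


# Made-up numbers for illustration only.
toy = {
    ("high", "SFT"): [0.40, 0.42],
    ("high", "RL"): [0.55, 0.57],
    ("low", "SFT"): [0.70, 0.72],
    ("low", "RL"): [0.60, 0.62],
}
print(routing_rule_falsified(toy))
```

A comparison of raw means is the simplest possible version; a real test would also need enough runs per cell to separate a genuine reversal from seed noise.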

Original abstract

While Hybrid Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has become the standard paradigm for training LLM agents, effective mechanisms for data allocation between these stages remain largely underexplored. Current data arbitration strategies often rely on surface-level heuristics that fail to diagnose intrinsic learning needs. Since SFT targets pattern consolidation through imitation while RL drives structural adaptation via exploration, misaligning data with these functional roles causes severe optimization interference. We propose PRISM, a dynamics-aware framework grounded in Schema Theory that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge. By analyzing the spatial geometric structure of gradients, PRISM identifies data triggering high spatial concentration as high-conflict signals that require RL for structural restructuring. In contrast, data yielding diffuse updates is routed to SFT for efficient consolidation. Extensive experiments on WebShop and ALFWorld demonstrate that PRISM achieves a Pareto improvement, outperforming state-of-the-art hybrid methods while reducing computational costs by up to 3.22×. Our findings suggest that disentangling data based on internal optimization regimes is crucial for scalable and robust agent alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PRISM uses gradient spatial concentration to route data between SFT and RL, but the unproven link to Schema Theory and the lack of controls leave the gains hard to trust.

The new part here is PRISM's use of gradient spatial concentration to split data: high concentration routes to RL for structural adaptation, diffuse gradients to SFT for consolidation, all tied to Schema Theory. It targets a practical gap in hybrid LLM agent training where people usually rely on surface heuristics for allocation. The experiments claim Pareto gains over prior hybrids on WebShop and ALFWorld plus up to 3.22x lower compute, which is the kind of result that could matter for real pipelines if it holds.

Referee Report

3 major / 2 minor

Summary. The paper proposes PRISM, a dynamics-aware framework for arbitrating data between SFT and RL stages when training LLM agents. Grounded in Schema Theory, it routes examples to RL when gradients show high spatial concentration (interpreted as high cognitive conflict requiring structural adaptation) and to SFT when gradients are diffuse (interpreted as suitable for consolidation). Experiments on WebShop and ALFWorld report that PRISM outperforms state-of-the-art hybrid methods while achieving up to 3.22× computational cost reduction.

Significance. If the gradient spatial concentration metric reliably distinguishes consolidation from adaptation needs, the approach could improve training efficiency for LLM agents by reducing interference between imitation and exploration objectives. The reported Pareto gains and cost savings would be notable for scalable agent alignment, especially if the geometric criterion generalizes beyond the two evaluated environments.

major comments (3)

[§3.2] The routing rule equates high spatial concentration of gradients with high cognitive conflict requiring RL, yet no derivation is supplied showing why this particular geometric property (rather than magnitude, variance, or optimization trajectory length) encodes conflict degree as defined in Schema Theory.
[§4.2] The experiments report 3.22× cost reduction and Pareto improvement but omit ablations that replace the concentration metric with random routing or gradient-magnitude routing while holding total SFT/RL data volume fixed; without these controls it is unclear whether gains stem from the proposed geometric criterion.
[§4.3] Performance tables lack statistical significance tests, standard errors across random seeds, or multiple-run averages, so the consistency of the claimed outperformance over hybrid baselines cannot be assessed.
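
The controls requested for §4.2 can be sketched as three routing arms that share the same SFT/RL cardinalities and differ only in the ranking signal. The helper names and the per-sample scores below are hypothetical; the paper's actual concentration and magnitude values are not available in the text reviewed here.

```python
import random
from typing import List, Sequence, Tuple

Split = Tuple[List[int], List[int]]  # (RL indices, SFT indices)


def split_top_k(scores: Sequence[float], n_rl: int) -> Split:
    """Send the n_rl highest-scoring sample indices to RL, the rest to SFT."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:n_rl]), sorted(order[n_rl:])


def random_split(n: int, n_rl: int, seed: int = 0) -> Split:
    """Control arm: choose n_rl indices uniformly at random."""
    rng = random.Random(seed)
    rl = set(rng.sample(range(n), n_rl))
    return sorted(rl), [i for i in range(n) if i not in rl]


# Placeholder per-sample scores, for illustration only.
conc = [0.9, 0.2, 0.7, 0.1, 0.3, 0.8]   # assumed concentration scores
mag = [1.0, 5.0, 0.5, 4.0, 3.0, 0.2]    # assumed gradient magnitudes
n_rl = 3

arms = {
    "concentration": split_top_k(conc, n_rl),
    "magnitude": split_top_k(mag, n_rl),
    "random": random_split(len(conc), n_rl, seed=1),
}
for name, (rl, sft) in arms.items():
    print(name, "RL:", rl, "SFT:", sft)
```

Because every arm sends exactly `n_rl` samples to RL, any performance gap between arms can be attributed to which samples were chosen rather than to how many.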

minor comments (2)

[§3.1] Notation for the concentration metric is introduced without an explicit equation; adding a numbered definition would improve reproducibility.
[§4] The abstract states 'extensive experiments' but §4 provides limited detail on exact baseline implementations and hyperparameter matching.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the theoretical grounding, experimental controls, and statistical reporting in the manuscript.

read point-by-point responses

Referee: [§3.2] The routing rule equates high spatial concentration of gradients with high cognitive conflict requiring RL, yet no derivation is supplied showing why this particular geometric property (rather than magnitude, variance, or optimization trajectory length) encodes conflict degree as defined in Schema Theory.

Authors: We appreciate the call for greater theoretical precision. The spatial-concentration criterion is motivated by Schema Theory’s distinction between assimilation (low-conflict consolidation via SFT) and accommodation (high-conflict structural adaptation via RL). Concentrated gradients indicate that updates are localized to a small subspace, which we interpret as evidence that the model must restructure its internal schema rather than merely reinforce existing patterns. While the current manuscript presents this link interpretively rather than through a formal derivation, we will expand §3.2 with additional geometric intuition drawn from the optimization landscape and will include a short appendix deriving why concentration (as opposed to magnitude or variance) better captures the need for accommodation under the schema-theoretic framing.
revision: yes

Referee: [§4.2] The experiments report 3.22× cost reduction and Pareto improvement but omit ablations that replace the concentration metric with random routing or gradient-magnitude routing while holding total SFT/RL data volume fixed; without these controls it is unclear whether gains stem from the proposed geometric criterion.

Authors: We agree that isolating the contribution of the concentration metric requires these controls. In the revised manuscript we will add two new ablation tables (random routing and gradient-magnitude routing) that keep the exact SFT/RL data cardinalities identical to those used by PRISM. These results will be reported alongside the original Pareto curves and cost figures so that readers can directly assess whether the geometric criterion, rather than allocation volume alone, drives the observed gains.
revision: yes

Referee: [§4.3] Performance tables lack statistical significance tests, standard errors across random seeds, or multiple-run averages, so the consistency of the claimed outperformance over hybrid baselines cannot be assessed.

Authors: We acknowledge that the current tables report single-run point estimates. We will re-run all experiments with five independent random seeds, report mean performance with standard errors, and add paired t-test p-values comparing PRISM against each hybrid baseline. The revised tables will appear in §4.3 and the appendix will contain the full per-seed results.
revision: yes
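
The reporting promised in the last response (mean, standard error, and a paired test over five seeds) can be sketched with the standard library alone. The per-seed scores below are made-up placeholders, not numbers from the paper, and the function names are assumptions.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_and_se(xs: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and standard error of the mean."""
    return mean(xs), stdev(xs) / sqrt(len(xs))


def paired_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for per-seed scores a vs b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 seeds")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0.0:
        raise ValueError("identical per-seed differences; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs))), len(diffs) - 1


# Made-up per-seed success rates, for illustration only.
prism = [0.62, 0.64, 0.61, 0.65, 0.63]
baseline = [0.58, 0.59, 0.60, 0.60, 0.58]

m, se = mean_and_se(prism)
t, df = paired_t(prism, baseline)
print(f"PRISM mean={m:.3f} se={se:.3f}; paired t={t:.2f} with df={df}")
```

With five seeds there are four degrees of freedom, so the two-sided 5% critical value is about 2.776. Turning the statistic into a p-value needs the t-distribution CDF, which `scipy.stats.ttest_rel(a, b)` provides if SciPy is available.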

Circularity Check

1 step flagged

PRISM equates gradient spatial concentration with Schema-Theory cognitive conflict by definitional assertion

specific steps

self-definitional [Abstract]
"We propose PRISM, a dynamics-aware framework grounded in Schema Theory that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge. By analyzing the spatial geometric structure of gradients, PRISM identifies data triggering high spatial concentration as high-conflict signals that require RL for structural restructuring. In contrast, data yielding diffuse updates is routed to SFT for efficient consolidation."

The 'degree of cognitive conflict' is not derived from Schema Theory or measured independently; it is stipulated to be identical to the spatial concentration property of the gradients. Consequently the routing decision (high concentration to RL, diffuse to SFT) is true by the paper's own definitional mapping rather than by any demonstrated causal or predictive link.

full rationale

The paper's central mechanism claims to arbitrate SFT vs RL data by measuring 'degree of cognitive conflict' via gradient geometry, but the abstract directly identifies high spatial concentration as the high-conflict signal without an independent derivation or external benchmark linking the two. This reduces the routing rule to a re-labeling of the chosen metric. No equations, fitted parameters called predictions, or self-citation chains are visible in the provided text that would raise the score further; the circularity is limited to the load-bearing interpretive step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that Schema Theory supplies a valid mapping from gradient geometry to cognitive conflict; no free parameters or invented entities are stated in the abstract.

axioms (1)

domain assumption: Schema Theory supplies a valid mapping from gradient spatial concentration to degree of cognitive conflict with existing knowledge.
The framework is explicitly grounded in Schema Theory for arbitrating data between consolidation and adaptation.

pith-pipeline@v0.9.0 · 5531 in / 1228 out tokens · 59480 ms · 2026-05-16T15:35:43.881307+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

By analyzing the spatial geometric structure of gradients, PRISM identifies data triggering high spatial concentration as high-conflict signals that require RL for structural restructuring. In contrast, data yielding diffuse updates is routed to SFT for efficient consolidation.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

We propose PRISM, a dynamics-aware framework grounded in Schema Theory that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
cs.CL 2026-04 accept novelty 5.0

LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.