
arXiv: 2604.05112 · v1 · submitted 2026-04-06 · cs.LG · cs.AI · cs.RO

Vintix II: Decision Pre-Trained Transformer is a Scalable In-Context Reinforcement Learner

Pith reviewed 2026-05-10 18:39 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.RO
keywords: in-context reinforcement learning · decision pre-trained transformer · flow matching · algorithm distillation · generalist agents · multi-domain environments

The pith

Decision Pre-Trained Transformer scales to hundreds of tasks and generalizes to unseen reinforcement learning problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends the Decision Pre-Trained Transformer (DPT) to diverse multi-domain environments, training it with Flow Matching. The resulting agent shows clear generalization gains on a held-out test set of tasks. A sympathetic reader would care because this suggests that in-context reinforcement learning (ICRL) can produce generalist agents that pick up new problems at inference time, as an alternative to expert distillation. The agent improves on scaled Algorithm Distillation (AD) in both online and offline inference.

Core claim

We obtain an agent trained across hundreds of diverse tasks that achieves clear gains in generalization to the held-out test set. This agent improves upon prior AD scaling and demonstrates stronger performance in both online and offline inference, reinforcing ICRL as a viable alternative to expert distillation for training generalist agents.

What carries the argument

The Decision Pre-Trained Transformer with Flow Matching, which enables scaling while preserving the interpretation as Bayesian posterior sampling.
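
To make the mechanism concrete, here is a minimal sketch of a conditional flow-matching objective and an Euler sampler in plain Python. It is not the paper's implementation. It assumes the straight-line path between Gaussian noise and the target action, which is one common reading of the "OT" flow-matching head mentioned in the Figure 1 caption. The names `velocity_fn` and `context` are hypothetical: `context` stands in for the transformer's hidden representation, and `velocity_fn` for the learned head.

```python
import random


def path_point(x0, x1, t):
    """Point at time t on the straight line from noise x0 to target action x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, context, optimal_action, rng):
    """One-sample Monte Carlo estimate of the conditional flow-matching loss.

    velocity_fn(x, t, context) is a hypothetical stand-in for the learned head.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in optimal_action]  # noise sample
    t = rng.random()                                     # time in [0, 1)
    xt = path_point(x0, optimal_action, t)
    # The straight-line path has constant velocity x1 - x0.
    target = [b - a for a, b in zip(x0, optimal_action)]
    pred = velocity_fn(xt, t, context)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)


def sample_action(velocity_fn, context, dim, rng, steps=10):
    """Draw an action by Euler-integrating dx/dt = v(x, t | context) from t=0 to t=1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt, context)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def oracle_velocity(x, t, context):
    """Toy check only: exact velocity field when the target is the single point `context`."""
    return [(a - xi) / (1.0 - t) for a, xi in zip(context, x)]
```

With `oracle_velocity`, the loss is zero up to rounding and `sample_action` lands on the target point, which is a quick sanity check of the path and the integrator. In a trained model the target distribution is not a point mass, so repeated sampling would yield a spread of actions that narrows as the context becomes more informative.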

Load-bearing premise

The environments selected for training and testing are diverse enough to evaluate true generalization instead of the model simply learning patterns common to the training tasks.

What would settle it

If the trained Decision Pre-Trained Transformer shows no improvement or performs worse than the scaled Algorithm Distillation baseline on the held-out tasks, the claim of superior generalization would be false.

Figures

Figures reproduced from arXiv: 2604.05112 by Albina Klepach, Aleksandr Serkov, Alexander Derevyagin, Alexander Nikulin, Alexey Zemtsov, Andrei Polubarov, Artyom Grishin, Daniil Tikhonov, Igor Saprygin, Ilya Zisman, Lyubaykin Nikita, Maksim Zhdanov, Mark Averchenko, Vladislav Kurenkov.

Figure 1: Approach Overview. Stage 1: A dataset collected using the noise-distillation technique (Zisman et al., 2024), which covers suboptimal regions of the state space, is relabeled with the demonstrator’s optimal actions. Stage 2: During training, the flow-matching (OT) head is conditioned on hidden representations from the Decision Pretrained Transformer (Lee et al., 2023). Stage 3: At inference, actions are d…
Figure 2: Online inference on training tasks. Although the model starts with no context, it infers relevant task-specific information for self-correction. All runs are conducted with four different random seeds, and the IQM of normalized scores is reported. […] our focus is on measuring the ability to match demonstrator performance rather than on absolute score differences. We do not compare our model with RL2 (Duan et a…
Figure 3: Domain-level normalized scores. (Left) Evaluation results for 46 tasks unseen during training. (Right) Evaluation results for 209 training tasks. Offline runs of our model were conducted with a prompt size of 2500 transitions, compared to 5000 transitions for Vintix, and 25 and 100 episodes for REGENT on testing and training tasks, respectively. Domain-level aggregation is performed using IQM.
Figure 4: Action beliefs over context sizes. Action distributions are shown for different prompt sizes during offline evaluation, ranging from no context to 10, 100, and 500 transitions of task-specific demonstrator data. Projections into 2D space are obtained using Truncated SVD. The gradual decrease in the distributions’ entropy indicates that our model exhibits behavior consistent with posterior sampling.
Figure 5: Normalized returns vs. number of demonstrations. Offline evaluation is conducted on 46 held-out tasks with task-specific prompts of varying size. Results for Vintix and REGENT are reported on the corresponding domains. IQM aggregation is applied across 4 random seeds. […] On Meta-World, our model scales slightly faster than REGENT; however, it is constrained by a context length of 4096, which limits co…
Figure 6: Action beliefs over context sizes for other tasks in Meta-World
Figure 7: Action beliefs over context sizes for other tasks in SinerGym
Figure 8: Action beliefs over context sizes for other tasks in MetaDrive
Figure 9: Action beliefs over context sizes for other tasks in CityLearn
Figure 10: Mean normalized noise-distilled trajectories for Meta-World domain
Figure 11: Mean normalized noise-distilled trajectories for Kinetix domain
Figure 12: Mean normalized noise-distilled trajectories for Bi-DexHands domain
Figure 13: Mean normalized noise-distilled trajectories for MetaDrive domain
Figure 14: Mean normalized noise-distilled trajectories for Industrial-Benchmark domain
Figure 15: Mean normalized noise-distilled trajectories for ControlGym domain
Figure 16: Mean normalized noise-distilled trajectories for MuJoCo domain
Figure 17: Mean normalized noise-distilled trajectories for SinerGym domain
Figure 18: Mean normalized noise-distilled trajectories for CityLearn domain
Figure 19: Mean normalized noise-distilled trajectories for HumEnv domain
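
Several captions above report the IQM (interquartile mean) of normalized scores. As a reading aid, here is a small sketch of one common way to compute it: sort the scores, drop the lowest and highest quarter, and average the rest. The trimming convention (rounding the cut down to a whole number of values per side) is an assumption and may differ from the authors' aggregation code.

```python
def interquartile_mean(values):
    """Mean of the middle 50% of values (drops the lowest and highest 25%)."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))  # values trimmed from each end, rounded down
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)
```

Compared with a plain mean, the IQM is less sensitive to a single task or seed with an extreme score, which matters when scores from many heterogeneous tasks are pooled.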
Original abstract

Recent progress in in-context reinforcement learning (ICRL) has demonstrated its potential for training generalist agents that can acquire new tasks directly at inference. Algorithm Distillation (AD) pioneered this paradigm and was subsequently scaled to multi-domain settings, although its ability to generalize to unseen tasks remained limited. The Decision Pre-Trained Transformer (DPT) was introduced as an alternative, showing stronger in-context reinforcement learning abilities in simplified domains, but its scalability had not been established. In this work, we extend DPT to diverse multi-domain environments, applying Flow Matching as a natural training choice that preserves its interpretation as Bayesian posterior sampling. As a result, we obtain an agent trained across hundreds of diverse tasks that achieves clear gains in generalization to the held-out test set. This agent improves upon prior AD scaling and demonstrates stronger performance in both online and offline inference, reinforcing ICRL as a viable alternative to expert distillation for training generalist agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper extends the Decision Pre-Trained Transformer (DPT) to multi-domain environments by training on hundreds of diverse tasks using Flow Matching, which is argued to preserve its interpretation as Bayesian posterior sampling. The central claim is that the resulting agent, termed Vintix II, achieves clear generalization gains on held-out test tasks, outperforming prior scaled Algorithm Distillation (AD) baselines in both online and offline in-context reinforcement learning inference.

Significance. If the empirical results are supported by rigorous controls, this would advance ICRL as a scalable alternative to expert distillation for generalist agents, with the Flow Matching choice providing a useful theoretical anchor. The work builds on prior DPT and AD efforts by demonstrating multi-domain scaling, but its impact hinges on whether the held-out gains reflect true in-context adaptation to novel MDPs rather than shared distributional patterns.

major comments (2)
  1. [Experiments] Experimental section: The abstract and results assert 'clear gains in generalization to the held-out test set' and improvement over AD scaling, yet no quantitative metrics (e.g., dynamics KL divergence, trajectory distribution distances, or state-action space overlap) are supplied to characterize how the held-out tasks differ from the training distribution. This is load-bearing for the generalization claim, as unquantified task diversity leaves open the possibility that performance reflects interpolation rather than scalable ICRL.
  2. [§3] §3 (Method): While Flow Matching is presented as a natural choice that preserves the Bayesian posterior sampling view, the manuscript does not provide an explicit derivation or reference showing how the multi-domain training objective maps to posterior inference over new MDPs, including any biases introduced by the transformer parameterization or the specific flow-matching loss at scale.
minor comments (3)
  1. [Abstract] The abstract references 'hundreds of diverse tasks' without specifying the exact count, domain breakdown, or selection criteria; adding a table or paragraph with these details would improve reproducibility.
  2. [§2] Notation for the in-context RL setup (e.g., how context trajectories are formatted for the transformer) is introduced without a clear reference to prior DPT work; a brief recap equation would aid readers.
  3. [Figures] Figure captions for performance plots should include error bars or confidence intervals and specify the number of evaluation seeds to allow assessment of result stability.
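
One way to act on the request for confidence intervals in minor comment 3 is a percentile bootstrap over per-seed scores. The sketch below is illustrative only: the function name, defaults, and the choice of the mean as the default statistic are assumptions, not something taken from the paper.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]
```

With only four seeds per run, as the figure captions report, such an interval is coarse, so pooling across tasks before resampling may be more informative than per-task intervals.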

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and indicate where revisions will be made to strengthen the presentation of generalization results and the theoretical grounding.

Point-by-point responses
  1. Referee: [Experiments] Experimental section: The abstract and results assert 'clear gains in generalization to the held-out test set' and improvement over AD scaling, yet no quantitative metrics (e.g., dynamics KL divergence, trajectory distribution distances, or state-action space overlap) are supplied to characterize how the held-out tasks differ from the training distribution. This is load-bearing for the generalization claim, as unquantified task diversity leaves open the possibility that performance reflects interpolation rather than scalable ICRL.

    Authors: We agree that explicit quantitative characterization of train-test distributional differences would strengthen the generalization claims. The current manuscript describes training across hundreds of tasks from multiple distinct domains with evaluation on held-out tasks drawn from environments not seen during training. To address the concern directly, we will revise the experimental section (and add to the appendix) with metrics including dynamics KL divergence (where MDP parameters permit), trajectory distribution distances (e.g., via Wasserstein or MMD on state-action sequences), and state-action space overlap statistics. These additions will help quantify the degree of novelty and support the distinction between interpolation and scalable in-context adaptation.

    revision: yes

  2. Referee: [§3] §3 (Method): While Flow Matching is presented as a natural choice that preserves the Bayesian posterior sampling view, the manuscript does not provide an explicit derivation or reference showing how the multi-domain training objective maps to posterior inference over new MDPs, including any biases introduced by the transformer parameterization or the specific flow-matching loss at scale.

    Authors: The Bayesian posterior sampling interpretation for DPT follows from the original DPT formulation, where the model learns to sample from the posterior over policies given context. Flow Matching is adopted because it provides a simulation-free objective for learning the conditional vector field that aligns with this view. For the multi-domain extension, the objective aggregates context-policy pairs across diverse tasks to enable generalization. We acknowledge that the manuscript lacks an explicit derivation mapping the multi-domain objective to posterior inference over novel MDPs and does not analyze biases from the transformer architecture or the flow-matching loss at scale. In revision we will add a concise derivation sketch in §3 (with further details in the appendix), including references to flow-matching literature on conditional generation and a discussion of potential biases and limitations.

    revision: yes
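
The train-test distance metrics promised in the first response could be computed along the following lines. This is a minimal sketch of a squared maximum mean discrepancy (MMD) estimate with an RBF kernel, written in plain Python. The inputs are assumed to be lists of fixed-length feature vectors (for example, concatenated state-action pairs); the bandwidth value and function names are hypothetical choices, not the authors'.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The estimate is zero when the two samples coincide and grows as they separate. Its cost is quadratic in the number of samples, so subsampling transitions per task would likely be needed at the scale described here, and the result depends on the bandwidth, which is usually worth reporting alongside the metric.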

Circularity Check

0 steps flagged

No circularity: empirical scaling claims rest on held-out evaluation, not self-referential definitions or fitted inputs.

Full rationale

The paper's central claims are empirical: a Decision Pre-Trained Transformer is trained on hundreds of tasks using Flow Matching and evaluated on a held-out test set, with reported gains over prior AD scaling in both online and offline settings. No derivation chain is presented that reduces a prediction or first-principles result to its own inputs by construction. The choice of Flow Matching is described as preserving a Bayesian interpretation but is not used to derive the generalization result tautologically; performance metrics are measured directly on unseen tasks. Citations to the original DPT work (Lee et al., 2023) are present but serve only as background for the method being scaled, not as load-bearing justification for the new multi-domain results. The held-out evaluation provides an external benchmark independent of the training procedure's fitted parameters.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract supplies minimal technical detail; the main assumption is that Flow Matching preserves the Bayesian interpretation at scale.

axioms (1)
  • domain assumption: Flow Matching is a natural training choice that preserves DPT's interpretation as Bayesian posterior sampling
    Explicitly stated in the abstract as justification for the training method.

pith-pipeline@v0.9.0 · 5533 in / 1087 out tokens · 34979 ms · 2026-05-10T18:39:35.796175+00:00 · methodology


