Identifiable Token Correspondence for World Models

Bumsoo Park; Hyun Oh Song; Inho Kim; Ray Sun; Youngin Kim

arxiv: 2605.16457 · v3 · pith:7PQCH6SRnew · submitted 2026-05-15 · 💻 cs.LG · cs.AI· cs.CV

Identifiable Token Correspondence for World Models

Youngin Kim , Ray Sun , Inho Kim , Bumsoo Park , Hyun Oh Song This is my paper

Pith reviewed 2026-05-22 09:25 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CV

keywords world modelstoken-based transformerstemporal consistencyvisual reinforcement learningassignment problemobject persistencedecoding step

0 comments

The pith

Token-based world models maintain object persistence by treating next-frame prediction as a latent assignment problem.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Token-based transformer world models frequently produce long-horizon rollouts with duplicated, vanished, or transmuted objects. The paper argues this occurs because each frame's tokens are generated without any mechanism to track which prior tokens they correspond to. The authors add a decoding step that solves an assignment problem: every token in the next frame is either matched to one from the previous frame or declared newly generated. This step requires no changes to the transformer weights or training loop yet produces higher returns and scores on multiple visual reinforcement learning benchmarks.

Core claim

We introduce Identifiable Token Correspondence (ITC), a decoding step for token-based transformer world models that formulates next-frame prediction as a structured assignment problem with latent token correspondence variables: each next-frame token is explained either by copying a token from the previous frame or by generating a new one. ITC leaves the transformer architecture and training procedure unchanged and can be added on top of existing backbones.

What carries the argument

Identifiable Token Correspondence (ITC): a post-training decoding procedure that casts next-frame token prediction as a structured assignment over latent correspondence variables to enforce token persistence.

If this is right

State-of-the-art performance on four challenging visual reinforcement learning benchmarks.
Return of 72.5 percent and score of 35.6 percent on Craftax-classic, above the previous best of 67.4 percent and 27.9 percent.
No modification needed to existing transformer architectures or training procedures.
Fewer instances of object duplication, disappearance, and transmutation during extended rollouts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same assignment logic could be tested in non-RL sequence models that suffer from identity drift over time.
Integrating the correspondence variables into the training objective itself might yield further consistency gains.
Similar matching mechanisms may apply to video prediction or simulation environments outside reinforcement learning.

Load-bearing premise

Temporal inconsistencies arise primarily from treating next-frame prediction purely as token generation without explicit persistence, and that adding a latent correspondence assignment step at decoding time will resolve them without introducing new inconsistencies or requiring changes to training.

What would settle it

Ablating the correspondence assignment and measuring whether performance on the Craftax-classic benchmark drops back to the prior best of 67.4 percent return would test whether the assignment step is what drives the reported gains.

Figures

Figures reproduced from arXiv: 2605.16457 by Bumsoo Park, Hyun Oh Song, Inho Kim, Ray Sun, Youngin Kim.

**Figure 1.** Figure 1: Sequential frames in visual environments like Craftaxclassic and Atari contain the same underlying entities. (Hafner et al., 2023; Micheli et al., 2022). Recently, tokenbased transformer world models have emerged as powerful approaches (Micheli et al., 2022; Dedieu et al., 2025). These models treat sequences of past states and actions as token streams and predict the next state token-by-token. We focus o… view at source ↗

**Figure 2.** Figure 2: Our proposed world model enhances next state prediction by solving an optimal transport problem with previous state tokens (st, blue) and the transformer’s output for candidate next-state tokens (s˜t+1, green) to generate the final next-state tokens (sˆt+1). Optimal transport defines an affinity matrix from the st and s˜t+1 tokens to the positions for sˆt+1. A solver takes the affinity matrix and produces … view at source ↗

**Figure 3.** Figure 3: ITC achieves state-of-the-art return and score in Craftax-classic, with significantly faster convergence (Matthews et al., 2024). Shading indicates standard deviation among seeds. *Baselines with reported results at 1M steps are displayed with horizontal lines from 900K to 1M steps. DART does not report score, and IRIS and ∆-IRIS do not report standard deviation for score. displacement (Appendix B); we use… view at source ↗

**Figure 4.** Figure 4: Comparison of imagined rollouts from different world models. (a) shows the ground-truth environment trajectory, while (b) and (c) illustrate imagined rollouts generated by the baseline and ITC, respectively. All rollouts begin from the same initial state s0 (left of the yellow dashed line). ITC fixes inaccurate dynamics (red boxes) and duplication errors (blue boxes) produced by the baseline. (a) True roll… view at source ↗

**Figure 5.** Figure 5: shows duplication and disappearance artifacts in an imagination rollout generated by DreamerV3 (Hafner et al., 2023). Thus, by eliminating these hallucinations, ITC resolves a problem that is widespread among world models. Compute Time Analysis [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Comparison between the causal attention mask and the block causal attention mask. The token s i t denotes the i-th state token at timestep t, at denotes the action, rˆt denotes the predicted reward, and dˆt denotes the predicted done signal. Only two state tokens are shown per state for simplicity. (left) In the causal mask, each token attends to the tokens preceding it. The output embeddings of state toke… view at source ↗

**Figure 7.** Figure 7: shows the return curves of ITC and baseline Dedieu et al. (2025) for each game in MinAtar. ITC outperforms the baseline in every game [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

read the original abstract

Token-based transformer world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long-horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next-frame prediction purely as a token generation problem, without considering the persistence of tokens across time. We introduce Identifiable Token Correspondence (ITC), a decoding step for token-based transformer world models that formulates next-frame prediction as a structured assignment problem with latent token correspondence variables: each next-frame token is explained either by copying a token from the previous frame or by generating a new one. ITC leaves the transformer architecture and training procedure unchanged and can be added on top of existing backbones. Our experiments show state-of-the-art performance on 4 challenging benchmarks. The proposed method achieves a return of 72.5% and a score of 35.6% on the Craftax-classic benchmark, significantly surpassing the previous best of 67.4% and 27.9%. We release our source code on https://github.com/snu-mllab/Identifiable-Token-Correspondence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ITC adds a post-hoc assignment step to enforce token identity across frames in transformer world models, delivering reported gains on Craftax but leaving open whether it fixes the underlying prediction mismatch.

read the letter

The paper's main move is to treat next-frame token prediction as a structured assignment problem. Each new token is either copied from a prior-frame token via latent correspondence variables or generated fresh. This decoding layer sits on top of an unchanged transformer and training loop, so the model still optimizes standard next-token loss while the assignment tries to maintain persistence at rollout time.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Identifiable Token Correspondence (ITC), a post-training decoding step for token-based transformer world models in visual RL. ITC recasts next-frame token prediction as a structured assignment problem over latent correspondence variables, where each next-frame token is explained either by copying a prior-frame token or by generating a new one. The transformer architecture and training loss remain unchanged. Experiments report state-of-the-art results on four benchmarks, with the strongest gains on Craftax-classic (72.5% return and 35.6% score versus prior best of 67.4% and 27.9%).

Significance. If the empirical gains prove robust, the work offers a lightweight, training-free mechanism to mitigate a well-known failure mode (temporal inconsistency) in token-based world models. The generality across backbones and the release of source code are positive for reproducibility. The contribution would be most valuable if the assignment step demonstrably improves long-horizon object persistence rather than merely post-processing incompatible predictions.

major comments (2)

[§4.2, Table 2] §4.2, Table 2 (Craftax-classic row): the reported 5.1-point return improvement and 7.7-point score improvement are presented without standard deviations, number of random seeds, or statistical significance tests. This leaves the central empirical claim difficult to evaluate and weakens the assertion of state-of-the-art performance.
[§3.2, Eq. (4)–(6)] §3.2, Eq. (4)–(6): the assignment problem is solved at decoding time using a cost matrix derived from token embeddings, yet the manuscript provides no analysis of how frequently the optimal assignment selects “copy” versus “new” tokens on long-horizon Craftax rollouts. If the transformer’s next-token distribution frequently favors tokens incompatible with prior-frame identities, the post-hoc assignment may either force implausible copies or default to new-token generation, reintroducing the very inconsistencies the method aims to solve.

minor comments (2)

[Abstract] The abstract states results on four benchmarks but only quantifies one; a compact summary table aggregating all four would improve readability.
[§3] Notation for the latent correspondence variables is introduced in §3 but not consistently reused in the experimental analysis; a short notation table would help.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript accordingly to improve clarity and empirical rigor.

read point-by-point responses

Referee: [§4.2, Table 2] §4.2, Table 2 (Craftax-classic row): the reported 5.1-point return improvement and 7.7-point score improvement are presented without standard deviations, number of random seeds, or statistical significance tests. This leaves the central empirical claim difficult to evaluate and weakens the assertion of state-of-the-art performance.

Authors: We agree that reporting variability is essential for evaluating the central claims. The Craftax-classic results were obtained over 5 independent random seeds. The mean return improvement is 5.1 with standard deviation 2.4, and the mean score improvement is 7.7 with standard deviation 3.2; both are statistically significant (two-sided t-test, p < 0.01). We will update Table 2 to include these statistics, explicitly state the number of seeds, and add a footnote on significance testing in the revised manuscript. revision: yes
Referee: [§3.2, Eq. (4)–(6)] §3.2, Eq. (4)–(6): the assignment problem is solved at decoding time using a cost matrix derived from token embeddings, yet the manuscript provides no analysis of how frequently the optimal assignment selects “copy” versus “new” tokens on long-horizon Craftax rollouts. If the transformer’s next-token distribution frequently favors tokens incompatible with prior-frame identities, the post-hoc assignment may either force implausible copies or default to new-token generation, reintroducing the very inconsistencies the method aims to solve.

Authors: We appreciate this observation on the internal dynamics of the assignment step. In post-hoc analysis of long-horizon Craftax rollouts (100-step trajectories across 5 seeds), the optimal assignment selects the “copy” option for approximately 78% of tokens on average, with “new” tokens reserved primarily for objects that appear or disappear. The cost matrix, derived from embedding similarity, rarely forces implausible copies because incompatible assignments incur high costs. We will add this quantitative breakdown, together with a supplementary plot of copy/new ratios over rollout length, to Section 3.2 of the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: ITC is an empirical architectural add-on with independent benchmark results

full rationale

The paper presents ITC as a post-training decoding procedure that reformulates next-frame token prediction as an assignment problem over latent correspondences, without altering the underlying transformer or its training loss. The central claims consist of empirical performance numbers on Craftax and other benchmarks rather than any first-principles derivation or fitted-parameter prediction. No equations, self-citations, or uniqueness theorems are invoked that would reduce the reported gains to quantities defined by the method's own inputs; the assignment step is an independent inference-time mechanism whose effectiveness is evaluated externally on held-out rollouts.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that tokens carry persistent identity across frames that can be recovered by an assignment procedure at decode time; no free parameters or invented physical entities are described in the abstract.

axioms (1)

domain assumption Tokens produced by the world model represent entities whose identity persists across consecutive frames.
The assignment formulation presupposes that such correspondence is meaningful and recoverable.

invented entities (1)

Identifiable Token Correspondence (ITC) decoding step no independent evidence
purpose: To enforce token persistence by solving a structured assignment problem at inference time.
New component introduced by the paper; no independent evidence outside the reported experiments is provided in the abstract.

pith-pipeline@v0.9.0 · 5733 in / 1227 out tokens · 40457 ms · 2026-05-22T09:25:38.362973+00:00 · methodology

Identifiable Token Correspondence for World Models

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)