TrajTok: Learning Trajectory Tokens enables better Video Understanding

Ashutosh Kumar; Chenhao Zheng; Chun-Liang Li; Jianing Zhang; Jieyu Zhang; Oncel Tuzel; Quan Kong; Ranjay Krishna; Weikai Huang

arxiv: 2602.22779 · v3 · pith:O5FPJH7Inew · submitted 2026-02-26 · 💻 cs.CV


Chenhao Zheng, Jieyu Zhang, Jianing Zhang, Weikai Huang, Ashutosh Kumar, Quan Kong, Oncel Tuzel, Chun-Liang Li, Ranjay Krishna


Pith reviewed 2026-05-15 19:09 UTC · model grok-4.3

classification 💻 cs.CV

keywords video tokenization · trajectory tokens · video understanding · CLIP model · end-to-end tokenizer · object trajectories · token efficiency · video retrieval


The pith

TrajTok learns trajectory tokens end-to-end through implicit space-time clustering to improve video model accuracy and efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard video tokenization by patchifying every frame produces too many redundant tokens that limit scalability. TrajTok replaces this with an end-to-end tokenizer whose unified segmenter clusters pixels across space and time in one forward pass to output object trajectories whose count adapts to semantic complexity rather than video length. The module trains jointly with the downstream task, favoring adaptability over pixel-level accuracy. This produces TrajViT2, a from-scratch video CLIP model that records the highest accuracy on classification and retrieval benchmarks while using compute comparable to leading token-merging approaches. The same component also functions as a probing head or vision-language alignment connector with strong results on long-video reasoning.

Core claim

TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, the tokenizer dynamically adjusts token granularity to semantic complexity independent of video duration, enabling a video CLIP model trained from scratch to reach the best accuracy at scale across classification and retrieval benchmarks while matching the efficiency of the best token-merging methods.

What carries the argument

Unified segmenter that performs implicit clustering over pixels in space and time to produce object trajectories.
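
The step from cluster assignments to tokens can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the segmenter emits one score per space-time position and candidate trajectory (the hypothetical `logits`), uses a hard argmax assignment, and replaces the learned trajectory encoder with plain mean pooling. The function name and all inputs are illustrative.

```python
from typing import List


def trajectory_tokens(features: List[List[float]],
                      logits: List[List[float]]) -> List[List[float]]:
    """Pool per-position features into one token per non-empty cluster.

    features: N x C, one feature vector per space-time position (N = T*H*W).
    logits:   N x K, hypothetical segmenter scores for K candidate trajectories.
    """
    num_slots = len(logits[0])
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_slots)]
    counts = [0] * num_slots
    for feat, scores in zip(features, logits):
        # Hard assignment: each position joins its highest-scoring trajectory.
        k = max(range(num_slots), key=lambda i: scores[i])
        counts[k] += 1
        for c in range(dim):
            sums[k][c] += feat[c]
    # Keep only trajectories that claimed at least one position (mean pooling).
    return [[s / counts[k] for s in sums[k]]
            for k in range(num_slots) if counts[k] > 0]


# Toy input: 4 space-time positions, 2 feature channels, 3 candidate trajectories.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
logits = [[5.0, 0.0, 0.0], [4.0, 1.0, 0.0], [0.0, 0.0, 3.0], [0.0, 1.0, 2.0]]
tokens = trajectory_tokens(features, logits)
print(tokens)  # [[2.0, 0.0], [0.0, 3.0]]
```

In this toy setup the unused middle slot yields no token, and repeating the same positions (a longer clip of the same scene) leaves the number of tokens unchanged, which is the property the pith attributes to trajectory tokens.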

If this is right

Video models can handle longer sequences without token count growing with duration.
Accuracy on classification and retrieval improves while compute stays comparable to token-merging methods.
The same tokenizer module works as a probing head on frozen visual features.
It functions as an alignment connector inside vision-language models for long-video reasoning.
Token count becomes independent of video length and scales with scene complexity instead.
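
The scaling claim in this list can be made concrete with simple arithmetic. The sketch below compares per-frame patchification against a fixed number of trajectories; the 224x224 resolution, 16-pixel patches, and 20 trajectories are illustrative assumptions, not figures from the paper.

```python
def patch_token_count(frames: int, height: int, width: int, patch: int) -> int:
    """Tokens produced by patchifying every frame independently."""
    return frames * (height // patch) * (width // patch)


def trajectory_token_count(num_trajectories: int,
                           tokens_per_trajectory: int = 1) -> int:
    """Tokens when each trajectory contributes a fixed number of tokens."""
    return num_trajectories * tokens_per_trajectory


# Assumed setup: 224x224 frames, 16x16 patches, 20 object trajectories.
for frames in (16, 64, 256):
    print(frames,
          patch_token_count(frames, 224, 224, 16),
          trajectory_token_count(20))
# 16 3136 20
# 64 12544 20
# 256 50176 20
```

Under these assumptions the patch count grows linearly with the number of frames, while the trajectory count changes only if the scene gains or loses objects.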

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Trajectory tokens could support more consistent temporal modeling in video generation or prediction tasks.
The single-pass clustering approach may reduce latency in real-time video analysis pipelines.
Integration of TrajTok-style adapters with audio or text streams could improve multimodal temporal alignment.

Load-bearing premise

Implicit clustering of pixels in space and time will produce trajectories that remain semantically useful for downstream video tasks when the segmenter trains only for task adaptability rather than pixel-level fidelity.

What would settle it

A controlled experiment in which a standard patch-based video model or token-merging baseline outperforms TrajViT2 on large-scale video classification and retrieval benchmarks at matched compute and model size.

Figures

Figures reproduced from arXiv: 2602.22779 by Ashutosh Kumar, Chenhao Zheng, Chun-Liang Li, Jianing Zhang, Jieyu Zhang, Oncel Tuzel, Quan Kong, Ranjay Krishna, Weikai Huang.

**Figure 1.** (a) Traditional video tokenization splits a video into space-time patches, introducing a large number of redundant tokens. (b) …

**Figure 2.** Overview of the TrajTok architecture. TrajTok comprises a trajectory segmenter and a trajectory encoder. The segmenter proposes trajectory masks for all objects in an image or video within a single forward pass. The encoder then aggregates raw video pixels or encoded visual features (parameterized by f in the figure) according to these masks to produce trajectory tokens. The number of tokens per trajectory…

**Figure 3.** Training with downstream understanding tasks reshapes the segmentation granularity. We visualize the trajectory masks produced by our segmenter when trained with only segmentation supervision versus jointly with segmentation and CLIP objectives. The CLIP objective reshapes the segmentation granularity, producing finer foreground object masks while merging background regions.

**Figure 4.** TrajTok is a versatile module applicable across pretraining, feature adaptation, and finetuning stages. We demonstrate its use in three scenarios: TrajViT2, which trains a visual encoder from scratch; TrajAdapter, which adapts pretrained features for downstream tasks; and TrajVLM, which uses TrajTok as a connector in LLaVA-style large vision–language models.

**Figure 6.** Test time FLOPs comparison under different frame numbers.

**Figure 7.** Trajectory masks produced by our segmenter vs. by TrajViT pipeline. While our segmenter produces coarser masks and may miss very small objects, it demonstrates strong semantic grouping ability that is sufficient for downstream understanding tasks.

**Figure 8.** VideoQA results for TrajTok applied to a large vision-language model. The VLM with TrajTok as connector (TrajVLM) notably outperforms the patch-pooling baseline (PatchVLM) on long-video benchmarks, while performance is mixed on the short-video benchmark.

Adjacent ablation table fragment (truncated in the source), columns Module, Variation, VEQ (%), STQ (%), Retrieval (R@5): Default Architecture: 42.3, 70.1, 22.1. Backbone, no hierarchical features: 39.3 (↓ 3.0), 66.2 (↓ 3.9), 19.2 (↓ 2.9). Outp…

**Figure 9.** Qualitative Examples of the trajectory masks produced by our segmenter.

Original abstract

Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

TrajTok's end-to-end trajectory tokenizer is a clear design step forward, but the performance claims rest on experiments that the abstract does not show.


The main point is that TrajTok builds a tokenizer that produces object trajectories through implicit space-time pixel clustering in one forward pass and trains it jointly with the downstream video model. This removes the need for separate external segmentation and tracking pipelines that earlier trajectory approaches required. The design lets token count depend on semantic complexity rather than video length, and the authors show it can slot in as a probing head or VLM connector as well as a tokenizer. They train a video CLIP model called TrajViT2 from scratch and state that it reaches top accuracy on classification and retrieval while matching the efficiency of strong token-merging baselines. That combination of integration and versatility is the concrete advance over prior work.

The paper does a reasonable job laying out the architecture and the motivation for co-training the segmenter for adaptability instead of pixel-level fidelity. The idea that a lightweight module can adapt its granularity on the fly is worth testing.

The soft spots are more substantial than minor. The abstract contains no numbers, no ablations, and no direct checks on whether the clusters actually correspond to coherent object trajectories. Without those details it is difficult to know if the reported gains come from better tokens or from other modeling choices. The assumption that downstream-only optimization will produce semantically useful trajectories could fail if the clusters latch onto low-level motion or noise instead. That risk is exactly what the stress-test note flags, and the lack of trajectory-quality metrics leaves it open.

This paper is aimed at researchers working on efficient video transformers and long-form understanding. Anyone trying to cut token counts without losing accuracy would find the unified segmenter description useful to build on.

It deserves a serious referee because the core idea is distinct from existing pipelines and the problem it targets is real, even though the experiments need close scrutiny on the actual benchmark results and ablations.

Referee Report

2 major / 1 minor

Summary. The paper proposes TrajTok, an end-to-end trainable video tokenizer module containing a unified segmenter that performs implicit space-time pixel clustering to produce object trajectories in a single forward pass. TrajTok is co-trained with downstream video models to adapt token granularity to semantic complexity rather than video duration, avoiding external segmentation pipelines. The authors implement this in TrajViT2, a video CLIP model trained from scratch, claiming state-of-the-art accuracy on classification and retrieval benchmarks at scale with efficiency comparable to token-merging methods. TrajTok is also shown as a versatile component in TrajAdapter for probing pretrained features and TrajVLM as an alignment connector for long-video reasoning.

Significance. If the empirical performance claims hold with proper verification, TrajTok could advance efficient video understanding by enabling adaptive, task-aware tokenization without heavy external dependencies, potentially improving scalability for long videos while maintaining or boosting accuracy over patch-based or merging baselines.

major comments (2)

[Abstract] Abstract: the central claim that TrajViT2 'achieves the best accuracy at scale across both classification and retrieval benchmarks' is presented without any quantitative results, tables, or specific benchmark numbers, preventing verification of the reported gains over token-merging methods.
[Method] TrajTok method description: the unified segmenter is optimized solely for downstream adaptability via implicit clustering, but no ablations or trajectory-quality metrics (e.g., against ground-truth tracks) are referenced to confirm that the resulting tokens preserve object-level semantics rather than low-level motion patterns; this directly bears on whether the accuracy advantage holds.

minor comments (1)

[Abstract] Abstract: the phrasing 'maintaining efficiency comparable to the best token-merging methods' would benefit from explicit FLOPs or token-count comparisons even at a high level.
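
The second major comment asks for trajectory-quality metrics against ground-truth tracks without naming one. As an illustration only, the sketch below computes a space-time IoU between predicted trajectories and ground-truth tracks, matching each track to its best-overlapping prediction. The choice of metric, the set-of-voxels representation, and all names are assumptions; neither the paper nor the report specifies them.

```python
from typing import List, Set, Tuple

Voxel = Tuple[int, int, int]  # (frame, row, col)


def spacetime_iou(pred: Set[Voxel], truth: Set[Voxel]) -> float:
    """Intersection over union of two space-time masks given as voxel sets."""
    union = pred | truth
    if not union:
        return 0.0  # two empty masks: treated as no overlap (a convention)
    return len(pred & truth) / len(union)


def mean_best_iou(preds: List[Set[Voxel]], truths: List[Set[Voxel]]) -> float:
    """Average, over ground-truth tracks, of the best IoU with any prediction."""
    if not truths:
        raise ValueError("need at least one ground-truth track")
    best = [max((spacetime_iou(p, g) for p in preds), default=0.0)
            for g in truths]
    return sum(best) / len(best)


# Toy example: two ground-truth tracks over two frames.
truths = [{(0, 0, 0), (1, 0, 1)}, {(0, 1, 1), (1, 1, 1)}]
# Two predicted trajectories: one slightly too large, one missing a frame.
preds = [{(0, 0, 0), (1, 0, 1), (1, 0, 0)}, {(0, 1, 1)}]
score = mean_best_iou(preds, truths)
print(round(score, 4))  # 0.5833
```

A score like this would separate the two cases the referee distinguishes: clusters that follow objects over time score high, while clusters that track low-level motion or noise overlap poorly with any ground-truth track.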

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and verifiability where appropriate.


Referee: [Abstract] Abstract: the central claim that TrajViT2 'achieves the best accuracy at scale across both classification and retrieval benchmarks' is presented without any quantitative results, tables, or specific benchmark numbers, preventing verification of the reported gains over token-merging methods.

Authors: We agree that the abstract would benefit from explicit numbers to support immediate verification. In the revised version, we will add key quantitative results (e.g., top-1 accuracy on Kinetics-400 and retrieval mAP gains relative to token-merging baselines) directly into the abstract, with references to the corresponding tables.
revision: yes

Referee: [Method] TrajTok method description: the unified segmenter is optimized solely for downstream adaptability via implicit clustering, but no ablations or trajectory-quality metrics (e.g., against ground-truth tracks) are referenced to confirm that the resulting tokens preserve object-level semantics rather than low-level motion patterns; this directly bears on whether the accuracy advantage holds.

Authors: The segmenter is deliberately optimized for downstream task performance rather than explicit segmentation fidelity, as described in Section 3. We provide supporting ablations in Section 4 showing consistent accuracy and efficiency gains over patch-based and merging baselines. While direct quantitative metrics against ground-truth tracks are not included (the method prioritizes adaptability over pixel-level accuracy), qualitative trajectory visualizations and the observed downstream improvements indicate capture of semantic object-level patterns. We will expand the discussion of this design choice in the revision.
revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in TrajTok derivation chain


The paper presents TrajTok as a novel end-to-end trainable video tokenizer module that performs implicit space-time pixel clustering via a unified segmenter, co-trained directly for downstream video understanding objectives. No equations, derivations, or self-citations are shown that reduce the claimed performance gains (e.g., TrajViT2 accuracy on classification/retrieval) to quantities defined by the method's own fitted parameters or inputs by construction. The approach is described as a new architectural component evaluated empirically on standard benchmarks, with efficiency claims tied to token reduction rather than tautological redefinitions. This is a standard empirical method paper without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that a lightweight implicit clustering module can generate task-useful trajectories without external supervision or pixel-perfect accuracy; no free parameters or invented physical entities are described in the abstract.

axioms (1)

domain assumption: Implicit space-time clustering of pixels produces object trajectories that are semantically meaningful for video understanding when optimized for downstream performance rather than segmentation fidelity.
Invoked to justify the unified segmenter design and its prioritization of adaptability.

pith-pipeline@v0.9.0 · 5570 in / 1231 out tokens · 45280 ms · 2026-05-15T19:09:34.106646+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity...

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt
Relation between the paper passage and the cited Recognition theorem: unclear

We use a combination of Dice loss and Focal loss... to prioritize the discovery of all object regions over strict pixel-level class accuracy.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.