TokenLearner: What can 8 learned tokens do for images and videos?

2021 · arXiv 2106.11297

7 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals

cs.AI · 2026-04-17 · unverdicted · novelty 7.0

Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to retain 96.9% accuracy at 63% FLOPs reduction on ViT-Large ImageNet-1K.

FlowNar: Scalable Streaming Narration for Long-Form Videos

cs.CV · 2026-05-30 · unverdicted · novelty 6.0

FlowNar achieves bounded memory and 3x higher throughput for streaming narration on Ego4D, EgoExo4D, and EpicKitchens100 by combining dynamic historical context removal with a Cross Linear Attentive Memory module.

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Head Similarity extends identity recognition to structured whole-head similarity by capturing intra-identity appearance variations via hierarchical supervision on a weakly-labeled video benchmark.

One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory

cs.CV · 2025-05-29 · unverdicted · novelty 6.0

TrajViT tokenizes videos via panoptic sub-object trajectories, achieving 10x token reduction and outperforming ViT3D by 6% on retrieval and 5.2% on VideoQA tasks with faster training and inference.

PaLM-E: An Embodied Multimodal Language Model

cs.LG · 2023-03-06 · conditional · novelty 6.0

PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive transfer from joint training on language and robotics data.

Florence: A New Foundation Model for Computer Vision

cs.CV · 2021-11-22 · unverdicted · novelty 6.0

Florence is a new vision foundation model that learns universal visual-language representations from web-scale data and reports state-of-the-art results on 44 benchmarks including 83.74% zero-shot ImageNet top-1 accuracy.

DinoLink: A Token-Centric Representation Compression Framework for Bandwidth-Constrained Collaborative V2X Perception

cs.CV · 2026-06-24 · unverdicted · novelty 4.0

DinoLink uses saliency-aware token pruning plus residual vector quantization to cut V2X bitrate by 139x while reporting 32.8% mAP on nuScenes.

