Elastictok: Adaptive tokenization for image and video

Elastictok: Adaptive tokenization for image, video , author= · 2024 · arXiv 2410.08368

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

read on arXiv browse 12 citing papers

citation-role summary

background 2 baseline 1

citation-polarity summary

background 1 baseline 1 support 1

representative citing papers

AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

AdaTok learns content-dependent token budgets for discrete 1D image tokenization via prioritized representation learning and a GRPO allocation policy, achieving rFID 1.50 at ~118 tokens average versus fixed 256-token baselines.

ChannelTok: Efficient Flexible-Length Vision Tokenization

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

ChannelTok introduces channel-wise tokenization with stochastic tail-dropping to achieve rFID 2.92 on ImageNet at 8.6x faster decoding and 2.1x smaller size than prior flexible tokenizers.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

Structure over Pixels: Learning Variable-Length Visual Programs

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

STROP learns variable-length discrete visual programs for images by training a length head against frozen DINOv3 features in a four-phase curriculum while bypassing pixel reconstruction.

DC-DiT: Adaptive Compute and Elastic Inference for Visual Generation via Dynamic Chunking

cs.CV · 2026-03-06 · unverdicted · novelty 7.0

DC-DiT learns dynamic chunking to allocate fewer tokens to smooth or noisy regions and more to detailed or late-stage areas, cutting inference FLOPs up to 36.8% while improving FID up to 37.8% on class-conditional ImageNet generation.

TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization

cs.CV · 2026-06-16 · unverdicted · novelty 6.0

TivTok factorizes video clips into reusable time-invariant tokens and frame-specific time-variant tokens via Scope-Induced Factorization and Invariant Broadcasting, achieving 2.91x better compression for 128-frame videos on benchmarks.

Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting

cs.CV · 2026-06-04 · unverdicted · novelty 6.0

A parameter-free approach drops redundant video tokens via temporal L1 differences in frozen latent space and reconstructs them with LIT, yielding 31x speedup over ElasticTok-CV on TokenBench and DAVIS.

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

ELT: Elastic Looped Transformers for Visual Generation

cs.CV · 2026-04-10 · unverdicted · novelty 6.0

Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.

Accelerating Vision Transformers with Adaptive Patch Sizes

cs.CV · 2025-10-20 · conditional · novelty 6.0

APT adaptively varies patch sizes within a single image to reduce ViT token count, delivering 40-50% throughput gains on large models with no downstream performance loss.

FAST: Efficient Action Tokenization for Vision-Language-Action Models

cs.RO · 2025-01-16 · unverdicted · novelty 6.0

FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.

Swift Sampling: Selecting Temporal Surprises via Taylor Series

cs.CV · 2026-05-21 · unverdicted · novelty 5.0

Swift Sampling is a training-free frame selection method that uses Taylor expansions on video latent trajectories to pick temporally surprising frames, outperforming uniform sampling on long-video QA tasks.

citing papers explorer

Showing 11 of 11 citing papers after filters.

AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens cs.CV · 2026-06-05 · unverdicted · none · ref 53
AdaTok learns content-dependent token budgets for discrete 1D image tokenization via prioritized representation learning and a GRPO allocation policy, achieving rFID 1.50 at ~118 tokens average versus fixed 256-token baselines.
ChannelTok: Efficient Flexible-Length Vision Tokenization cs.CV · 2026-06-03 · unverdicted · none · ref 27
ChannelTok introduces channel-wise tokenization with stochastic tail-dropping to achieve rFID 2.92 on ImageNet at 8.6x faster decoding and 2.1x smaller size than prior flexible tokenizers.
Diffusing in the Right Space: A Systematic Study of Latent Diffusability cs.CV · 2026-06-02 · unverdicted · none · ref 90
A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.
Structure over Pixels: Learning Variable-Length Visual Programs cs.CV · 2026-05-26 · unverdicted · none · ref 11
STROP learns variable-length discrete visual programs for images by training a length head against frozen DINOv3 features in a four-phase curriculum while bypassing pixel reconstruction.
DC-DiT: Adaptive Compute and Elastic Inference for Visual Generation via Dynamic Chunking cs.CV · 2026-03-06 · unverdicted · none · ref 16
DC-DiT learns dynamic chunking to allocate fewer tokens to smooth or noisy regions and more to detailed or late-stage areas, cutting inference FLOPs up to 36.8% while improving FID up to 37.8% on class-conditional ImageNet generation.
TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization cs.CV · 2026-06-16 · unverdicted · none · ref 93
TivTok factorizes video clips into reusable time-invariant tokens and frame-specific time-variant tokens via Scope-Induced Factorization and Invariant Broadcasting, achieving 2.91x better compression for 128-frame videos on benchmarks.
Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting cs.CV · 2026-06-04 · unverdicted · none · ref 31
A parameter-free approach drops redundant video tokens via temporal L1 differences in frozen latent space and reconstructs them with LIT, yielding 31x speedup over ElasticTok-CV on TokenBench and DAVIS.
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion cs.CV · 2026-05-08 · unverdicted · none · ref 94
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
ELT: Elastic Looped Transformers for Visual Generation cs.CV · 2026-04-10 · unverdicted · none · ref 78
Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
FAST: Efficient Action Tokenization for Vision-Language-Action Models cs.RO · 2025-01-16 · unverdicted · none · ref 65
FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.
Swift Sampling: Selecting Temporal Surprises via Taylor Series cs.CV · 2026-05-21 · unverdicted · none · ref 51
Swift Sampling is a training-free frame selection method that uses Taylor expansions on video latent trajectories to pick temporally surprising frames, outperforming uniform sampling on long-video QA tasks.

Elastictok: Adaptive tokenization for image and video

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer