An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby · 2021

14 Pith papers cite this work.

citation-role summary: background 1

citation-polarity summary: background 1

representative citing papers

TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

TurboVGGT uses adaptive sparse global attention with varying sparsity levels across frames and layers plus frame attention to enable faster multi-view 3D reconstruction while keeping competitive quality versus prior state-of-the-art methods.

MedCore: Boundary-Preserving Medical Core Pruning for MedSAM

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

MedCore achieves 60% parameter and 58.4% FLOP reduction on MedSAM with Dice 0.9549 and preserved boundary metrics via dual-intervention pruning and a new boundary leverage principle.

DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

DORA uses an online RL agent to adaptively merge tokens in Vision Transformers, reporting better accuracy-efficiency trade-offs than static baselines on ImageNet and OOD sets.

The Indra Representation Hypothesis for Multimodal Alignment

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

Unimodal model representations converge to a relational structure captured by the Indra representation via V-enriched Yoneda embedding, which is unique and structure-preserving and improves cross-model and cross-modal robustness when instantiated with angular distance.

Winfree Oscillatory Neural Network

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

WONN is a new oscillatory neural network based on generalized Winfree dynamics that scales competitively to ImageNet-1K and reaches 80.1% accuracy on Maze-hard with 1% of prior model parameters.

NARA: Anchor-Conditioned Relation-Aware Contextualization of Heterogeneous Geoentities

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

NARA introduces a unified self-supervised method for learning relational, context-dependent representations of heterogeneous vector geoentities that improves performance on building classification, traffic prediction, and POI recommendation.

When Labels Have Structure: Improving Image Classification with Hierarchy-Aware Cross-Entropy

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Hierarchy-Aware Cross-Entropy improves image classification by incorporating class hierarchies into the loss through prediction aggregation and ancestral label smoothing, achieving mean accuracy gains of 4.66% in end-to-end training and 2.18% in linear probing.

Uncertainty-Aware Foundation Models for Clinical Data

cs.LG · 2026-04-05 · unverdicted · novelty 6.0

The work introduces uncertainty-aware foundation models for clinical data by learning set-valued patient representations that enforce consistency across partial observations and integrate multimodal self-supervised objectives.

Vision Transformers Need Better Token Interaction

cs.CV · 2026-05-22 · unverdicted · novelty 5.0

Replacing softmax attention with entmax-1.5 in DINOv1 ViT-S/16 improves semantic segmentation mIoU on three benchmarks while keeping ImageNet linear-probing accuracy unchanged.

stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation

cs.LG · 2026-05-20 · unverdicted · novelty 5.0

The paper presents stable-worldmodel (swm), a platform with high-performance data layer, modern world model baselines, planning solvers, and extended environments for reproducible research and generalization evaluation.

Matched-Learning-Rate Analysis of Attention Drift and Transfer Retention in Fine-Tuned CLIP

cs.LG · 2026-04-01 · unverdicted · novelty 4.0

Matched learning-rate experiments show LoRA retains substantially higher zero-shot transfer (45% vs 11% on EuroSAT, 58% vs 9% on Pets) than Full FT in CLIP adaptation.

State Space Models for Bioacoustics: A Comparative Evaluation with Transformers

cs.SD · 2025-12-03 · unverdicted · novelty 4.0

BioMamba matches Transformer performance on bioacoustics tasks while using significantly less VRAM.

Sharpness-Aware Minimization with Z-Score Gradient Filtering

cs.LG · 2025-05-05 · unverdicted · novelty 4.0

Z-Score Filtered SAM retains only high absolute Z-score gradient components per layer during the ascent step and reports higher test accuracy than standard SAM on CIFAR and Tiny-ImageNet benchmarks.

From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers

cs.CV · 2025-11-19

citing papers explorer

Showing 6 of 6 citing papers after filters.

TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention

cs.CV · 2026-05-14 · unverdicted · none · ref 10

MedCore: Boundary-Preserving Medical Core Pruning for MedSAM

cs.CV · 2026-05-13 · unverdicted · none · ref 4

DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers

cs.CV · 2026-05-12 · unverdicted · none · ref 1

The Indra Representation Hypothesis for Multimodal Alignment

cs.CV · 2026-04-06 · unverdicted · none · ref 12

Vision Transformers Need Better Token Interaction

cs.CV · 2026-05-22 · unverdicted · none · ref 1

From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers

cs.CV · 2025-11-19 · unreviewed · ref 11
