PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preserving reconstruction quality on ScanNet-50 and 7-Scenes.
An image is worth 16x16 words: Transformers for image recognition at scale
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 6roles
background 2polarities
background 2representative citing papers
Decouples semantic and spatial tokens in NVS transformers to resolve representation ambiguity, yielding consistent gains with near-zero added latency.
ChunkFlow achieves up to 1.28x step-time speedup and up to 49% lower peak GPU memory for DiT inference by using a first-order model to guide communication-aware chunked prefetching.
PRPO is a paragraph-level policy optimization technique that grounds vision-language model reasoning in image content to raise deepfake detection accuracy and reasoning quality.
Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.
ConvFormer3D-TAP classifies six cine CMR views at 96% accuracy using 3D conv tokenization, multiscale attention, and uncertainty-aware multi-clip fusion on 150k sequences.
citing papers explorer
-
PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers
PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preserving reconstruction quality on ScanNet-50 and 7-Scenes.
-
Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling
Decouples semantic and spatial tokens in NVS transformers to resolve representation ambiguity, yielding consistent gains with near-zero added latency.
-
ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference
ChunkFlow achieves up to 1.28x step-time speedup and up to 49% lower peak GPU memory for DiT inference by using a first-order model to guide communication-aware chunked prefetching.
-
PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection
PRPO is a paragraph-level policy optimization technique that grounds vision-language model reasoning in image content to raise deepfake detection accuracy and reasoning quality.
-
Bernini: Latent Semantic Planning for Video Diffusion
Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.
-
ConvFormer3D-TAP: Phase/Uncertainty-Aware Front-End Fusion for Cine CMR View Classification Pipelines
ConvFormer3D-TAP classifies six cine CMR views at 96% accuracy using 3D conv tokenization, multiscale attention, and uncertainty-aware multi-clip fusion on 150k sequences.