hub Mixed citations

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, David Novotny · 2025 · 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · DOI 10.1109/cvpr52734.2025.00499

Mixed citation behavior. Most common role is method (57%).

23 Pith papers citing it

150 external citations · external index

Method 57% of classified citations

open at publisher browse 23 citing papers

hub tools

JSON dossier citing papers JSON publisher DOI

citation-role summary

method 4 background 2 baseline 1

citation-polarity summary

use method 4 background 2 baseline 1

representative citing papers

Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.

Geo-Align: Video Generation Alignment via Metric Geometry Reward

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory

cs.CV · 2026-05-17 · unverdicted · novelty 7.0

Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.

TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

TurboVGGT uses adaptive sparse global attention with varying sparsity levels across frames and layers plus frame attention to enable faster multi-view 3D reconstruction while keeping competitive quality versus prior state-of-the-art methods.

VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

cs.RO · 2026-05-11 · unverdicted · novelty 7.0

VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

Cross3R performs feed-forward 3D reconstruction and 6-DoF pose estimation from any combination of satellite, UAV, and ground images, outperforming baselines on a new 278K-image tri-view dataset.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

FrameVGGT: Geometry-Aligned Frame-Level Memory for Bounded Streaming VGGT

cs.CV · 2026-03-08 · unverdicted · novelty 7.0

FrameVGGT replaces token-level KV retention with frame-level segments and prototypes to bound memory while preserving geometric coherence in streaming VGGT.

PIXLRelight: Controllable Relighting via Intrinsic Conditioning

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

A transformer-based neural renderer that transfers arbitrary PBR lighting to single images via shared intrinsic conditioning extracted from both multi-illumination photos and path-traced coarse 3D renders.

Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

A closed-form scalar frame-level gate α_t derived from internal feature changes extends effective memory in recurrent 3D reconstruction and improves accuracy on long sequences up to 4541 frames.

Unlocking Dense Metric Depth Estimation in VLMs

cs.CV · 2026-05-15 · unverdicted · novelty 6.0 · 2 refs

DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new indoor-outdoor benchmark.

4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

A training-free progressive decoupling framework improves dynamic depth estimation in 4D reconstruction via mask-guided pose decoupling, topological subspace surgery, and Bayesian fusion, yielding better point-cloud metrics on benchmarks.

CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

cs.CV · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and trajectory accuracy on the NAVSIM v1 benchmark.

TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.

Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors

cs.CV · 2026-04-10 · unverdicted · novelty 6.0

The Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors outperforms prior methods on dynamic benchmarks by cutting Mean Accuracy error 13.43% and raising segmentation F-measure 10.49% via three uncertainty mechanisms while keeping feed-forward speed.

SurfelSplat: Learning Efficient and Generalizable Gaussian Surfel Representations for Sparse-View Surface Reconstruction

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

A feed-forward model regresses accurate Gaussian surfel geometry from sparse views using Nyquist-guided cross-view feature aggregation, achieving 100x speedup over optimization-based 3DGS surface methods on DTU benchmarks.

HD-VGGT: High-Resolution Visual Geometry Transformer

cs.CV · 2026-03-28 · unverdicted · novelty 6.0

HD-VGGT achieves state-of-the-art high-resolution 3D reconstruction from image collections via a dual-branch architecture that predicts coarse geometry at low resolution and refines details at high resolution while modulating unreliable features.

One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems

cs.CV · 2026-05-21 · unverdicted · novelty 5.0

A hierarchical multi-agent framework converts a single sentence into a short drama using debate-based scripting, 3D-grounded first frames for spatial consistency, and multi-stage reviewer loops.

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

cs.CV · 2026-05-14 · unverdicted · novelty 5.0

SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher throughput than prior open baselines.

Context Unrolling in Omni Models

cs.CV · 2026-04-23 · unverdicted · novelty 5.0

Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

cs.GR · 2026-05-05 · unverdicted · novelty 4.0 · 2 refs

JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

cs.CV · 2026-04-06

citing papers explorer

Showing 23 of 23 citing papers.

Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views cs.CV · 2026-05-08 · unverdicted · none · ref 31
GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.
Geo-Align: Video Generation Alignment via Metric Geometry Reward cs.CV · 2026-05-22 · unverdicted · none · ref 43
Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory cs.CV · 2026-05-17 · unverdicted · none · ref 28
Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.
TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention cs.CV · 2026-05-14 · unverdicted · none · ref 39
TurboVGGT uses adaptive sparse global attention with varying sparsity levels across frames and layers plus frame attention to enable faster multi-view 3D reconstruction while keeping competitive quality versus prior state-of-the-art methods.
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 39
VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models cs.CV · 2026-05-11 · unverdicted · none · ref 37
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images cs.CV · 2026-05-08 · unverdicted · none · ref 39
Cross3R performs feed-forward 3D reconstruction and 6-DoF pose estimation from any combination of satellite, UAV, and ground images, outperforming baselines on a new 278K-image tri-view dataset.
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation cs.LG · 2026-04-11 · unverdicted · none · ref 11
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
FrameVGGT: Geometry-Aligned Frame-Level Memory for Bounded Streaming VGGT cs.CV · 2026-03-08 · unverdicted · none · ref 11
FrameVGGT replaces token-level KV retention with frame-level segments and prototypes to bound memory while preserving geometric coherence in streaming VGGT.
PIXLRelight: Controllable Relighting via Intrinsic Conditioning cs.CV · 2026-05-18 · unverdicted · none · ref 42
A transformer-based neural renderer that transfers arbitrary PBR lighting to single images via shared intrinsic conditioning extracted from both multi-illumination photos and path-traced coarse 3D renders.
Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction cs.CV · 2026-05-16 · unverdicted · none · ref 21
A closed-form scalar frame-level gate α_t derived from internal feature changes extends effective memory in recurrent 3D reconstruction and improves accuracy on long sequences up to 4541 frames.
Unlocking Dense Metric Depth Estimation in VLMs cs.CV · 2026-05-15 · unverdicted · none · ref 54 · 2 links
DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new indoor-outdoor benchmark.
4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation cs.CV · 2026-05-12 · unverdicted · none · ref 20
A training-free progressive decoupling framework improves dynamic depth estimation in 4D reconstruction via mask-guided pose decoupling, topological subspace surgery, and Bayesian fusion, yielding better point-cloud metrics on benchmarks.
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving cs.CV · 2026-05-11 · unverdicted · none · ref 37 · 2 links
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and trajectory accuracy on the NAVSIM v1 benchmark.
TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation cs.CV · 2026-05-07 · unverdicted · none · ref 37
TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.
Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors cs.CV · 2026-04-10 · unverdicted · none · ref 4
The Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors outperforms prior methods on dynamic benchmarks by cutting Mean Accuracy error 13.43% and raising segmentation F-measure 10.49% via three uncertainty mechanisms while keeping feed-forward speed.
SurfelSplat: Learning Efficient and Generalizable Gaussian Surfel Representations for Sparse-View Surface Reconstruction cs.CV · 2026-04-09 · unverdicted · none · ref 68
A feed-forward model regresses accurate Gaussian surfel geometry from sparse views using Nyquist-guided cross-view feature aggregation, achieving 100x speedup over optimization-based 3DGS surface methods on DTU benchmarks.
HD-VGGT: High-Resolution Visual Geometry Transformer cs.CV · 2026-03-28 · unverdicted · none · ref 3
HD-VGGT achieves state-of-the-art high-resolution 3D reconstruction from image collections via a dual-branch architecture that predicts coarse geometry at low resolution and refines details at high resolution while modulating unreliable features.
One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems cs.CV · 2026-05-21 · unverdicted · none · ref 40
A hierarchical multi-agent framework converts a single sentence into a short drama using debate-based scripting, 3D-grounded first frames for spatial consistency, and multi-stage reviewer loops.
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer cs.CV · 2026-05-14 · unverdicted · none · ref 66
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher throughput than prior open baselines.
Context Unrolling in Omni Models cs.CV · 2026-04-23 · unverdicted · none · ref 40
Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation cs.GR · 2026-05-05 · unverdicted · none · ref 81 · 2 links
JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models cs.CV · 2026-04-06 · unreviewed · ref 123

Vggt: Visual geometry grounded transformer

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer