archive

Every paper Pith has read.

9568 papers in cs.CV · page 14

cs.CV 2026-05-18 reviewed

Lance beats prior open models at image and video generation
Lance: Unified Multimodal Modeling by Multi-Task Synergy

Fengyi Fu +12
cs.CV 2026-05-18 reviewed

Fused Earth embeddings beat best single model in four of six tasks
Better Together: Evaluating the Complementarity of Earth Embedding Models

Thijs L van der Plas +5
cs.CV 2026-05-18 reviewed

Learned controller improves long-horizon GUI agents via selective memory
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

Ziyun Zeng +5
cs.CV 2026-05-18 reviewed

Geometric primitives recover object joints from casual videos
Articulation in Prime: Primitive-Based Articulated Object Understanding from a Single Casual Video

Arslan Artykov +3
cs.CV 2026-05-18 reviewed

Latent reasoning improves models without appearing at inference
Leveraging Latent Visual Reasoning in Silence

Dongyao Zhu +9
cs.CV 2026-05-18 reviewed

Dual controller reuses plans to cut game agent costs 55%
SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents

Wencan Jiang +8
cs.CV 2026-05-18 reviewed

Cross-view data and explicit alignment advance MLLM spatial reasoning
CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

Wei Wang +6
cs.RO 2026-05-18 reviewed

ManiSoft benchmark tests vision-language control on soft robotic arms
ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

Ziyu Wei +4
cs.CV 2026-05-18 reviewed

Sign-aware aggregation sustains unlearning across sequential VLM requests
CATA: Continual Machine Unlearning via Conflict-Averse Task Arithmetic

Shen Lin +5
cs.CV 2026-05-18 reviewed

Forward bridging of style proxies stabilizes continual adaptation
Dance Across Shifts: Forward-Facilitation Continual Test-Time Adaptation through Dynamic Style Bridging

Zhilin Zhu +5
cs.CV 2026-05-18 reviewed

Token limits force VLMs to learn active perception
Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth

Yuhuan Wu +4
cs.CV 2026-05-18 reviewed

Natural language lets video models control multiple entities at once
Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models

Shangwen Zhu +13
cs.CV 2026-05-18 reviewed

Decoupling tokens fixes spatial bias in novel view synthesis
Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling

Yihang Wu +6
cs.CV 2026-05-18 reviewed

Benchmark measures when models should speak in video streams
OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

Ruixiang Zhao +6
cs.CV 2026-05-18 reviewed

Quality signals steer flow matching to fix occluded hands in video
StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video

Huajian Zeng +5
cs.CV 2026-05-18 reviewed

Low-rank attention enables hyperspectral models to handle sensor shifts
LESSViT: Robust Hyperspectral Representation Learning under Spectral Configuration Shift

Haozhe Si +4
cs.CV 2026-05-18 reviewed

Color features alone classify cancer at up to 89% accuracy
Beyond Morphology: Quantifying the Diagnostic Power of Color Features in Cancer Classification

Farnaz Kheiri +2
cs.CV 2026-05-18 reviewed

Weak supervision enables better radar scene flow than LiDAR methods
Weakly Supervised Cross-Modal Learning for 4D Radar Scene Flow Estimation

Jingyun Fu +2
cs.CV 2026-05-18 reviewed

2D images and odometry beat LiDAR for radar scene flow
Weakly Supervised Cross-Modal Learning for 4D Radar Scene Flow Estimation

Jingyun Fu +2
cs.CV 2026-05-18 reviewed

Self-distilled MIM leads medical segmentation transfer
Benchmarking transferability of SSL pretraining to same and different modality segmentation tasks

Jue Jiang +1
cs.CV 2026-05-18 reviewed

First end-to-end model jointly edits audio and video from text
InstructAV2AV: Instruction-Guided Audio-Video Joint Editing

Haojie Zheng +4
cs.CV 2026-05-18 reviewed

Speech supervision improves MRI vocal tract segmentation at test time
Speech-Guided Multimodal Learning for Vocal Tract Segmentation in Real-Time MRI

Daiqi Liu +13
cs.CV 2026-05-18 reviewed

Recurrent reasoning adapts CLIP with 6K parameters
PERL: Parameter Efficient Reasoning in CLIP Latent Space

Simone Carnemolla +4
cs.CV 2026-05-18 reviewed

Agent turns top-down room images into executable Blender code
Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis

Yixuan Yang +7
cs.CV 2026-05-18 reviewed

NeRF extensions fix illumination and pose issues for spacecraft models
NeRF-based Spacecraft Reconstruction from Monocular Imagery Under Illumination Variability and Pose Uncertainty

Antoine Legrand +2
cs.CV 2026-05-18 reviewed

Per-image tweaks let NeRF reconstruct spacecraft despite lighting shifts and pose errors
NeRF-based Spacecraft Reconstruction from Monocular Imagery Under Illumination Variability and Pose Uncertainty

Antoine Legrand +2
cs.CV 2026-05-18 reviewed

Accuracy unchanged when latent visual tokens replaced by dummies
What's Holding Back Latent Visual Reasoning?

André G. Viveiros +3
cs.CV 2026-05-18 reviewed

1,309-page dataset targets handwritten music recognition
A Dataset for the Recognition of Historical and Handwritten Music Scores in Western Notation

Pau Torras +9
cs.IR 2026-05-18 reviewed

Text guidance focuses full images for cropped-query e-commerce search
TIGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval

Xinyu Sun +7
cs.CV 2026-05-18 reviewed

Multi-robot MLLM lifts spatial reasoning accuracy by 7 percent
Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models

Kunyu Peng +11
cs.CV 2026-05-18 reviewed

Geometry-aware coresets lift VLM accuracy in pathology without training
Geometry-Aware Uncertainty Coresets for Robust Visual In-Context Learning in Histopathology

Franciskus Xaverius Erick +2
cs.CV 2026-05-18 reviewed

Infrastructure dataset shows foundation models fall short on defects
Cracks in the Foundation: A Civil Infrastructure Dataset to Challenge Vision Foundation Models

Nicola Farronato +8

4 Piths
cs.CV 2026-05-18 reviewed

AIS data alone builds graph for global ship arrival forecasts
Historical Knowledge Graphs for Global Maritime Estimated Time of Arrival

Neofytos Dimitriou
cs.CG 2026-05-18 reviewed

Cross-ratios unify across grades in n-dimensional PGA
Generalize cross-ratios in n-dimensional Plane-Based Geometric Algebra

Enzo Harquin (LIGM) +4
cs.CV 2026-05-18 reviewed

Agent planner raises physical accuracy in video models
NEWTON: Agentic Planning for Physically Grounded Video Generation

Yuxiang Feng +9
cs.CV 2026-05-18 reviewed

Frozen vision model serves as generalist image tokenizer
Vision Foundation Models as Generalist Tokenizers for Image Generation

Anlin Zheng +7
cs.CV 2026-05-18 reviewed

Reward makes video generators obey scene geometry
GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

Jan Ackermann +5
cs.CV 2026-05-18 reviewed

Learned bias in visual attention boosts multimodal models by 3 points
RAVE: Re-Allocating Visual Attention in Large Multimodal Models

Xi Leng +6
cs.CV 2026-05-18 reviewed

Parameter-free attention matches CSRNet accuracy in crowd counting
Optimising CSRNet with parameter-free attention mechanisms for crowd counting in public transport

Aida Rostamza +3
cs.CV 2026-05-18 reviewed

KV selection per frame and head speeds video diffusion 1.48x
Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion

Peiliang Cai +10
cs.CV 2026-05-18 reviewed

Skew Gaussians cut artifacts in real-time 3D scene views
3D Skew Gaussian Splatting with Any Camera Trajectory Visualization Engine

Beizhen Zhao +4
cs.CV 2026-05-18 reviewed

Deep ensembles calibrate uncertainty better than cross-validation in segmentation
Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation

Tristan Kirscher (ICube) +9
cs.CV 2026-05-18 reviewed

Deep ensembles calibrate uncertainty better than cross-validation folds
Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation

Tristan Kirscher (ICube) +9
cs.CV 2026-05-18 reviewed

Separate ViT encoding plus cross-attention improves VP background matting
CineMatte: Background Matting for Virtual Production and Beyond

Yuanjian He +3
cs.CV 2026-05-18 reviewed

RAE v2 reaches SOTA gFID 1.06 in 80 epochs on ImageNet
Improved Baselines with Representation Autoencoders

Jaskirat Singh +5
cs.CV 2026-05-18 reviewed

Wasserstein criterion boosts accuracy of small medical image QA models
Wasserstein Equilibrium Decoding for Reliable Medical Visual Question Answering

Luca Hagen +4
cs.LG 2026-05-18 reviewed

Port-Hamiltonian routing shrinks latent space by 4-8% in world models
PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics

Xueyu Luan +1
cs.CV 2026-05-18 reviewed

Single-pass Hamming loss yields collision-resistant fine-grained hashes
Collision-Resistant Single-Pass Method for Unsupervised Fine-Grained Image Hashing

Anh-Kiet Duong +2
cs.CV 2026-05-18 reviewed

Information Bottleneck Adapter targets robust VLA models without extra data
StableVLA: Towards Robust Vision-Language-Action Models without Extra Data

Yiyang Fu +9
cs.CV 2026-05-18 reviewed

Semantic compression unlocks exact-likelihood image generation by flows
SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation

Longtao Jiang +6