archive
Every paper Pith has read. Search by title, abstract, or pith.
9568 papers in cs.CV · page 14
-
Lance beats prior open models at image and video generation
Lance: Unified Multimodal Modeling by Multi-Task Synergy
-
Fused Earth embeddings beat best single model in four of six tasks
Better Together: Evaluating the Complementarity of Earth Embedding Models
-
Learned controller improves long-horizon GUI agents via selective memory
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
-
Geometric primitives recover object joints from casual videos
Articulation in Prime: Primitive-Based Articulated Object Understanding from a Single Casual Video
-
Latent reasoning improves models without appearing at inference
Leveraging Latent Visual Reasoning in Silence
-
Dual controller reuses plans to cut game agent costs 55%
SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents
-
Cross-view data and explicit alignment advance MLLM spatial reasoning
CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark
-
ManiSoft benchmark tests vision-language control on soft robotic arms
ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics
-
Sign-aware aggregation sustains unlearning across sequential VLM requests
CATA: Continual Machine Unlearning via Conflict-Averse Task Arithmetic
-
Forward bridging of style proxies stabilizes continual adaptation
Dance Across Shifts: Forward-Facilitation Continual Test-Time Adaptation through Dynamic Style Bridging
-
Token limits force VLMs to learn active perception
Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth
-
Natural language lets video models control multiple entities at once
Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models
-
Decoupling tokens fixes spatial bias in novel view synthesis
Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling
-
Benchmark measures when models should speak in video streams
OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
-
Quality signals steer flow matching to fix occluded hands in video
StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video
-
Low-rank attention enables hyperspectral models to handle sensor shifts
LESSViT: Robust Hyperspectral Representation Learning under Spectral Configuration Shift
-
Color features alone classify cancer at up to 89% accuracy
Beyond Morphology: Quantifying the Diagnostic Power of Color Features in Cancer Classification
-
Weak supervision enables better radar scene flow than LiDAR methods
Weakly Supervised Cross-Modal Learning for 4D Radar Scene Flow Estimation
-
2D images and odometry beat LiDAR for radar scene flow
Weakly Supervised Cross-Modal Learning for 4D Radar Scene Flow Estimation
-
Self-distilled MIM leads medical segmentation transfer
Benchmarking transferability of SSL pretraining to same and different modality segmentation tasks
-
First end-to-end model jointly edits audio and video from text
InstructAV2AV: Instruction-Guided Audio-Video Joint Editing
-
Speech supervision improves MRI vocal tract segmentation at test time
Speech-Guided Multimodal Learning for Vocal Tract Segmentation in Real-Time MRI
-
Recurrent reasoning adapts CLIP with 6K parameters
PERL: Parameter Efficient Reasoning in CLIP Latent Space
-
Agent turns top-down room images into executable Blender code
Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis
-
NeRF extensions fix illumination and pose issues for spacecraft models
NeRF-based Spacecraft Reconstruction from Monocular Imagery Under Illumination Variability and Pose Uncertainty
-
Per-image tweaks let NeRF reconstruct spacecraft despite lighting shifts and pose errors
NeRF-based Spacecraft Reconstruction from Monocular Imagery Under Illumination Variability and Pose Uncertainty
-
Accuracy unchanged when latent visual tokens replaced by dummies
What's Holding Back Latent Visual Reasoning?
-
1,309-page dataset targets handwritten music recognition
A Dataset for the Recognition of Historical and Handwritten Music Scores in Western Notation
-
Text guidance focuses full images for cropped-query e-commerce search
TIGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval
-
Multi-robot MLLM lifts spatial reasoning accuracy by 7 percent
Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models
-
Geometry-aware coresets lift VLM accuracy in pathology without training
Geometry-Aware Uncertainty Coresets for Robust Visual In-Context Learning in Histopathology
-
Infrastructure dataset shows foundation models fall short on defects
Cracks in the Foundation: A Civil Infrastructure Dataset to Challenge Vision Foundation Models
4 Piths
-
AIS data alone builds graph for global ship arrival forecasts
Historical Knowledge Graphs for Global Maritime Estimated Time of Arrival
-
Cross-ratios unify across grades in n-dimensional PGA
Generalize cross-ratios in n-dimensional Plane-Based Geometric Algebra
-
Agent planner raises physical accuracy in video models
NEWTON: Agentic Planning for Physically Grounded Video Generation
-
Frozen vision model serves as generalist image tokenizer
Vision Foundation Models as Generalist Tokenizers for Image Generation
-
Reward makes video generators obey scene geometry
GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
-
Learned bias in visual attention boosts multimodal models by 3 points
RAVE: Re-Allocating Visual Attention in Large Multimodal Models
-
Parameter-free attention matches CSRNet accuracy in crowd counting
Optimising CSRNet with parameter-free attention mechanisms for crowd counting in public transport
-
KV selection per frame and head speeds video diffusion 1.48x
Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion
-
Skew Gaussians cut artifacts in real-time 3D scene views
3D Skew Gaussian Splatting with Any Camera Trajectory Visualization Engine
-
Deep ensembles calibrate uncertainty better than cross-validation in segmentation
Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation
-
Deep ensembles calibrate uncertainty better than cross-validation folds
Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation
-
Separate ViT encoding plus cross-attention improves VP background matting
CineMatte: Background Matting for Virtual Production and Beyond
-
RAE v2 reaches SOTA gFID 1.06 in 80 epochs on ImageNet
Improved Baselines with Representation Autoencoders
-
Wasserstein criterion boosts accuracy of small medical image QA models
Wasserstein Equilibrium Decoding for Reliable Medical Visual Question Answering
-
Port-Hamiltonian routing shrinks latent space by 4-8% in world models
PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics
-
Single-pass Hamming loss yields collision-resistant fine-grained hashes
Collision-Resistant Single-Pass Method for Unsupervised Fine-Grained Image Hashing
-
Information Bottleneck Adapter (IB-Adapter) targets robust vision-language-action models without extra data
StableVLA: Towards Robust Vision-Language-Action Models without Extra Data
-
Semantic compression unlocks exact-likelihood image generation by flows
SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation