archive

Every paper Pith has read.

9568 papers in cs.CV · page 5

cs.CV 2026-05-21 reviewed

RL agent learns to plan and execute restoration tool sequences
OPERA: An Agent for Image Restoration with End-to-End Joint Planning-Execution Optimization

Feng Zhu +4
cs.CV 2026-05-21 reviewed

Text embeddings boost ImageNet accuracy by up to 2.7 points
TextTeacher: What Can Language Teach About Images?

Tobias Christian Nauen +5
cs.CV 2026-05-21 reviewed

VISTA raises rare VCE event detection to 0.37 mAP on hidden test
VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection -- after competition results

Bo-Cheng Qiu +5
cs.CV 2026-05-21 reviewed

Latent future scenes improve VLA driving over pixel reconstruction
LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model

Xiaodong Mei +5
cs.CV 2026-05-21 reviewed

GenHAR raises cross-domain HAR accuracy 9.97% with 6.4x fewer operations
GenHAR: Generalizing Cross-domain Human Activity Recognition for Last-mile Delivery

Zhiqing Hong +7
cs.CV 2026-05-21 reviewed

General models gain far more from images than medical ones in licensing exams
JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation

Yue Xun +12
cs.AI 2026-05-21 reviewed

Training-free pooling lifts Video LLM accuracy without retraining
Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding

Bingjun Luo +3
cs.CL 2026-05-21 reviewed

Anchoring attention improves multimodal reasoning with less data
Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

Changyuan Tian +9
cs.CV 2026-05-21 reviewed

Spline-based warp gives accurate start for sparse 3DGS
TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting

Hyeseong Kim +3
cs.CV 2026-05-21 reviewed

Benchmark enables open tree decomposition of images
COCOTree: A Dataset and Benchmark for Open Tree-Structured Visual Decomposition

Junhyub Lee +2
cs.CV 2026-05-21 reviewed

Framework turns 2D heart ultrasounds into accurate 4D models
Echo4DIR: 4D Implicit Heart Reconstruction from 2D Echocardiography Videos

Yanan Liu +7
cs.CV 2026-05-21 reviewed

Multimodal side info sharpens ultra-low bitrate reconstructions
Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates

Guojun Xu +5
cs.CV 2026-05-21 reviewed

Frequency split lets VFX models train with far less data
EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation

Yue Ma +11
cs.CV 2026-05-21 reviewed

Broken artifacts flag memorized images in diffusion models
Broken Memories: Detecting and Mitigating Memorization in Diffusion Models with Degraded Generations

Yuanmin Huang +6
cs.CV 2026-05-21 reviewed

Broken artifacts flag memorized training data in diffusion models
Broken Memories: Detecting and Mitigating Memorization in Diffusion Models with Degraded Generations

Yuanmin Huang +6
cs.CV 2026-05-21 reviewed

Digital twin locates heart scars from ECG and MRI
Physiology and Anatomy Aware Inverse Inference of Myocardial Infarction for Cardiac Digital Twin

Mengxiao Wang +8
cs.CV 2026-05-21 reviewed

BEV maps from RGB-D cut tokens yet raise VLN success rates
GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation

Jiahao Yang +6
cs.CV 2026-05-21 reviewed

Hypernetwork builds on-the-fly LoRA adapters for continual VQA
HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering

Yiran Wang +5
cs.CV 2026-05-21 reviewed

AgroVG benchmark shows top models at 0.35 Set-F1 on farm targets
AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding

Haocheng Li +7
cs.CV 2026-05-21 reviewed

Mamba router splits resident and non-resident evidence for MRI
SO-Mamba: State-Ownership Mamba for Unrolled MRI Reconstruction

Pengcheng Fang +5
cs.CV 2026-05-21 reviewed

ForeSplat trains 3DGS predictors for faster optimizer convergence
ForeSplat: Optimization-Aware Foresight for Feed-Forward 3D Gaussian Splatting

Yuke Li +10
cs.CV 2026-05-21 reviewed

Optimization-aware training makes 3DGS predictions refine faster and better
ForeSplat: Optimization-Aware Foresight for Feed-Forward 3D Gaussian Splatting

Yuke Li +10
cs.CV 2026-05-21 reviewed

Dataset records real flooded roads for self-driving cars
FRED: A Multi-Modal Autonomous Driving Dataset for Flooded Road Environments

Connor Malone +2
cs.CV 2026-05-21 reviewed

Context-guided diffusion plus energy fix yields consistent agent paths
Diverse Yet Consistent: Context-Guided Diffusion with Energy-Based Joint Refinement for Multi-Agent Motion Prediction

Lei Chu +1
cs.CV 2026-05-21 reviewed

Prior outputs double token cuts in video diffusion for 4.5x speedup
ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration

Hangyeol Lee +1
cs.CV 2026-05-21 reviewed

Reasoning paths in training data lift 3D point cloud models
PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought

Chaoqi Chen +3
cs.CL 2026-05-21 reviewed

Latent reasoning beats text CoT for audio-visual tasks
LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

Yifan Dai +20
cs.CV 2026-05-21 reviewed

Output similarities cut token costs in diffusion models
Rethinking Token Reduction for Diffusion Models via Output-Similarity-Awareness

Hangyeol Lee +2
cs.CV 2026-05-21 reviewed

Fractal term sharpens ConvNeXt segmentation on medical images
ConvNeXt-FD: A Fractal-Based Deep Model for Robust Biomedical Image Segmentation

Joao Batista Florindo +1
cs.CV 2026-05-21 reviewed

Method turns BIT phase volumes into realistic 3D H&E stains
Virtual 3D H&E Staining from Phase-contrast Back-illumination Interference Tomography

Anthony Song +5
cs.CV 2026-05-21 reviewed

Counterfactual RL raises video LLM dynamic accuracy
Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

Dazhao Du +9
cs.CV 2026-05-21 reviewed

Vanilla transformer on DINOv2 features hits FID 1.14 on ImageNet
RiT: Vanilla Diffusion Transformers Suffice in Representation Space

Le Zhang +2
cs.CV 2026-05-21 reviewed

LVLMs collect emotional cues in middle layers then translate in deep layers
Interpreting and Enhancing Emotional Circuits in Large Vision-Language Models via Cross-Modal Information Flow

Chengsheng Zhang +3
cs.CV 2026-05-21 reviewed

Video frames close the detection gap between AI images and videos
Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection

Zhengcen Li +6
cs.CV 2026-05-21 reviewed

Foresee-to-Ground stabilizes video grounding via an identify-then-measure evidence pool
Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning for Video Temporal Grounding

Zelin Zheng +6
eess.IV 2026-05-21 reviewed

Dual pretraining ensemble lifts medical image accuracy
Entropy-Guided Self-Supervised Learning for Medical Image Classification

Joao Florindo +1
cs.CV 2026-05-21 reviewed

PDI-Net cuts infrared detection latency by 84 percent
Dual-Integrated Low-Latency Single-Lens Infrared Computational Imaging for Object Detection

Xuquan Wang +9
cs.CV 2026-05-21 reviewed

Bounding box trajectories top pose methods for video anomaly detection
Bounding-Box Trajectories Matter for Video Anomaly Detection

Inpyo Song +1
cs.CV 2026-05-21 reviewed

MLLMs spot correct video timing in prefill but forget during answers
MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

Dazhao Du +7
cs.CV 2026-05-21 reviewed

Video LLMs evolve reasoning from raw clips without labels
EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

Shiqi Huang +5
cs.CV 2026-05-21 reviewed

Visual-advantage distillation outperforms standard methods on VLM benchmarks
Visual-Advantage On-Policy Distillation for Vision-Language Models

Ruiqi Liu +10
cs.CV 2026-05-21 reviewed

VLMs favor SDG priors over evidence on 550k-task benchmark
SDGBiasBench: Benchmarking and Mitigating Vision-Language Models' Biases in Sustainable Development Goals

Zihang Lin +3
cs.CV 2026-05-21 reviewed

MAVEN pipeline annotates 5300 videos so 8B VLM beats Gemini on CCTV reasoning
MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

Han Zhang +4
cs.CV 2026-05-21 reviewed

Network lifts stereo super-resolution via epipolar matching
Multi-Scale Interaction Network for Stereo Image Super-Resolution

Liyi Xu +1
cs.CV 2026-05-21 reviewed

Reward-guided scaling lifts diffusion image rewards by 60%
Guided Trajectory Optimization with Sparse Scaling for Test-Time Diffusion

Gang Dai +4
cs.CV 2026-05-21 reviewed

One CT model matches specialized tools on segmentation to retrieval
Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining

Yuheng Li +7
cs.CV 2026-05-21 reviewed

One CT model matches task-specific results on five task families
Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining

Yuheng Li +7
cs.CV 2026-05-21 reviewed

Gated fusion brings thermal vision to frozen VLMs
Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception

Rusiru Thushara +3
cs.CV 2026-05-21 reviewed

Staged fusion of text, audio, and vision reaches 0.47 emotion correlation
Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction

Dinithi Dissanayake +4
cs.CV 2026-05-21 reviewed

Modular experts resolve gradient conflicts in multi-modal medical pretraining
Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models

Yuting He +2