archive
Every paper Pith has read. Search by title, abstract, or pith.
9568 papers in cs.CV · page 5
-
RL agent learns to plan and execute restoration tool sequences
OPERA: An Agent for Image Restoration with End-to-End Joint Planning-Execution Optimization
-
Text embeddings boost ImageNet accuracy by up to 2.7 points
TextTeacher: What Can Language Teach About Images?
-
VISTA raises rare VCE event detection to 0.37 mAP on hidden test
VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection -- after competition results
-
Latent future scenes improve VLA driving over pixel reconstruction
LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model
-
GenHAR raises cross-domain HAR accuracy 9.97% with 6.4x fewer operations
GenHAR: Generalizing Cross-domain Human Activity Recognition for Last-mile Delivery
-
General models gain far more from images than medical ones in licensing exams
JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation
-
Training-free pooling lifts Video LLM accuracy without retraining
Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding
-
Anchoring attention improves multimodal reasoning with less data
Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention
-
Spline-based warp gives accurate start for sparse 3DGS
TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting
-
Benchmark enables open tree decomposition of images
COCOTree: A Dataset and Benchmark for Open Tree-Structured Visual Decomposition
-
Framework turns 2D heart ultrasounds into accurate 4D models
Echo4DIR: 4D Implicit Heart Reconstruction from 2D Echocardiography Videos
-
Multimodal side info sharpens ultra-low bitrate reconstructions
Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates
-
Frequency split lets VFX models train with far less data
EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation
-
Broken artifacts flag memorized images in diffusion models
Broken Memories: Detecting and Mitigating Memorization in Diffusion Models with Degraded Generations
-
Broken artifacts flag memorized training data in diffusion models
Broken Memories: Detecting and Mitigating Memorization in Diffusion Models with Degraded Generations
-
Digital twin locates heart scars from ECG and MRI
Physiology and Anatomy Aware Inverse Inference of Myocardial Infarction for Cardiac Digital Twin
-
BEV maps from RGB-D cut tokens yet raise VLN success rates
GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation
-
Hypernetwork builds on-the-fly LoRA adapters for continual VQA
HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering
-
AgroVG benchmark shows top models at 0.35 Set-F1 on farm targets
AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding
-
Mamba router splits resident and non-resident evidence for MRI
SO-Mamba: State-Ownership Mamba for Unrolled MRI Reconstruction
-
ForeSplat trains 3DGS predictors for faster optimizer convergence
ForeSplat: Optimization-Aware Foresight for Feed-Forward 3D Gaussian Splatting
-
Optimization-aware training makes 3DGS predictions refine faster and better
ForeSplat: Optimization-Aware Foresight for Feed-Forward 3D Gaussian Splatting
-
Dataset records real flooded roads for self-driving cars
FRED: A Multi-Modal Autonomous Driving Dataset for Flooded Road Environments
-
Context-guided diffusion plus energy fix yields consistent agent paths
Diverse Yet Consistent: Context-Guided Diffusion with Energy-Based Joint Refinement for Multi-Agent Motion Prediction
-
Prior outputs double token cuts in video diffusion for 4.5x speedup
ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration
-
Reasoning paths in training data lift 3D point cloud models
PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought
-
Latent reasoning beats text CoT for audio-visual tasks
LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
-
Output similarities cut token costs in diffusion models
Rethinking Token Reduction for Diffusion Models via Output-Similarity-Awareness
-
Fractal term sharpens ConvNeXt segmentation on medical images
ConvNeXt-FD: A Fractal-Based Deep Model for Robust Biomedical Image Segmentation
-
Method turns BIT phase volumes into realistic 3D H&E stains
Virtual 3D H&E Staining from Phase-contrast Back-illumination Interference Tomography
-
Counterfactual RL raises video LLM dynamic accuracy
Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning
-
Vanilla transformer on DINOv2 features hits FID 1.14 on ImageNet
RiT: Vanilla Diffusion Transformers Suffice in Representation Space
-
LVLMs collect emotional cues in middle layers then translate in deep layers
Interpreting and Enhancing Emotional Circuits in Large Vision-Language Models via Cross-Modal Information Flow
-
Video frames close the detection gap between AI images and videos
Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection
-
Identify-then-measure evidence pool stabilizes video grounding
Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning for Video Temporal Grounding
-
Dual pretraining ensemble lifts medical image accuracy
Entropy-Guided Self-Supervised Learning for Medical Image Classification
-
PDI-Net cuts infrared detection latency by 84 percent
Dual-Integrated Low-Latency Single-Lens Infrared Computational Imaging for Object Detection
-
Bounding box trajectories top pose methods for video anomaly detection
Bounding-Box Trajectories Matter for Video Anomaly Detection
-
MLLMs spot correct video timing in prefill but forget during answers
MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues
-
Video LLMs evolve reasoning from raw clips without labels
EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models
-
Visual-advantage distillation outperforms standard methods on VLM benchmarks
Visual-Advantage On-Policy Distillation for Vision-Language Models
-
VLMs favor SDG priors over evidence on 550k-task benchmark
SDGBiasBench: Benchmarking and Mitigating Vision-Language Models' Biases in Sustainable Development Goals
-
MAVEN pipeline annotates 5300 videos so 8B VLM beats Gemini on CCTV reasoning
MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks
-
Network lifts stereo super-resolution via epipolar matching
Multi-scale interaction network for stereo image super-resolution
-
Reward-guided scaling lifts diffusion image rewards by 60%
Guided Trajectory Optimization with Sparse Scaling for Test-Time Diffusion
-
One CT model matches specialized tools on tasks from segmentation to retrieval
Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining
-
One CT model matches task-specific results on five task families
Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining
-
Gated fusion brings thermal vision to frozen VLMs
Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception
-
Staged fusion of text, audio, and vision reaches 0.47 emotion correlation
Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction
-
Modular experts resolve gradient conflicts in multi-modal medical pretraining
Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models