pith.

archive

Every paper Pith has read. Search by title, abstract, or pith.

9568 papers in cs.CV · page 5

  1. cs.CV 2026-05-21 reviewed
    RL agent learns to plan and execute restoration tool sequences

    OPERA: An Agent for Image Restoration with End-to-End Joint Planning-Execution Optimization

    Feng Zhu +4

  2. cs.CV 2026-05-21 reviewed
    Text embeddings boost ImageNet accuracy by up to 2.7 points

    TextTeacher: What Can Language Teach About Images?

    Tobias Christian Nauen +5

  3. cs.CV 2026-05-21 reviewed
    VISTA raises rare VCE event detection to 0.37 mAP on hidden test

    VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection -- after competition results

    Bo-Cheng Qiu +5

  4. cs.CV 2026-05-21 reviewed
    Latent future scenes improve VLA driving over pixel reconstruction

    LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model

    Xiaodong Mei +5

  5. cs.CV 2026-05-21 reviewed
    GenHAR raises cross-domain HAR accuracy 9.97% with 6.4x fewer operations

    GenHAR: Generalizing Cross-domain Human Activity Recognition for Last-mile Delivery

    Zhiqing Hong +7

  6. cs.CV 2026-05-21 reviewed
    General models gain far more from images than medical ones in licensing exams

    JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation

    Yue Xun +12

  7. cs.AI 2026-05-21 reviewed
    Training-free pooling lifts Video LLM accuracy without retraining

    Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding

    Bingjun Luo +3

  8. cs.CL 2026-05-21 reviewed
    Anchoring attention improves multimodal reasoning with less data

    Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

    Changyuan Tian +9

  9. cs.CV 2026-05-21 reviewed
    Spline-based warp gives accurate start for sparse 3DGS

    TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting

    Hyeseong Kim +3

  10. cs.CV 2026-05-21 reviewed
    Benchmark enables open tree decomposition of images

    COCOTree: A Dataset and Benchmark for Open Tree-Structured Visual Decomposition

    Junhyub Lee +2

  11. cs.CV 2026-05-21 reviewed
    Framework turns 2D heart ultrasounds into accurate 4D models

    Echo4DIR: 4D Implicit Heart Reconstruction from 2D Echocardiography Videos

    Yanan Liu +7

  12. cs.CV 2026-05-21 reviewed
    Multimodal side info sharpens ultra-low bitrate reconstructions

    Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates

    Guojun Xu +5

  13. cs.CV 2026-05-21 reviewed
    Frequency split lets VFX models train with far less data

    EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation

    Yue Ma +11

  14. cs.CV 2026-05-21 reviewed
    Broken artifacts flag memorized images in diffusion models

    Broken Memories: Detecting and Mitigating Memorization in Diffusion Models with Degraded Generations

    Yuanmin Huang +6

  15. cs.CV 2026-05-21 reviewed
    Broken artifacts flag memorized training data in diffusion models

    Broken Memories: Detecting and Mitigating Memorization in Diffusion Models with Degraded Generations

    Yuanmin Huang +6

  16. cs.CV 2026-05-21 reviewed
    Digital twin locates heart scars from ECG and MRI

    Physiology and Anatomy Aware Inverse Inference of Myocardial Infarction for Cardiac Digital Twin

    Mengxiao Wang +8

  17. cs.CV 2026-05-21 reviewed
    BEV maps from RGB-D cut tokens yet raise VLN success rates

    GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation

    Jiahao Yang +6

  18. cs.CV 2026-05-21 reviewed
    Hypernetwork builds on-the-fly LoRA adapters for continual VQA

    HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering

    Yiran Wang +5

  19. cs.CV 2026-05-21 reviewed
    AgroVG benchmark shows top models at 0.35 Set-F1 on farm targets

    AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding

    Haocheng Li +7

  20. cs.CV 2026-05-21 reviewed
    Mamba router splits resident and non-resident evidence for MRI

    SO-Mamba: State-Ownership Mamba for Unrolled MRI Reconstruction

    Pengcheng Fang +5

  21. cs.CV 2026-05-21 reviewed
    ForeSplat trains 3DGS predictors for faster optimizer convergence

    ForeSplat: Optimization-Aware Foresight for Feed-Forward 3D Gaussian Splatting

    Yuke Li +10

  22. cs.CV 2026-05-21 reviewed
    Optimization-aware training makes 3DGS predictions refine faster and better

    ForeSplat: Optimization-Aware Foresight for Feed-Forward 3D Gaussian Splatting

    Yuke Li +10

  23. cs.CV 2026-05-21 reviewed
    Dataset records real flooded roads for self-driving cars

    FRED: A Multi-Modal Autonomous Driving Dataset for Flooded Road Environments

    Connor Malone +2

  24. cs.CV 2026-05-21 reviewed
    Context-guided diffusion plus energy fix yields consistent agent paths

    Diverse Yet Consistent: Context-Guided Diffusion with Energy-Based Joint Refinement for Multi-Agent Motion Prediction

    Lei Chu +1

  25. cs.CV 2026-05-21 reviewed
    Prior outputs double token cuts in video diffusion for 4.5x speedup

    ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration

    Hangyeol Lee +1

  26. cs.CV 2026-05-21 reviewed
    Reasoning paths in training data lift 3D point cloud models

    PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought

    Chaoqi Chen +3

  27. cs.CL 2026-05-21 reviewed
    Latent reasoning beats text CoT for audio-visual tasks

    LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

    Yifan Dai +20

  28. cs.CV 2026-05-21 reviewed
    Output similarities cut token costs in diffusion models

    Rethinking Token Reduction for Diffusion Models via Output-Similarity-Awareness

    Hangyeol Lee +2

  29. cs.CV 2026-05-21 reviewed
    Fractal term sharpens ConvNeXt segmentation on medical images

    ConvNeXt-FD: A Fractal-Based Deep Model for Robust Biomedical Image Segmentation

    Joao Batista Florindo +1

  30. cs.CV 2026-05-21 reviewed
    Method turns BIT phase volumes into realistic 3D H&E stains

    Virtual 3D H&E Staining from Phase-contrast Back-illumination Interference Tomography

    Anthony Song +5

  31. cs.CV 2026-05-21 reviewed
    Counterfactual RL raises video LLM dynamic accuracy

    Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

    Dazhao Du +9

  32. cs.CV 2026-05-21 reviewed
    Vanilla transformer on DINOv2 features hits FID 1.14 on ImageNet

    RiT: Vanilla Diffusion Transformers Suffice in Representation Space

    Le Zhang +2

  33. cs.CV 2026-05-21 reviewed
    LVLMs collect emotional cues in middle layers then translate in deep layers

    Interpreting and Enhancing Emotional Circuits in Large Vision-Language Models via Cross-Modal Information Flow

    Chengsheng Zhang +3

  34. cs.CV 2026-05-21 reviewed
    Video frames close the detection gap between AI images and videos

    Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection

    Zhengcen Li +6

  35. cs.CV 2026-05-21 reviewed
    Identify-then-measure evidence pool stabilizes video grounding

    Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning for Video Temporal Grounding

    Zelin Zheng +6

  36. eess.IV 2026-05-21 reviewed
    Dual pretraining ensemble lifts medical image accuracy

    Entropy-Guided Self-Supervised Learning for Medical Image Classification

    Joao Florindo +1

  37. cs.CV 2026-05-21 reviewed
    PDI-Net cuts infrared detection latency by 84 percent

    Dual-Integrated Low-Latency Single-Lens Infrared Computational Imaging for Object Detection

    Xuquan Wang +9

  38. cs.CV 2026-05-21 reviewed
    Bounding box trajectories top pose methods for video anomaly detection

    Bounding-Box Trajectories Matter for Video Anomaly Detection

    Inpyo Song +1

  39. cs.CV 2026-05-21 reviewed
    MLLMs spot correct video timing in prefill but forget during answers

    MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

    Dazhao Du +7

  40. cs.CV 2026-05-21 reviewed
    Video LLMs evolve reasoning from raw clips without labels

    EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

    Shiqi Huang +5

  41. cs.CV 2026-05-21 reviewed
    Visual-advantage distillation outperforms standard methods on VLM benchmarks

    Visual-Advantage On-Policy Distillation for Vision-Language Models

    Ruiqi Liu +10

  42. cs.CV 2026-05-21 reviewed
    VLMs favor SDG priors over evidence on 550k-task benchmark

    SDGBiasBench: Benchmarking and Mitigating Vision-Language Models' Biases in Sustainable Development Goals

    Zihang Lin +3

  43. cs.CV 2026-05-21 reviewed
    MAVEN pipeline annotates 5300 videos so 8B VLM beats Gemini on CCTV reasoning

    MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

    Han Zhang +4

  44. cs.CV 2026-05-21 reviewed
    Network lifts stereo super-resolution via epipolar matching

    Multi-scale interaction network for stereo image super-resolution

    Liyi Xu +1

  45. cs.CV 2026-05-21 reviewed
    Reward-guided scaling lifts diffusion image rewards by 60%

    Guided Trajectory Optimization with Sparse Scaling for Test-Time Diffusion

    Gang Dai +4

  46. cs.CV 2026-05-21 reviewed
    One CT model matches specialized tools on segmentation to retrieval

    Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining

    Yuheng Li +7

  47. cs.CV 2026-05-21 reviewed
    One CT model matches task-specific results on five task families

    Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining

    Yuheng Li +7

  48. cs.CV 2026-05-21 reviewed
    Gated fusion brings thermal vision to frozen VLMs

    Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception

    Rusiru Thushara +3

  49. cs.CV 2026-05-21 reviewed
    Staged fusion of text, audio, and vision reaches 0.47 emotion correlation

    Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction

    Dinithi Dissanayake +4

  50. cs.CV 2026-05-21 reviewed
    Modular experts resolve gradient conflicts in multi-modal medical pretraining

    Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models

    Yuting He +2