archive

Every paper Pith has read.

9568 papers in cs.CV · page 3

  1. cs.CV 2026-05-21 reviewed
    Benchmark shows MLLMs fail on 16-minute continuous video reasoning

    VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

    Haichen He +5

  2. cs.CV 2026-05-21 reviewed
    Projector fix lifts Video-LLM motion direction accuracy from 26% to 85%

    Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs

    Jongseo Lee +5

  3. cs.CV 2026-05-21 reviewed
    Camera pose tokens lift video model spatial scores 4.5-6.5%

    Cambrian-P: Pose-Grounded Video Understanding

    Jihan Yang +7

  4. cs.CV 2026-05-21 reviewed
    Reasoning adds secondary motions for natural video

    MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

    Lee Hsin-Ying +5

  5. cs.RO 2026-05-21 reviewed
    Self-awareness module improves language-guided navigation

    AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

    Wenxuan Guo +9

  6. cs.RO 2026-05-21 reviewed
    Gestures raise robot object selection accuracy in cluttered scenes

    GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

    Wenxuan Guo +9

  7. cs.CV 2026-05-21 reviewed
    Dashcam videos turned into full AV multi-sensor data

    Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving

    Jiahao Wang +14

  8. cs.CV 2026-05-21 reviewed
    Metro suicide risk scored from video by tracking and heatmaps

    Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations

    Safwen Naimi +3

  9. cs.CV 2026-05-21 reviewed
    VLMs keep high scores after most image tokens are deleted

    Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?

    Zixuan Lan +3

  10. cs.CV 2026-05-21 reviewed
    Queries raise PSNR by 3.6 dB and cut convergence time by 3x in frozen autoencoders

    DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

    Tianhang Wang +5

  11. cs.CV 2026-05-21 reviewed
    Synthetic faces alone match real data for rare pediatric disease AI

    Synthetic Data Alone is Enough? Rethinking Data Scarcity in Pediatric Rare Disease Recognition

    Ganlin Feng +6

  12. cs.CV 2026-05-21 reviewed
    Generated images show anomalous ultra-high-frequency spectral uplift

    Spectral Tail Auxiliary Learning for AI-Generated Image Detection

    Xingyi Li +4

  13. cs.CV 2026-05-21 reviewed
    Retrieval keeps video worlds consistent at double speed

    WorldKV: Efficient World Memory with World Retrieval and Compression

    Jung Yi +5

  14. cs.CV 2026-05-21 reviewed
    Simulated dense placements train IMU model that ignores sensor setup

    AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild

    Baiyu Chen +7

  15. cs.CV 2026-05-21 reviewed
    Multiview cues and orientation prompts lift zero-shot action recognition

    Cross-Domain Human Action Recognition from Multiview Motion and Textual Descriptions

    Yannick Porto +3

  16. cs.CV 2026-05-21 reviewed
    Synthetic viewpoints plus state-space encoding boost action detection

    Improving Viewpoint-Invariance and Temporal Consistency for Action Detection

    Yannick Porto +3

  17. cs.CV 2026-05-21 reviewed
    Disentangling vision-language embeddings without added dimensions

    Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models

    Piotr Kubaty +5

  18. cs.CV 2026-05-21 reviewed
    Taylor expansion picks surprising frames in long videos

    Swift Sampling: Selecting Temporal Surprises via Taylor Series

    Dahye Kim +5

  19. cs.CV 2026-05-21 reviewed
    One ConvNeXt model serves many compute budgets

    Slimmable ConvNeXt: Width-Adaptive Inference for Efficient Multi-Device Deployment

    Janek Haberer +2

  20. cs.CV 2026-05-21 reviewed
    Coherent behavior vectors let VLA models match top results with half the data

    From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model

    Bing Hu +6

  21. cs.CV 2026-05-21 reviewed
    SEGA adapts attention scaling to latent frequencies for higher-res DiT outputs

    SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

    Javad Rajabi +4

  22. cs.CV 2026-05-21 reviewed
    Sparse autoencoder links reasoning steps to image masks

    SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation

    Zhenyu Lu +6

  23. cs.CL 2026-05-21 reviewed
    Images boost LLM poetry detectors past RoBERTa

    Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs

    Shanshan Wang +8

  24. cs.CV 2026-05-21 reviewed
    Nonce substitutions rank captions for better VL data selection

    What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining

    Hyejin Go +2

  25. cs.CV 2026-05-21 reviewed
    Causal model matches age changes in spine DXA images

    From Baseline to Follow-Up: Counterfactual Spine DXA Image Synthesis in UK Biobank Using a Causal Hierarchical Variational Autoencoder

    Yilin Zhang +3

  26. cs.LG 2026-05-21 reviewed
    CAME-Grad optimizer lifts radiology reports by 2 percent

    The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution

    Erjian Zhang +3

  27. cs.LG 2026-05-21 reviewed
    CAME-Grad fixes gradient double dilemma in report generation

    The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution

    Erjian Zhang +3

  28. cs.CV 2026-05-21 reviewed
    Five functional body clusters improve full-pose reconstruction from head and hands

    AtomicMotion: Learning Human Motion From Different Human Parts

    Runzhen Liu +2

  29. cs.CV 2026-05-21 reviewed
    Physics priors train dense human scene flow from monocular video

    H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning

    Zhanbo Huang +2

  30. cs.CV 2026-05-21 reviewed
    Graph reasoning turns radiology reports into precise 3D lesion maps

    GLeVE: Graph-Guided Lesion Grounding with Proposal Verification in 3D CT

    Shuo Jiang +11

  31. cs.CV 2026-05-21 reviewed
    Head-conditioned LoRA lifts gaze following on non-salient targets

    Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following

    Shijing Wang +6

  32. cs.RO 2026-05-21 reviewed
    Dual-interval motion cues decouple ego-motion for UAV detection

    Decoupling Ego-Motion from Target Dynamics via Dual-Interval Motion Cues for UAV Detection

    Liuyang Wang +1

  33. cs.CV 2026-05-21 reviewed
    No single noisy-label method wins for frozen vision models

    Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure

    Zitong Li +1

  34. cs.CV 2026-05-21 reviewed
    3D reconstruction turns floorplan localization into alignment task

    SceneAligner: 3D-Grounded Floorplan Localization in the Wild

    Junhyeong Cho +2

  35. cs.CV 2026-05-21 reviewed
    New metric shows detection limits online map accuracy

    Beyond Chamfer Distance: Granular Order-aware Evaluation Metric For Online Mapping

    Chouaib Bencheikh Lehocine +3

  36. cs.CV 2026-05-21 reviewed
    Attention maps for tumor sub-regions come free in one lightweight 3D model

    SegGuidedNet: Sub-Region-Aware Attention Supervision for Interpretable Brain Tumor Segmentation

    Hasaan Maqsood +4

  37. cs.CV 2026-05-21 reviewed
    Generative models create controlled videos to test MLLM spatio-temporal reasoning

    VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

    Jinho Park +3

  38. cs.CV 2026-05-21 reviewed
    Fourier shape descriptors create time-consistent cell phantom videos

    Cell Phantom Video Generation in Elliptical Fourier Descriptor Domain

    Francesco Benedetto +3

  39. cs.CV 2026-05-21 reviewed
    Geometry must ground visual tokens before reasoning starts

    GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

    Deshui Miao +5

  40. cs.CV 2026-05-21 reviewed
    Unified model handles many fashion search types at once

    FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning

    Haokun Wen +5

  41. cs.CV 2026-05-21 reviewed
    Multimodal data improves two-wheeler rider behavior recognition

    MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding

    Varun A. Paturkar +2

  42. cs.CV 2026-05-21 reviewed
    Similar cases form graphs that refine medical image diagnoses

    Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement

    Yiming Xu +5

  43. cs.CV 2026-05-21 reviewed
    Motion and geometry cues boost SAM 2 tracking on nonlinear scenarios

    Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking

    Deyi Zhu +6

  44. cs.CV 2026-05-21 reviewed
    Degraded images break spatial reasoning in current AI

    SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

    Xiaolong Zhou +10

  45. cs.AI 2026-05-21 reviewed
    Latent sharing speeds up collaborative driving coordination

    LACO: Adaptive Latent Communication for Collaborative Driving

    Tianhao Chen +2

  46. cs.CV 2026-05-21 reviewed
    Training-free method segments fine-grained fungi without retraining

    Training-Free Fine-Grained Semantic Segmentations in Low Data Regimes: A FungiTastic Baseline

    Sebastian Cavada +2

  47. cs.CV 2026-05-21 reviewed
    Discarded classifier weights act as semantic anchors

    Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling

    David Méndez +2

  48. cs.CV 2026-05-21 reviewed
    Multi-agent self-evolution sets SOTA on image retrieval benchmarks

    DeliCIR: Deliberative Test-Time Evolutionary Hierarchical Multi-Agents for Composed Image Retrieval

    Xingtian Pei +6

  49. cs.CV 2026-05-21 reviewed
    Masked metric improves agreement with humans on concept fidelity

    MaSC: A Masked Similarity Metric for Evaluating Concept-Driven Generation

    Patryk Bartkowiak +5

  50. cs.CV 2026-05-21 reviewed
    Fused geometry and appearance metric predicts synthetic data transfer

    SADGE: Structure and Appearance Domain Gap Estimation of Synthetic and Real Data

    Patryk Bartkowiak +4