pith.

archive

Every paper Pith has read. Search by title, abstract, or pith.

378 papers in cs.MM · page 4

  1. cs.CV 2026-04-22 reviewed
    Human critique of structured video descriptions beats Gemini

    Building a Precise Video Language with Human-AI Oversight

    Zhiqiu Lin +15

  2. cs.CV 2026-04-22 reviewed
    One framework unifies three zero-shot visual retrieval tasks

    UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval

    Haokun Wen +5

  3. cs.MM 2026-04-22 reviewed
    Joint enlargement boosts micro-video popularity forecasts

    Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction

    Dali Wang +5

  4. cs.MM 2026-04-22 reviewed
    Closed-loop control hits 2% bitrate error for learned video codecs

    Feedback-Driven Rate Control for Learned Video Compression

    Zhiheng Xu +3

  5. cs.CV 2026-04-21 reviewed
    Verifiable rewards lift LLM slide quality with just 5K examples

    AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards

    Yiming Pan +8

  6. cs.MM 2026-04-21 reviewed
    Smiles boost emotional valence after negative moments in trauma testimonies

    Smiling Regulates Emotion During Traumatic Recollection

    Marcus Ma +9

  7. cs.CV 2026-04-21 reviewed
    AutoAWG halves error in adverse weather video generation

    AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos

    Jiagao Hu +8

  8. cs.MM 2026-04-20 reviewed
    SMPL-X priors enable real-time 3D human reconstruction with fine details

    High-Fidelity 3D Gaussian Human Reconstruction via Region-Aware Initialization and Geometric Priors

    Yang Liu +1

  9. cs.CV 2026-04-20 reviewed
    Hybrid transformer-diffusion model recovers 3D bodies under occlusion

    Discriminative-Generative Synergy for Occlusion Robust 3D Human Mesh Recovery

    Yang Liu +1

  10. cs.CV 2026-04-20 reviewed
    3D adapters add geometric and physical awareness to VLMs for embodied tasks

    XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

    Kangan Qian +15

  11. cs.CL 2026-04-20 reviewed
    Retrieval model aligns narratives across instances to spot fake news

    Retrieval-Augmented Multimodal Model for Fake News Detection

    Yiheng Li +3

  12. cs.CV 2026-04-19 reviewed
    Q-Gate routes video keyframes by query to cut modality noise

    Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding

    Shaoguang Wang +4

  13. cs.CV 2026-04-17 reviewed
    Merged single-modality traces yield SOTA AV reasoners

    AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers

    Edson Araujo +7

  14. cs.MM 2026-04-17 reviewed
    Model catches multimodal fake news by tracking narrative changes

    MOMENTA: Mixture-of-Experts Over Multimodal Embeddings with Neural Temporal Aggregation for Misinformation Detection

    Yeganeh Abdollahinejad +6

  15. cs.CV 2026-04-17 reviewed
    The paper proposes SIMMER, a single MLLM-based model that embeds food images and recipe…

    SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding

    Keisuke Gomi +1

  16. cs.MM 2026-04-16 reviewed
    8B model beats Gemini-2.5-Pro on video script creation

    MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production

    Huanran Hu +6

  17. cs.MM 2026-04-16 reviewed
    ControlFoley resolves conflicts to control video-to-audio output

    ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

    Jianxuan Yang +12

  18. cs.CV 2026-04-16 reviewed
    Retrieval picks tools for multimodal queries without retraining

    RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

    Gabriele Mattioli +5

  19. cs.CV 2026-04-16 reviewed
    Crowdsourced data yields benchmark for video saliency prediction

    NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results

    Andrey Moskalenko +42

  20. cs.SD 2026-04-16 reviewed
    Hybrid model grounds audio perception before reasoning

    Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding

    Jieyi Wang +3

  21. cs.MM 2026-04-16 reviewed
    Framework maps satellite images to realistic soundscapes

    Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

    Kunlin Wu +8

  22. cs.CV 2026-04-16 reviewed
    One-step model generates talking avatars 120 times faster

    TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation

    Xiangyu Liu +6

  23. cs.LG 2026-04-15 reviewed
    Station data guides radar attention to improve local rain forecasts

    M3R: Localized Rainfall Nowcasting with Meteorology-Informed MultiModal Attention

    Sanjeev Panta +4

  24. cs.CV 2026-04-15 reviewed
    One model generates and edits human-object interactions

    OneHOI: Unifying Human-Object Interaction Generation and Editing

    Jiun Tian Hoe +4

  25. cs.CR 2026-04-15 reviewed
    AI fakes spread via likes, not comments, and fool detectors more each year

    The Synthetic Media Shift: Tracking the Rise, Virality, and Detectability of AI-Generated Multimodal Misinformation

    Zacharias Chrysidis +2

  26. cs.CY 2026-04-15 reviewed
    ANVIL creates analogy animations rated adequate by educators

    ANVIL: Analogies and Videos for Lecturers

    Yuri Noviello +2

  27. cs.CV 2026-04-15 reviewed
    Survey urges target-based tests to make AI image generators fairer

    Operationalizing Fairness in Text-to-Image Models: A Survey of Bias, Fairness Audits and Mitigation Strategies

    Megan Smith +5

  28. cs.MM 2026-04-15 reviewed
    Benchmark tests AI on spotting audio-visual conflicts in videos

    AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction

    Zixuan Chen +8

  29. cs.CV 2026-04-14 reviewed
    3D priors lift geo-localization accuracy in new domains and weather

    GeoLink: A 3D-Aware Framework Towards Better Generalization in Cross-View Geo-Localization

    Hongyang Zhang +6

  30. cs.SD 2026-04-14 reviewed
    Training objective curbs audio-model timestamp hallucinations

    SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding

    Luoyi Sun +5

  31. cs.CV 2026-04-14 reviewed
    Frozen MLLM plus tiny branch matches full VQA retraining

    DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment

    Xinyue Li +5

  32. cs.CV 2026-04-14 reviewed
    Deepfake detectors miss listening reactions

    Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis

    Miao Liu +3

  33. cs.AI 2026-04-14 reviewed
    Memory-augmented agents jailbreak VLMs on natural images

    Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs

    Jianhao Chen +4

  34. cs.CV 2026-04-14 reviewed
    Video LLMs reach only 71.58% on new esports video benchmark

    EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports

    Jianzhe Ma +5

  35. cs.CV 2026-04-14 reviewed
    Text and elevation data sharpen satellite maps of terraced farmland

    GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality

    Zhiwei Zhang +9

  36. cs.SD 2026-04-14 reviewed
    Staged guidance in diffusion model yields precise movie dubbing

    CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing

    Gaoxiang Cong +6

  37. cs.HC 2026-04-13 reviewed
    Speech with sketches improves AI design intent match

    When Drawing Is Not Enough: Exploring Spontaneous Speech with Sketch for Intent Alignment in Multimodal LLMs

    Weiyan Shi +2

  38. cs.RO 2026-04-13 reviewed
    Drift-aware quantization keeps robot VLAs accurate at low bits

    DA-PTQ: Drift-Aware Post-Training Quantization for Efficient Vision-Language-Action Models

    Siyuan Xu +4

  39. cs.HC 2026-04-13 reviewed
    Five synced streams adapt XR de-escalation training

    From Multimodal Signals to Adaptive XR Experiences for De-escalation Training

    Birgit Nierula +6

  40. cs.CV 2026-04-13 reviewed
    Feedforward network renders novel views in real time from sparse cameras

    3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis

    Stefan Schulz +4

  41. cs.CV 2026-04-13 reviewed
    LLM knowledge hierarchy lifts image clustering on 14 of 20 datasets

    Hierarchical Textual Knowledge for Enhanced Image Clustering

    Yijie Zhong +3

  42. cs.CV 2026-04-13 reviewed
    8B audio-visual model rivals Gemini on cinematic scripts

    OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

    Junfu Pu +3

  43. cs.SD 2026-04-12 reviewed
    One model unifies audio generation and editing across sound, music, and speech

    Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

    Zeyue Tian +10

  44. cs.CV 2026-04-12 reviewed
    New dataset pairs real clean video with synthetic rain and snow

    LoViF 2026 The First Challenge on Weather Removal in Videos

    Chenghao Qian +25

  45. cs.SD 2026-04-12 reviewed
    Synthetic labels keep music-flavor structure intact

    Multimodal Dataset Normalization and Perceptual Validation for Music-Taste Correspondences

    Matteo Spanio +2

  46. eess.IV 2026-04-12 reviewed
    Graph saliency priors from fMRI sharpen brain image reconstructions

    Brain-Grasp: Graph-based Saliency Priors for Improved fMRI-based Visual Brain Decoding

    Mohammad Moradi +3

  47. cs.AI 2026-04-11 reviewed
    LLMs pick financial tools well but reason poorly from results

    FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

    Yupeng Cao +13

  48. cs.MM 2026-04-10 reviewed
    MRI change vectors sharpen epilepsy surgery forecasts

    Neuro-Oracle: A Trajectory-Aware Agentic RAG Framework for Interpretable Epilepsy Surgical Prognosis

    Aizierjiang Aiersilan +1

  49. cs.CV 2026-04-10 reviewed
    Text priors boost stereo volume estimates

    Not Your Stereo-Typical Estimator: Combining Vision and Language for Volume Perception

    Gautham Vinod +3

  50. eess.IV 2026-04-10 reviewed
    Multi-task JRD model boosts VCM by 3.86% BD-mAP

    Multi-task Just Recognizable Difference for Video Coding for Machines: Database, Model, and Coding Application

    Junqi Liu +4