archive

Every paper Pith has read.

378 papers in cs.MM · page 4

cs.CV 2026-04-22 reviewed

Human critique of structured video descriptions beats Gemini
Building a Precise Video Language with Human-AI Oversight

Zhiqiu Lin +15
cs.CV 2026-04-22 reviewed

One framework unifies three zero-shot visual retrieval tasks
UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval

Haokun Wen +5
cs.MM 2026-04-22 reviewed

Joint enlargement boosts micro-video popularity forecasts
Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction

Dali Wang +5
cs.MM 2026-04-22 reviewed

Closed-loop control hits 2% bitrate error for learned video codecs
Feedback-Driven Rate Control for Learned Video Compression

Zhiheng Xu +3
cs.CV 2026-04-21 reviewed

Verifiable rewards lift LLM slide quality with just 5K examples
AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards

Yiming Pan +8
cs.MM 2026-04-21 reviewed

Smiles boost emotional valence after negative moments in trauma testimonies
Smiling Regulates Emotion During Traumatic Recollection

Marcus Ma +9
cs.CV 2026-04-21 reviewed

AutoAWG halves error in adverse weather video generation
AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos

Jiagao Hu +8
cs.MM 2026-04-20 reviewed

SMPL-X priors enable real-time 3D human reconstruction with fine details
High-Fidelity 3D Gaussian Human Reconstruction via Region-Aware Initialization and Geometric Priors

Yang Liu +1
cs.CV 2026-04-20 reviewed

Hybrid transformer-diffusion model recovers 3D bodies under occlusion
Discriminative-Generative Synergy for Occlusion Robust 3D Human Mesh Recovery

Yang Liu +1
cs.CV 2026-04-20 reviewed

3D adapters add geometric and physical awareness to VLMs for embodied tasks
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

Kangan Qian +15
cs.CL 2026-04-20 reviewed

Retrieval model aligns narratives across instances to spot fake news
Retrieval-Augmented Multimodal Model for Fake News Detection

Yiheng Li +3
cs.CV 2026-04-19 reviewed

Q-Gate routes video keyframes by query to cut modality noise
Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding

Shaoguang Wang +4
cs.CV 2026-04-17 reviewed

Merged single-modality traces yield SOTA AV reasoners
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers

Edson Araujo +7
cs.MM 2026-04-17 reviewed

Model catches multimodal fake news by tracking narrative changes
MOMENTA: Mixture-of-Experts Over Multimodal Embeddings with Neural Temporal Aggregation for Misinformation Detection

Yeganeh Abdollahinejad +6
cs.CV 2026-04-17 reviewed

The paper proposes SIMMER, a single MLLM-based model that embeds food images and recipe…
SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding

Keisuke Gomi +1
cs.MM 2026-04-16 reviewed

8B model beats Gemini-2.5-Pro on video script creation
MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production

Huanran Hu +6
cs.MM 2026-04-16 reviewed

ControlFoley resolves conflicts to control video-to-audio output
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

Jianxuan Yang +12
cs.CV 2026-04-16 reviewed

Retrieval picks tools for multimodal queries without retraining
RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

Gabriele Mattioli +5
cs.CV 2026-04-16 reviewed

Crowdsourced data yields benchmark for video saliency prediction
NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results

Andrey Moskalenko +42
cs.SD 2026-04-16 reviewed

Hybrid model grounds audio perception before reasoning
Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding

Jieyi Wang +3
cs.MM 2026-04-16 reviewed

Framework maps satellite images to realistic soundscapes
Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

Kunlin Wu +8
cs.CV 2026-04-16 reviewed

One-step model generates talking avatars 120 times faster
TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation

Xiangyu Liu +6
cs.LG 2026-04-15 reviewed

Station data guides radar attention to improve local rain forecasts
M3R: Localized Rainfall Nowcasting with Meteorology-Informed MultiModal Attention

Sanjeev Panta +4
cs.CV 2026-04-15 reviewed

One model generates and edits human-object interactions
OneHOI: Unifying Human-Object Interaction Generation and Editing

Jiun Tian Hoe +4
cs.CR 2026-04-15 reviewed

AI fakes spread via likes, not comments, and fool detectors more each year
The Synthetic Media Shift: Tracking the Rise, Virality, and Detectability of AI-Generated Multimodal Misinformation

Zacharias Chrysidis +2
cs.CY 2026-04-15 reviewed

ANVIL creates analogy animations rated adequate by educators
ANVIL: Analogies and Videos for Lecturers

Yuri Noviello +2
cs.CV 2026-04-15 reviewed

Survey urges target-based tests to make AI image generators fairer
Operationalizing Fairness in Text-to-Image Models: A Survey of Bias, Fairness Audits and Mitigation Strategies

Megan Smith +5
cs.MM 2026-04-15 reviewed

Benchmark tests AI on spotting audio-visual conflicts in videos
AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction

Zixuan Chen +8
cs.CV 2026-04-14 reviewed

3D priors lift geo-localization accuracy in new domains and weather
GeoLink: A 3D-Aware Framework Towards Better Generalization in Cross-View Geo-Localization

Hongyang Zhang +6
cs.SD 2026-04-14 reviewed

Training objective curbs audio-model timestamp hallucinations
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding

Luoyi Sun +5
cs.CV 2026-04-14 reviewed

Frozen MLLM plus tiny branch matches full VQA retraining
DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment

Xinyue Li +5
cs.CV 2026-04-14 reviewed

Deepfake detectors miss listening reactions
Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis

Miao Liu +3
cs.AI 2026-04-14 reviewed

Memory-augmented agents jailbreak VLMs on natural images
Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs

Jianhao Chen +4
cs.CV 2026-04-14 reviewed

Video LLMs reach only 71.58% on new esports video benchmark
EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports

Jianzhe Ma +5
cs.CV 2026-04-14 reviewed

Text and elevation data sharpen satellite maps of terraced farmland
GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality

Zhiwei Zhang +9
cs.SD 2026-04-14 reviewed

Staged guidance in diffusion model yields precise movie dubbing
CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing

Gaoxiang Cong +6
cs.HC 2026-04-13 reviewed

Speech with sketches improves AI design intent match
When Drawing Is Not Enough: Exploring Spontaneous Speech with Sketch for Intent Alignment in Multimodal LLMs

Weiyan Shi +2
cs.RO 2026-04-13 reviewed

Drift-aware quantization keeps robot VLAs accurate at low bits
DA-PTQ: Drift-Aware Post-Training Quantization for Efficient Vision-Language-Action Models

Siyuan Xu +4
cs.HC 2026-04-13 reviewed

Five synced streams adapt XR de-escalation training
From Multimodal Signals to Adaptive XR Experiences for De-escalation Training

Birgit Nierula +6
cs.CV 2026-04-13 reviewed

Feedforward network renders novel views in real time from sparse cameras
3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis

Stefan Schulz +4
cs.CV 2026-04-13 reviewed

LLM knowledge hierarchy lifts image clustering on 14 of 20 datasets
Hierarchical Textual Knowledge for Enhanced Image Clustering

Yijie Zhong +3
cs.CV 2026-04-13 reviewed

8B audio-visual model rivals Gemini on cinematic scripts
OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

Junfu Pu +3
cs.SD 2026-04-12 reviewed

One model unifies audio generation and editing across sound, music, and speech
Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

Zeyue Tian +10
cs.CV 2026-04-12 reviewed

New dataset pairs real clean video with synthetic rain and snow
LoViF 2026 The First Challenge on Weather Removal in Videos

Chenghao Qian +25
cs.SD 2026-04-12 reviewed

Synthetic labels keep music-flavor structure intact
Multimodal Dataset Normalization and Perceptual Validation for Music-Taste Correspondences

Matteo Spanio +2
eess.IV 2026-04-12 reviewed

Graph saliency priors from fMRI sharpen brain image reconstructions
Brain-Grasp: Graph-based Saliency Priors for Improved fMRI-based Visual Brain Decoding

Mohammad Moradi +3
cs.AI 2026-04-11 reviewed

LLMs pick financial tools well but reason poorly from results
FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

Yupeng Cao +13
cs.MM 2026-04-10 reviewed

MRI change vectors sharpen epilepsy surgery forecasts
Neuro-Oracle: A Trajectory-Aware Agentic RAG Framework for Interpretable Epilepsy Surgical Prognosis

Aizierjiang Aiersilan +1
cs.CV 2026-04-10 reviewed

Text priors boost stereo volume estimates
Not Your Stereo-Typical Estimator: Combining Vision and Language for Volume Perception

Gautham Vinod +3
eess.IV 2026-04-10 reviewed

Multi-task JRD model boosts VCM by 3.86% BD-mAP
Multi-task Just Recognizable Difference for Video Coding for Machines: Database, Model, and Coding Application

Junqi Liu +4