archive
Every paper Pith has read.
378 papers in cs.MM · page 4
-
Human critique of structured video descriptions beats Gemini
Building a Precise Video Language with Human-AI Oversight
-
One framework unifies three zero-shot visual retrieval tasks
UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval
-
Joint enlargement boosts micro-video popularity forecasts
Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction
-
Closed-loop control hits 2% bitrate error for learned video codecs
Feedback-Driven Rate Control for Learned Video Compression
-
Verifiable rewards lift LLM slide quality with just 5K examples
AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards
-
Smiles boost emotional valence after negative moments in trauma testimonies
Smiling Regulates Emotion During Traumatic Recollection
-
AutoAWG halves error in adverse weather video generation
AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos
-
SMPL-X priors enable real-time 3D human reconstruction with fine details
High-Fidelity 3D Gaussian Human Reconstruction via Region-Aware Initialization and Geometric Priors
-
Hybrid transformer-diffusion model recovers 3D bodies under occlusion
Discriminative-Generative Synergy for Occlusion Robust 3D Human Mesh Recovery
-
3D adapters add geometric and physical awareness to VLMs for embodied tasks
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
-
Retrieval model aligns narratives across instances to spot fake news
Retrieval-Augmented Multimodal Model for Fake News Detection
-
Q-Gate routes video keyframes by query to cut modality noise
Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding
-
Merged single-modality traces yield SOTA AV reasoners
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
-
Model catches multimodal fake news by tracking narrative changes
MOMENTA: Mixture-of-Experts Over Multimodal Embeddings with Neural Temporal Aggregation for Misinformation Detection
-
The paper proposes SIMMER, a single MLLM-based model that embeds food images and recipe…
SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding
-
8B model beats Gemini-2.5-Pro on video script creation
MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production
-
ControlFoley resolves conflicts to control video-to-audio output
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
-
Retrieval picks tools for multimodal queries without retraining
RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models
-
Crowdsourced data yields benchmark for video saliency prediction
NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results
-
Hybrid model grounds audio perception before reasoning
Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding
-
Framework maps satellite images to realistic soundscapes
Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery
-
One-step model generates talking avatars 120 times faster
TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation
-
Station data guides radar attention to improve local rain forecasts
M3R: Localized Rainfall Nowcasting with Meteorology-Informed MultiModal Attention
-
One model generates and edits human-object interactions
OneHOI: Unifying Human-Object Interaction Generation and Editing
-
AI fakes spread via likes, not comments, and fool detectors more each year
The Synthetic Media Shift: Tracking the Rise, Virality, and Detectability of AI-Generated Multimodal Misinformation
-
ANVIL creates analogy animations rated adequate by educators
ANVIL: Analogies and Videos for Lecturers
-
Survey urges target-based tests to make AI image generators fairer
Operationalizing Fairness in Text-to-Image Models: A Survey of Bias, Fairness Audits and Mitigation Strategies
-
Benchmark tests AI on spotting audio-visual conflicts in videos
AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction
-
3D priors lift geo-localization accuracy in new domains and weather
GeoLink: A 3D-Aware Framework Towards Better Generalization in Cross-View Geo-Localization
-
Training objective curbs audio-model timestamp hallucinations
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
-
Frozen MLLM plus tiny branch matches full VQA retraining
DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment
-
Deepfake detectors miss listening reactions
Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis
-
Memory-augmented agents jailbreak VLMs on natural images
Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs
-
Video LLMs reach only 71.58% on new esports video benchmark
EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports
-
Text and elevation data sharpen satellite maps of terraced farmland
GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality
-
Staged guidance in diffusion model yields precise movie dubbing
CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing
-
Speech with sketches improves AI design intent match
When Drawing Is Not Enough: Exploring Spontaneous Speech with Sketch for Intent Alignment in Multimodal LLMs
-
Drift-aware quantization keeps robot VLAs accurate at low bits
DA-PTQ: Drift-Aware Post-Training Quantization for Efficient Vision-Language-Action Models
-
Five synced streams adapt XR de-escalation training
From Multimodal Signals to Adaptive XR Experiences for De-escalation Training
-
Feedforward network renders novel views in real time from sparse cameras
3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis
-
LLM knowledge hierarchy lifts image clustering on 14 of 20 datasets
Hierarchical Textual Knowledge for Enhanced Image Clustering
-
8B audio-visual model rivals Gemini on cinematic scripts
OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
-
One model unifies audio generation and editing across sound, music, and speech
Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
-
New dataset pairs real clean video with synthetic rain and snow
LoViF 2026 The First Challenge on Weather Removal in Videos
-
Synthetic labels keep music-flavor structure intact
Multimodal Dataset Normalization and Perceptual Validation for Music-Taste Correspondences
-
Graph saliency priors from fMRI sharpen brain image reconstructions
Brain-Grasp: Graph-based Saliency Priors for Improved fMRI-based Visual Brain Decoding
-
LLMs pick financial tools well but reason poorly from results
FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks
-
MRI change vectors sharpen epilepsy surgery forecasts
Neuro-Oracle: A Trajectory-Aware Agentic RAG Framework for Interpretable Epilepsy Surgical Prognosis
-
Text priors boost stereo volume estimates
Not Your Stereo-Typical Estimator: Combining Vision and Language for Volume Perception
-
Multi-task JRD model boosts VCM by 3.86% BD-mAP
Multi-task Just Recognizable Difference for Video Coding for Machines: Database, Model, and Coding Application