archive

Every paper Pith has read.

378 papers in cs.MM · page 5

cs.MM 2026-04-10 reviewed

Gaze-matched tuning improves AI simulation of user clicks
Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation

Lingfeng Huang +4
cs.MM 2026-04-10 reviewed

Tri-stage pruning speeds MVLA inference 2.55x by tracking 2D/3D salience
2D or 3D: Who Governs Salience in VLA Models? -- Tri-Stage Token Pruning Framework with Modality Salience Awareness

Zihao Zheng +10
cs.MM 2026-04-10 reviewed

Self-generated pseudo-fakes boost deepfake detector generalization
Generalizing Video DeepFake Detection by Self-generated Audio-Visual Pseudo-Fakes

Zihe Wei +1
cs.CV 2026-04-10 reviewed

Frozen vision models locate image manipulations via a small adapter
Off-the-shelf Vision Models Benefit Image Manipulation Localization

Zhengxuan Zhang +4
cs.CV 2026-04-10 reviewed

Trajectories align motion and sound in AV generation
Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

Junchao Liao +7
cs.SD 2026-04-10 reviewed

Hierarchical model generates vocal accompaniments matching SOTA
HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation

Jian Zhu +4
cs.CV 2026-04-09 reviewed

Compact model beats most VLMs on explainable sensitive content
SenBen: Sensitive Scene Graphs for Explainable Content Moderation

Fatih Cagatay Akyon +1
cs.MM 2026-04-09 reviewed

Fine-tuned LLMs translate QoS to QoE and back with strong accuracy
QoS-QoE Translation with Large Language Model

Yingjie Yu +5
cs.CV 2026-04-09 reviewed

SemJudge judges AI art by symbolic and indexical meaning
On Semiotic-Grounded Interpretive Evaluation of Generative Art

Ruixiang Jiang +1
eess.IV 2026-04-09 reviewed

INR conditioning lifts perceptual quality at under 0.05 bpp
DiV-INR: Extreme Low-Bitrate Diffusion Video Compression with INR Conditioning

Eren Çetin +5
cs.CR 2026-04-09 reviewed

Multimodal model adds readable explanations to encrypted traffic
Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark

Longgang Zhang +3
eess.IV 2026-04-09 reviewed

HEVC ROI encryption reaches exact 8x8 coding-unit precision
A H.265/HEVC Fine-Grained ROI Video Encryption Algorithm Based on Coding Unit and Prompt Segmentation

Xiang Zhang +6
cs.CV 2026-04-09 reviewed

UAV dataset with 6-DoF paths improves world model 3D predictions
MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models

Zile Guo +6
cs.CV 2026-04-09 reviewed

Model turns audio into real-time character videos with stable identity
LPM 1.0: Video-based Character Performance Model

Ailing Zeng +24
cs.CV 2026-04-09 reviewed

Cross-modal attention improves audio-visual deepfake detection
MSCT: Differential Cross-Modal Attention for Deepfake Detection

Fangda Wei +5
cs.CV 2026-04-08 reviewed

Vision models give inconsistent cultural metadata from images
Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Yuechen Jiang +6
cs.IR 2026-04-08 reviewed

Benchmark tests AI on comparing music across track pairs
Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering

Junyoung Koh +7
cs.HC 2026-04-08 reviewed

Multimodal signals outperform video for predicting driver automation transitions
BATON: A Multimodal Benchmark for Bidirectional Automation Transition Observation in Naturalistic Driving

Yuhang Wang +5
cs.CV 2026-04-08 reviewed

SurFITR dataset shows forgery detectors fail on surveillance scenes
SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation

Qizhou Wang +2
cs.MM 2026-04-08 reviewed

Benchmark of 1000 real cases tests AI lung-cancer reasoning
LungCURE: Benchmarking Multimodal Real-World Clinical Reasoning for Precision Lung Cancer Diagnosis and Treatment

Fangyu Hao +16
cs.CY 2026-04-08 reviewed

AI tools turn course notes into useful videos for EAP students
AI Slop or AI-enhancement? Student perceptions of AI-generated media for an English for Academic Purposes course

David James Woo +2
cs.CV 2026-04-08 reviewed

Uncertainty Gaussians improve text-image sarcasm detection
URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection

Zhenyu Wang +5
cs.HC 2026-04-07 reviewed

Text prompts generate matching haptic and visual textures
Language-Guided Multimodal Texture Authoring via Generative Models

Wanli Qian +4
cs.LG 2026-04-07 reviewed

Graph embeddings flag microservice anomalies missed by load tests
From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures

Srinidhi Madabhushi +5
cs.CV 2026-04-07 reviewed

Paired food photos let vision models estimate exact consumption
DietDelta: A Vision-Language Approach for Dietary Assessment via Before-and-After Images

Gautham Vinod +3
cs.CV 2026-04-07 reviewed

Graph structure improves part-based image coherence
Graph-PiT: Enhancing Structural Coherence in Part-Based Image Synthesis via Graph Priors

Junbin Zhang +4
cs.MM 2026-04-07 reviewed

Shared prototype space refines multimodal sentiment predictions
Learning Shared Sentiment Prototypes for Adaptive Multimodal Sentiment Analysis

Chen Su +2
cs.CV 2026-04-07 reviewed

Benchmark localizes hallucinations at token level in 200-word captions
DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

Xinran Wang +9
cs.CV 2026-04-07 reviewed

Bounding boxes lock composed queries to exact instances
Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval

Yuxin Yang +8
cs.MM 2026-04-07 reviewed

Edge model gates MLLM calls to cut video alert delay 77 percent
DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems

Qi Guo +4
eess.IV 2026-04-07 reviewed

Channel importance boosts machine vision codec performance
CI-ICM: Channel Importance-driven Learned Image Coding for Machines

Yun Zhang +5
cs.MM 2026-04-07 reviewed

LLM pipeline creates STEM animations that raise test scores over slides
LLM2Manim: Pedagogy-Aware AI Generation of STEM Animations

Aastha Joshi +5
cs.MA 2026-04-06 reviewed

Coordination mechanism improves music video edits
GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing

Zihao Lin +9
cs.CV 2026-04-06 reviewed

Multi-agent system creates coherent video mashups
DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing

Ke Li +6
cs.CV 2026-04-06 reviewed

Event overlay lifts robot pick success from 0% to 90% in dark
E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

Jiajun Zhai +4
eess.IV 2026-04-06 reviewed

Semantic priors from transformers reduce depth boundary artifacts
NAIMA: Semantics Aware RGB Guided Depth Super-Resolution

Tayyab Nasir +2
cs.CV 2026-04-06 reviewed

New diffusion model turns music into editable 3D conductor motions
BiTDiff: Fine-Grained 3D Conducting Motion Generation via BiMamba-Transformer Diffusion

Tianzhi Jia +6
cs.SD 2026-04-06 reviewed

Model generates complete audio scenes with speech from video and text
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

Weiguo Pian +6
cs.CL 2026-04-04 reviewed

LightThinker++ cuts LLM peak tokens by 70% while raising accuracy
LightThinker++: From Reasoning Compression to Memory Management

Yuqi Zhu +9
cs.CV 2026-04-04 reviewed

Denoising-stage filter cuts image check time by up to 79 percent
EDGE-Shield: Efficient Denoising-staGE Shield for Violative Content Filtering via Scalable Reference-Based Matching

Takara Taniguchi +5
cs.CV 2026-04-03 reviewed

Dual-domain edges lift UAV detection to 36.8 AP
SFFNet: Synergistic Feature Fusion Network With Dual-Domain Edge Enhancement for UAV Image Object Detection

Wenfeng Zhang +6
eess.IV 2026-04-03 reviewed

New dataset shows foreground degradations drive AR quality
ARIQA-3DS: A Stereoscopic Image Quality Assessment Dataset for Realistic Augmented Reality

Aymen Sekhri +2
cs.CV 2026-04-03 reviewed

Middle-layer evidence cuts video model hallucinations
STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models

Linfeng Fan +3
cs.CV 2026-04-03 reviewed

SentiAvatar turns speech into real-time 3D avatar gestures and expressions
SentiAvatar: Towards Expressive and Interactive Digital Humans

Chuhao Jin +7
eess.IV 2026-04-03 reviewed

Streaming 3D Gaussians improves viewpoint flexibility over video
Streaming Real-Time Rendered Scenes as 3D Gaussians

Matti Siekkinen +1
cs.CV 2026-04-03 reviewed

PaveBench adds interactive QA to pavement distress benchmarks
PaveBench: A Versatile Benchmark for Pavement Distress Perception and Interactive Vision-Language Analysis

Dexiang Li +5
cs.MM 2026-04-03 reviewed

Psychology stimuli improve distinction of mental disorders
Differential Mental Disorder Detection with Psychology-Inspired Multimodal Stimuli

Zhiyuan Zhou +11
cs.CV 2026-04-03 reviewed

AI now builds video trailers instead of selecting clips
Generative AI for Video Trailer Synthesis: From Extractive Heuristics to Autoregressive Creativity

Abhishek Dharmaratnakar +3
cs.CV 2026-04-03 reviewed

Smart Transfer enables fast earthquake damage maps from satellite images
Smart Transfer: Leveraging Vision Foundation Model for Rapid Building Damage Mapping with Post-Earthquake VHR Imagery

Hao Li +6
cs.CV 2026-04-02 reviewed

Text descriptions locate urban positions on OSM tiles to meter accuracy
TOL: Textual Localization with OpenStreetMap

Youqi Liao +8