archive

Every paper Pith has read.

378 papers in cs.MM · page 6

cs.CV 2026-04-01 reviewed

ProCap separates projections from physical scenes in AR
ProCap: Projection-Aware Captioning for Spatial Augmented Reality

Zimo Cao +3
cs.CV 2026-04-01 reviewed

Hypergraph contrastive learning recovers 3D crowd meshes
Contrastive Multi-Modal Hypergraph Reasoning for 3D Crowd Mesh Recovery

Minghao Sun +4
cs.MM 2026-03-31 reviewed

Movie dialogues train AI for finer multimodal control
From Natural Alignment to Conditional Controllability in Multimodal Dialogue

Zeyu Jin +7
cs.CR 2026-03-23 reviewed

Comic stories bypass safety in multimodal AI models
Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models

Rui Yang Tan +2
cs.NI 2026-03-22 reviewed

Semantic fields predict gaze in 360 video streams without training
Training-Free Adaptive 360-degree Video Streaming via Semantic Potential Fields

Aizierjiang Aiersilan +1
cs.CL 2026-03-20 reviewed

OmniTrace traces each output token to supporting input spans
OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs

Qianqi Yan +6
cs.CL 2026-03-18 reviewed

AI models miss key dental referrals that junior dentists catch
Dental-TriageBench: Benchmarking Multimodal Reasoning for Hierarchical Dental Triage

Ziyi He +9
cs.CL 2026-03-18 reviewed

Korean benchmark caps multimodal models at 42 percent accuracy
KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

Nahyun Lee +6
cs.CV 2026-03-15 reviewed

Discrete flow matching improves video dubbing sync
DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization

Ngoc-Son Nguyen +5
cs.CV 2026-03-11 reviewed

Zero-pair model aligns video to music via event curves
V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

Yan-Bo Lin +5
cs.IR 2026-03-07 reviewed

Simple captions lift text-to-video recall rates
Understanding the Performance Plateau in Text-to-Video Retrieval: A Comprehensive Empirical and Linguistic Analysis

Maria-Eirini Pegia +6
cs.LG 2026-03-04 reviewed

VDCook turns natural language into continuously updating video datasets
VDCook: DIY video data cook your MLLMs

Chengwei Wu
cs.CR 2026-03-02 reviewed

Valid C2PA claim can assert human origin for AI-watermarked image
Authenticated Contradictions from Desynchronized Provenance and Watermarking

Alexander Nemecek +3
cs.CV 2026-03-02 reviewed

Pyramidal memory distills long videos into semantic schemas
From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

Niu Lian +7
cs.LG 2026-02-18 reviewed

ModalImmune builds resilience in multimodal models by deliberately collapsing selected…
ModalImmune: Immunity Driven Unlearning via Self Destructive Training

Rong Fu +8
cs.MM 2026-02-18 reviewed

Hyperbolic hypergraphs recover emotions from incomplete multimodal signals
Emotion Collider: Dual Hyperbolic Mirror Manifolds for Sentiment Recovery via Anti Emotion Reflection

Rong Fu +7
eess.IV 2026-02-12 reviewed

Gradient maps detect CU steganography in HEVC videos
H.265/HEVC Video Steganalysis Based on CU Block Structure Gradients and IPM Mapping

Xiang Zhang +5
cs.CR 2026-02-09 reviewed

Edge server filters XR frames to protect privacy in cloud AI queries
PRISM-XR: Empowering Privacy-Aware XR Collaboration with Multimodal Large Language Models

Jiangong Chen +2
cs.CV 2026-02-03 reviewed

Label-free system judges AI images by self-built pairs
ELIQ: A Label-Free Framework for Quality Assessment of Evolving AI-Generated Images

Xinyue Li +7
cs.CV 2026-02-02 reviewed

Unpaired text replaces paired image data for MLLM pretraining
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

Xiaomin Yu +14
cs.MM 2026-01-31 reviewed

Benchmark diagnoses failures in multi-talker AI videos
MTAVG-Bench: A Diagnostic Benchmark for Multi-Talker Dialogue-Centric Audio-Video Generation

Yang-Hao Zhou +14
cs.IR 2026-01-30 reviewed

Binary codes match dense vectors in wildlife observation retrieval
Compact Hypercube Embeddings for Fast Text-based Wildlife Observation Retrieval

Ilyass Moummad +9
cs.CV 2026-01-27 reviewed

Benchmark tests image editors on bilingual dense documents
VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

Hongzhu Yi +20
cs.MM 2026-01-26 reviewed

Multimodal sensors track developmental shifts in young laying hens
Multimodal Digital Sensing of Early-Life Laying Hens: A Pilot Study Integrating Thermal, Acoustic, Optical-Flow and Environmental Data

Yashan Dhaliwal +2
cs.CV 2026-01-15 reviewed

Multimodal model handles missing data to predict lung cancer survival
Handling Missing Modalities in Multimodal Survival Prediction for Non-Small Cell Lung Cancer

Filippo Ruffini +18
cs.IR 2026-01-13 reviewed

Dynamic benchmark VeriTaS adds claims quarterly to block pretraining leakage
VeriTaS: The First Dynamic Benchmark for Multimodal Automated Fact-Checking

Mark Rothermel +3
cs.SD 2026-01-06 reviewed

Single model hits top scores on video
Omni2Sound: Towards Unified Video-Text-to-Audio Generation

Yusheng Dai +6
cs.CV 2026-01-06 reviewed

View cone sampling improves 3D saliency maps in VR
Robust Mesh Saliency Ground Truth Acquisition in VR via View Cone Sampling and Manifold Diffusion

Guoquan Zheng +9
cs.CV 2026-01-04 reviewed

Global context plus text yields better few-shot fonts
Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation

Haonan Cai +2
cs.CV 2026-01-04 reviewed

LinMU matches VLM accuracy with linear complexity
LinMU: Multimodal Understanding Made Linear

Hongjie Wang +1
cs.CV 2026-01-03 reviewed

Layer masking and subspace split generalize deepfake detection
Generalizable Deepfake Detection Based on Forgery-aware Layer Masking and Multi-artifact Subspace Decomposition

Xiang Zhang +6
cs.LG 2025-12-28 reviewed

Federated clustering uses tensor low-rank to share client structure
Federated Multi-Task Clustering

Suyan Dai +5
cs.IR 2025-12-26 reviewed

Fusion of LVLM hidden states with IDs beats captions for micro-video recs
Frozen LVLMs for Micro-Video Recommendation: A Systematic Study of Feature Extraction and Fusion

Huatuan Sun +5
cs.MM 2025-12-22 reviewed

Decoupled streams lift AV speaker detection to 95.6% accuracy
Dual-Stream Decoupled Learning for Temporal Consistency and Speaker Interaction in AVSD

Junhao Xiao +8
cs.CV 2025-12-21 reviewed

Video moderation cuts communication 28x with on-device privacy
FedVideoMAE: Efficient Privacy-Preserving Federated Video Moderation

Ziyuan Tao +6
cs.MM 2025-12-14 reviewed

Best Omni-LLMs score only 65.3% on joint audio-visual benchmark
JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation

Jianghan Chao +5
cs.CL 2025-12-08 reviewed

This paper describes a method to add selected speech tokens from an ASR tokenizer into…
A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification

Nicolas Calbucura +2
cs.CV 2025-11-22 reviewed

Synthetic pipeline builds balanced video anomaly benchmark
Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks

Jie Li +6
cs.CV 2025-11-15 reviewed

Calibration fixes anchor shift from missing modalities
Calibrated Multimodal Representation Learning with Missing Modalities

Xiaohao Liu +6
eess.SP 2025-11-11 reviewed

Diffusion models let receivers rebuild content from tiny semantic cues
Generative AI Meets 6G and Beyond: Diffusion Models for Semantic Communications

Hai-Long Qin +8
cs.CV 2025-11-07 reviewed

Gaussian split lifts dynamic 3D quality on sparse camera setups
Splatography: Sparse multi-view dynamic Gaussian Splatting for filmmaking challenges

Adrian Azzarelli +2
cs.CL 2025-11-04 reviewed

Bangladesh AI scores 75-80% on bar exams at under 1% cost
Mina: A Multilingual LLM-Powered Legal Assistant Agent for Bangladesh for Empowering Access to Justice

Azmine Toushik Wasi +3
cs.MM 2025-11-02 reviewed

Webcam gestures turn into continuous music at 30 ms latency
Gesture2Music: A Low-Latency Real-Time Framework for Continuous Gesture-Driven Music Generation

Rathinaraja Jeyaraj +3
cs.AI 2025-10-30 reviewed

Visual keys occupy separate subspace from text keys in MLLMs
MaLoRA: Gated Modality LoRA for Key-Space Alignment in Multimodal LLM Fine-Tuning

Xinhan Zheng +4
cs.NI 2025-10-29 reviewed

Google Drive leads in consistent cloud upload performance on Wi-Fi and LTE
Performance Evaluation of Multimedia Traffic in Cloud Storage Services over Wi-Fi and LTE Networks

Albert Espinal +2
eess.IV 2025-10-20 reviewed

AV1 motion vectors speed up optical flow fourfold
AV1 Motion Vector Fidelity and Application for Efficient Optical Flow

Julien Zouein +2
eess.IV 2025-10-09 reviewed

Unified fusion of frames and sources sharpens remote sensing images
SatFusion: A Unified Framework for Enhancing Remote Sensing Images via Multi-Frame and Multi-Source Images Fusion

Yufei Tong +5
cs.MM 2025-09-30 reviewed

Twin DiT modules generate synced audio and video in one pass
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

Chetwin Low +2
cs.CV 2025-09-29 reviewed

Latent space method edits heart rate in videos
Editing Physiological Signals in Videos Using Latent Representations

Tianwen Zhou +3
cs.SD 2025-09-27 reviewed

TV dialogue dataset raises voice role-play scores 38 percent
AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models

Wenyu Li +4