archive

Every paper Pith has read.

378 papers in cs.MM · page 3

cs.MM 2026-05-03 reviewed

Context map plus recursive sampling improves semantic video over MIMO-OFDM
Contextual Wireless Video Semantic Communication in MIMO-OFDM Systems

Bingyan Xie +5
cs.CV 2026-05-03 reviewed

MOC-3D fixes Janus faces and texture jumps via CLIP order and SPD distances
MOC-3D: Manifold-Order Consistency for Text-to-3D Generation

Chenyang Fan +7
cs.SD 2026-05-03 reviewed

Correcting fusion bottlenecks lifts AV task performance
Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning

Xinmeng Xu +5
cs.IR 2026-05-02 reviewed

Multi-turn queries sharpen health video retrieval
Interactive Multi-Turn Retrieval for Health Videos

Chengzheng Wu +5
cs.MM 2026-05-02 reviewed

Confidence scores let AV metrics ignore the bad modality
Multimodal Confidence Modeling in Audio-Visual Quality Assessment

Mayesha Maliha R. Mithila +1
cs.SD 2026-05-02 reviewed

Transformer turns music into 3D conducting gestures
MG-Former: A Transformer-Based Framework for Music-Driven 3D Conducting Gesture Generation

Ke Qiu +5
eess.IV 2026-05-02 reviewed

Blackwell NVENC UHQ gains quality at 400% latency cost
Evolution of NVENC Efficiency: A Longitudinal Analysis of HQ and UHQ Tuning Efficiency, Latency and Energy Trade-offs

Kasidis Arunruangsirilert +1
cs.MM 2026-05-01 reviewed

PRISM fixes spurious isolation in federated multimodal learning
PRISM: Exposing and Resolving Spurious Isolation in Federated Multimodal Continual Learning

Beining Wu +2
cs.MM 2026-05-01 reviewed

This paper introduces TD-Data
CustomDancer: Customized Dance Recommendation by Text-Dance Retrieval

Yawen Qin +2
cs.NI 2026-05-01 reviewed

EASE severs three anchors to unlearn federated multimodal data
EASE: Federated Multimodal Unlearning via Entanglement-Aware Anchor Closure

Zihao Ding +2
cs.CV 2026-05-01 reviewed

Stable cross-modal alignment flags AI-generated videos
CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection

Hang Wang +5
cs.LG 2026-05-01 reviewed

Two-stage agents curb dominance and coupling in multimodal learning
Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration

Chunlei Meng +9
cs.LG 2026-05-01 reviewed

Two-stage agents fix multimodal fusion by selecting useful exchanges
Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration

Chunlei Meng +9
cs.GR 2026-05-01 reviewed

Interactive visuals help non-experts grasp ML
Towards Interactive Multimodal Representation of ML Functions for Human Understanding of ML

Bokang Wang +6
stat.CO 2026-04-30 reviewed

Three streaming covariance algorithms agree exactly in exact arithmetic
$2B$ or Not $2B$: A Tale of Three Algorithms for Streaming: Covariance Estimation after Welford and Chan-Golub-LeVeque

Felix Reichel
cs.MM 2026-04-30 reviewed

KAN fusion raises robocall recall and F1 over baselines
RoboKA: KAN Informed Multimodal Learning for RoboCall Surveillance System

Nitin Choudhury +7
eess.AS 2026-04-30 reviewed

New benchmark makes AVSR considerably harder than LRS3
LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition

Doyeop Kwak +3
cs.NI 2026-04-30 reviewed

ReVo cuts volumetric video freezes by up to 95%
ReVo: A Cross-Layer Reliable Volumetric Videoconferencing System

Ankur Aditya +5
cs.CV 2026-04-29 reviewed

New codec shrinks 3D Gaussian Splatting over 34 times
MesonGS++: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching

Shuzhao Xie +12
cs.CV 2026-04-29 reviewed

MesonGS++ compresses 3D Gaussian models over 34 times post-training
MesonGS++: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching

Shuzhao Xie +12
cs.NI 2026-04-29 reviewed

5G architecture improves video QoE by 70% via subflow prioritization
StreamGuard: Exploring a 5G Architecture for Efficient, Quality of Experience-Aware Video Conferencing

Xuyang Cao +2
cs.NI 2026-04-29 reviewed

5G system boosts video QoE 70% while preserving background traffic
StreamGuard: Exploring a 5G Architecture for Efficient, Quality of Experience-Aware Video Conferencing

Xuyang Cao +2
cs.NI 2026-04-29 reviewed

5G video calls gain 70% better quality with new controller
StreamGuard: Exploring a 5G Architecture for Efficient, Quality of Experience-Aware Video Conferencing

Xuyang Cao +2
cs.CV 2026-04-29 reviewed

CNN identifies fashion houses at 78% accuracy by texture
FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing

Morayo Danielle Adeyemi +2
cs.MM 2026-04-28 reviewed

Query markers sharpen video model timing without training
MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding

Pengcheng Fang +2
cs.MM 2026-04-28 reviewed

Interpretation cue conditions multimodal predictions on dialogue context
Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding

Zhaoyan Pan +7
cs.CV 2026-04-28 reviewed

Benchmark separates video models on document evidence tasks
FCMBench-Video: Benchmarking Document Video Intelligence

Runze Cui +5
cs.MM 2026-04-28 reviewed

Rebalancing shared-private branches lifts multimodal sentiment accuracy
Mitigating Shared-Private Branch Imbalance via Dual-Branch Rebalancing for Multimodal Sentiment Analysis

Chunlei Meng +6
cs.AI 2026-04-27 reviewed

Hierarchical agents fix drift in AI video stories
Co-Director: Agentic Generative Video Storytelling

Yale Song +15
cs.CV 2026-04-27 reviewed

Meta-CoT raises image editing 15.8% by training on five meta-tasks
Meta-CoT: Enhancing Granularity and Generalization in Image Editing

Shiyi Zhang +10
cs.CV 2026-04-27 reviewed

Retrieval index alone flags new species or known ones
DeepTaxon: An Interpretable Retrieval-Augmented Multimodal Framework for Unified Species Identification and Discovery

Jiawei Wang +10
cs.HC 2026-04-27 reviewed

Two-stage reasoning hits 80% accuracy on human intention recognition
IntentVLM: Open-Vocabulary Intention Recognition through Forward-Inverse Modeling with Video-Language Models

Hamed Rahimi +4
eess.IV 2026-04-27 reviewed

Images reconstruct uniquely from sparse Laplacian fields
Shared-kernel Wavelet Neural Networks for Poisson Image Reconstruction

Yuanhao Gong +2
cs.CV 2026-04-26 reviewed

Real-time avatars stream at 20 FPS with teacher-level sync
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

Chunyu Li +6
cs.CV 2026-04-26 reviewed

Shared high-level tokens plus separate decoders improve talking audio-video
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

Zhen Ye +10
cs.MM 2026-04-26 reviewed

Coordinated AI agents generate movies with consistent characters
CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration

Tianyidan Xie +6
cs.IR 2026-04-26 reviewed

Adaptive SID learning preserves compatible overlaps for better recs
Beyond Static Collision Handling: Adaptive Semantic ID Learning for Multimodal Recommendation at Industrial Scale

Yongsen Pan +10
cs.MM 2026-04-25 reviewed

OceanPile corpus unifies sonar
OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

Yida Xue +7
cs.CV 2026-04-25 reviewed

Meta-model predicts deep network sample errors from task stats
MetaErr: Towards Predicting Error Patterns in Deep Neural Networks

Varun Totakura +1
cs.CV 2026-04-25 reviewed

Cascade filters skeletons then verifies semantics in anomaly search
Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search

Zequn Xie +6
cs.CV 2026-04-25 reviewed

Latent probing detects adult content in videos at 97% F1
Latent Space Probing for Adult Content Detection in Video Generative Models

Alizishaan Khatri +1
cs.MM 2026-04-24 reviewed

Benchmark shows T2V models strong on objects but weak on actions and audio sync
BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios

Advait Tilak +3
cs.SD 2026-04-24 reviewed

Beat-guided transformer quantizes MIDI rhythms to scores
Transformer-Based Rhythm Quantization of Performance MIDI Using Beat Annotations

Maximilian Wachter +2
cs.MM 2026-04-23 reviewed

Pre-speech eye movements predict memory timing in survivor interviews
Looking Into the Past: Eye Movements Characterize Elements of Autobiographical Recall in Interviews with Holocaust Survivors

Emily Zhou +4
cs.GR 2026-04-23 reviewed

Calibrated encoders track face identity across artistic styles
StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition

Kwan Yun +5
cs.CV 2026-04-23 reviewed

Uncertainty modeling enhances facial action unit detection
UAU-Net: Uncertainty-aware Representation Learning and Evidential Classification for Facial Action Unit Detection

Yuze Li +1
cs.MM 2026-04-22 reviewed

360° videos with footprint 3D models create realistic local flood experiences
Realistic Virtual Flood Experience System Using 360° Videos and 3D City Models Constructed from Building Footprints

Tatsuro Banno +4
cs.SD 2026-04-22 reviewed

Benchmark reveals AI music models perceive notation but miss theory
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence

Menghe Ma +7
cs.MM 2026-04-22 reviewed

Semantic transport cuts agent bandwidth 64x for audio
Sema: Semantic Transport for Real-Time Multimodal Agents

Jiaying Meng +1
cs.MM 2026-04-22 reviewed

AttentionBender lets artists apply geometric transforms like rotation and scaling…
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe

Adam Cole +1