archive

Every paper Pith has read.

378 papers in cs.MM · page 2

cs.MM 2026-05-12 reviewed

3B omni model matches 30B on clean benchmarks
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

Che Liu +6
cs.MM 2026-05-12 reviewed

3B omni model matches 30B on debiased benchmarks
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

Che Liu +6
cs.IR 2026-05-12 reviewed

ZipRerank matches top multimodal rerankers at 10x lower latency
Very Efficient Listwise Multimodal Reranking for Long Documents

Yiqun Sun +2
cs.IR 2026-05-12 reviewed

Critic and generator agents iteratively refine research outlines
AgentDisCo: Towards Disentanglement and Collaboration in Open-ended Deep Research Agents

Jiarui Jin +4
cs.MM 2026-05-12 reviewed

Adaptive path choice lifts unified multimodal reasoning
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

Hayes Bai +4
cs.CV 2026-05-11 reviewed

Unified transformer generates images from raw pixels without VAEs
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

Qi Cai +24
cs.MM 2026-05-11 reviewed

Targeted head boost cuts hallucinations in vision-language models
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

Yangneng Chen +6
cs.MM 2026-05-11 reviewed

Benchmark shows AI models struggle with evidence in multimodal fact-checking
RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild

Danni Xu +3
cs.MM 2026-05-11 reviewed

New benchmark links social posts to fact-check evidence for model testing
RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild

Danni Xu +3
cs.MM 2026-05-11 reviewed

User queries alter video retrieval model behavior
FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries

Qijie You +6
eess.IV 2026-05-11 reviewed

Tube packages stabilize video recovery faster in semantic HARQ
Tube-Structured Incremental Semantic HARQ for Generative Video Receivers

Xuesong Wang +2
eess.IV 2026-05-10 reviewed

Neural preprocessor reports -27.62% BD-VMAF for H.264 on UVG
Kelvin v1.0: A Neural Pre-Encoder for H.264: A standards-compliant learned preprocessor with -27.62% BD-VMAF on UVG

Marco Graziano
cs.CV 2026-05-10 reviewed

Multi-scale supervision cuts pose errors in sign animation
KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation

Guanyi Du +3
eess.IV 2026-05-10 reviewed

Multi-layer CLIP similarities predict machine image preferences
ML-CLIPSim: Multi-Layer CLIP Similarity for Machine-Oriented Image Quality

Feng Ding +5
cs.MM 2026-05-10 reviewed

Dual pathways fix conflicts in text-video-audio intent recognition
Mitigating Multimodal Inconsistency via Cognitive Dual-Pathway Reasoning for Intent Recognition

Yifan Wang +4
cs.CV 2026-05-10 reviewed

Invariant relations to known prototypes turn GCD into reliable pattern matching
Relational Retrieval: Leveraging Known-Novel Interactions for Generalized Category Discovery

Yulin Xu +3
cs.AI 2026-05-10 reviewed

Three agents refine knowledge to lift few-shot time series classification
Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning

Lin Li +9
cs.AI 2026-05-10 reviewed

Three-agent system lifts VLM accuracy on few-shot time series tasks
Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning

Lin Li +9
cs.CL 2026-05-10 reviewed

Home activity benchmark shows AI question-answering gaps
HOME-KGQA: A Benchmark Dataset for Multimodal Knowledge Graph Question Answering on Household Daily Activities

Shusaku Egami +7
cs.GR 2026-05-10 reviewed

Color-adaptive scheme raises 3D Gaussian streaming quality 5-20 dB
CAGS: Color-Adaptive Volumetric Video Streaming with Dynamic 3D Gaussian Splatting

Daheng Yin +9
cs.CV 2026-05-09 reviewed

Gaussian splatting relights VP scenes by sampling LED backgrounds directly
Relightable Gaussian Splatting for Virtual Production Using Image-Based Illumination

Adrian Azzarelli +3
eess.IV 2026-05-09 reviewed

Neural network adapts frame rate and resolution for better streamed graphics
Streaming of rendered content with adaptive frame rate and resolution

Yaru Liu +2
cs.MM 2026-05-09 reviewed

Edge offloading and pruning cut multi-condition T2I latency by 25%
Accelerating Multi-Condition T2I Generation via Adaptive Condition Offloading and Pruning

Yuxin Kong +4
cs.CV 2026-05-09 reviewed

Unison aligns motion, speech and sound in video generation
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

Shihao Cheng +8
cs.CV 2026-05-09 reviewed

Uni-modal focus sharpens weakly supervised AVVP
EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

Huilai Li +5
eess.IV 2026-05-09 reviewed

Thin clients stream interactive 3D Gaussian Splatting over HTTP/3
Thin-Client Interactive Gaussian Adaptive Streaming over HTTP/3

Emanuele Artioli +6
cs.MM 2026-05-08 reviewed

Anisotropic correction fixes modality gaps for unpaired training
Anisotropic Modality Align

Xiaomin Yu +10
cs.MM 2026-05-08 reviewed

Multimedia benchmark shows access method guides terminal agent workflows
MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

Chiyeong Heo +6
cs.SD 2026-05-08 reviewed

Decomposed stages yield better chord variety and rule compliance
A Decomposed Retrieval-Edit-Rerank Framework for Chord Generation

Qiqi He +3
cs.CR 2026-05-08 reviewed

Deleted videos on Honeywell surveillance systems remain recoverable
Forensic analysis of video data deletion and recovery in Honeywell surveillance file system

Jinhee Yoon +1
cs.GR 2026-05-08 reviewed

Semantic codebook creates style-matched co-speech gestures
PersonaGest: Personalized Co-Speech Gesture Generation with Semantic-Guided Hierarchical Motion Representation

Junchuan Zhao +2
cs.SD 2026-05-08 reviewed

Audio-video models fail to keep physics consistent in transitions
Do Joint Audio-Video Generation Models Understand Physics?

Zijun Cui +10
cs.CL 2026-05-07 reviewed

MIST benchmark shows LLMs lag on voice IoT tasks
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

Maximillian Chen +5
cs.CV 2026-05-07 reviewed

Benchmark shows little progress in multimodal domain generalization
Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

Hao Dong +5
eess.IV 2026-05-07 reviewed

Neural codec with FFT encoder outperforms tokenizers on sensors
LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation

Dan Jacobellis +1
cs.MM 2026-05-07 reviewed

Contrastive and uncertainty methods improve emotion recognition
Modality-Aware Contrastive and Uncertainty-Regularized Emotion Recognition

Yan Zhuang +4
cs.CV 2026-05-07 reviewed

The paper introduces Holmes, a hierarchical evidential learning method for retrieving…
Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval

Jun Li +7
cs.CV 2026-05-07 reviewed

LLM and RL coupling with VR feedback creates adaptive 3D scenes
Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling

Anh H. Vo +4
cs.MM 2026-05-06 reviewed

Dual paths learn when to fuse or drop modalities in emotion recognition
To Fuse or to Drop? Dual-Path Learning for Resolving Modality Conflicts in Multimodal Emotion Recognition

Yangchen Yu +7
cs.SD 2026-05-05 reviewed

0.1B omni model reaches 0.09 CER in speech-text consistency
MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

Jingyao Gong
cs.CV 2026-05-05 reviewed

Conformal loop self-calibrates multimodal models on noisy data
Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration

Xun Jiang +7
cs.MM 2026-05-05 reviewed

Imitation learning splits music colors across multiple stage lights
Stage Light is Sequence$^2$: Multi-Light Control via Imitation Learning

Zijian Zhao +3
cs.SD 2026-05-05 reviewed

Aesthetic features lift AI music preference prediction on unseen generators
APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

Jaavid Aktar Husain +1
cs.CV 2026-05-05 reviewed

Dual-system refines scores to boost self-supervised forgery detectors
Enhancing Self-Supervised Talking Head Forgery Detection via a Training-Free Dual-System Framework

Ke Liu +7
cs.LG 2026-05-05 reviewed

Dimension-aware quantiles stabilize multimodal graph unlearning
Stable Multimodal Graph Unlearning via Feature-Dimension Aware Quantile Selection

Jingjing Zhou +7
cs.MM 2026-05-04 reviewed

Reservoir of k streams bounds uptime by harmonic number
The Streaming Reservoir Convergence Theorem: A Prospect-Theoretic Framework for Multi-Provider Adaptive Streaming

Justice Owusu Agyemang +6
cs.MM 2026-05-04 reviewed

CPR restores periodic structure from locally private time series
Period-conscious Time-series Reconstruction under Local Differential Privacy

Yaxuan Wang +4
cs.SD 2026-05-04 reviewed

Offline distillation from DP teacher prevents collapse in private speech classifiers
Private Speech Classification without Collapse: Stabilized DP Training and Offline Distillation

Yadi Wen +4
cs.CV 2026-05-04 reviewed

Video search now returns multiple moments or none
Retrieving Any Relevant Moments: Benchmark and Models for Generalized Moment Retrieval

Yiming Ding +6
cs.MM 2026-05-03 reviewed

Nine systems compete in revived expressive piano rendering contest
RenCon 2025: Revival of the Expressive Performance Rendering Competition

Huan Zhang +9