archive

Every paper Pith has read.

378 papers in cs.MM · page 7

cs.IR 2025-09-27 reviewed

Bioart clusters into four patterns across 13 dimensions
BioArtlas: Computational Clustering of Multi-Dimensional Complexity in Bioart

Joonhyung Bae
cs.SD 2025-09-26 reviewed

VLM generates music from images with no training or fine-tuning
Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach

Zijian Zhao +2
cs.GR 2025-09-24 reviewed

Dual-path diffusion sharpens lip sync and head poses
KSDiff: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation

Tianle Lyu +2
cs.GR 2025-09-23 reviewed

Text Slider makes concept control 5x faster in diffusion models
Text Slider: Efficient and Plug-and-Play Continuous Concept Control for Image/Video Synthesis via LoRA Adapters

Pin-Yen Chiu +2
cs.SD 2025-09-22 reviewed

Model turns video into object-aware stereo sound
StereoFoley: Object-Aware Stereo Audio Generation from Video

Tornike Karchkhadze +6
cs.CL 2025-08-06 reviewed

Medical LLM benchmarks fail clinical and safety checks
Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

Wenting Chen +7
cs.MM 2025-08-05 reviewed

Dataset supplies 14,187 lifelog Q&A pairs from personal data
OpenLifelogQA: An Open-Ended Multi-Modal Lifelog Question-Answering Dataset

Quang-Linh Tran +5
cs.SD 2025-08-05 reviewed

One model restores and masters music from text instructions
SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering

Jan Melechovsky +3
cs.LG 2025-07-20 reviewed

Reasoning benchmark lets models explain time series anomalies
Time-RA: Towards Time Series Reasoning for Anomaly Diagnosis with LLM Feedback

Yiyuan Yang +8
cs.MM 2025-06-04 reviewed

Agentic framework cuts missing modality errors by 14 percent
How Far Are We from Generating Missing Modalities with Foundation Models?

Guanzhou Ke +4
cs.CV 2025-05-28 reviewed

17B sparse DiT generates SOTA images in seconds
HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

Qi Cai +21
cs.SD 2025-05-28 reviewed

Fixed decoder raises audio steganography quality by over 10 dB
FGAS: Fixed Decoder Network-Based Audio Steganography with Adversarial Perturbation Generation

Jialin Yan +6
cs.CV 2025-05-27 reviewed

Confidence signals guide attention to cut hallucinations
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration

Mehrdad Fazli +3
cs.SD 2025-05-27 reviewed

Tailored designs succeed on music AVQA where general models struggle
Music Audio-Visual Question Answering Requires Specialized Multimodal Designs

Wenhao You +11
cs.CV 2025-05-22 reviewed

RL unlocks autonomous reasoning in text-to-image models
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Chengqi Duan +7
cs.CV 2025-05-21 reviewed

Dual encoders let LLM score video quality with context plus pixel detail
Context and Pixel Aware Large Language Model for Video Quality Assessment

Wen Wen +5
cs.CV 2025-05-21 reviewed

Simulated intent data trains models to spot news deception
Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models

Jiaying Wu +4
cs.MM 2025-05-16 reviewed

Drifted CLIP concepts improve meme metaphor detection at lower cost
Concept Drift Guided LayerNorm Tuning for Efficient Multimodal Metaphor Identification

Wenhao Qian +3
cs.SD 2025-05-15 reviewed

Random linear map turns audio embeddings into dynamic visuals
LAV: Audio-Driven Dynamic Visual Generation with Neural Compression and StyleGAN2

Jongmin Jung +1
eess.IV 2025-05-06 reviewed

Dynamic snapshots raise CRLM survival prediction accuracy
A Dynamic Prognostic Prediction Method for Colorectal Cancer Liver Metastasis

Wei Yang +5
eess.AS 2025-04-25 reviewed

Open audio model hits state-of-the-art on speech and conversation benchmarks
Kimi-Audio Technical Report

KimiTeam +39
cs.CV 2025-04-15 reviewed

VLM agreement entropy flags OCR errors without labels
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR

Yulong Zhang +7
cs.MM 2025-04-14 reviewed

Clustering evidence into narratives improves fact-checking
Fact-Checking with Contextual Narratives: Leveraging Retrieval-Augmented LLMs for Social Media Analysis

Arka Ujjal Dey +4
cs.CV 2025-04-05 reviewed

Collaborative calibration boosts KB-VQA accuracy by 4.7%
Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering

Jiaqi Deng +5
cs.AR 2025-03-26 reviewed

Edge criteria halve MACs for 8K super-resolution at 30 FPS
ESSR: An 8K@30FPS Super-Resolution Accelerator With Edge Selective Network

Chih-Chia Hsu +1
cs.MM 2025-03-13 reviewed

One model turns text, video or audio prompts into sound
AudioX: A Unified Framework for Anything-to-Audio Generation

Zeyue Tian +8
cs.CV 2025-03-10 reviewed

Video-to-IMU distillation matches supervised HAR without labels
COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition

Baiyu Chen +5
cs.CV 2025-03-09 reviewed

Reinforcement learning induces reasoning for image segmentation
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Yuqi Liu +6
cs.CV 2025-02-19 reviewed

Cross-attention generates context-aware human poses in scenes
Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation

Prasun Roy +4
cs.DC 2024-12-27 reviewed

Cloud gaming system serves twice as many users at higher quality
Stimpack: An Adaptive Rendering Optimization System for Scalable Cloud Gaming

Jin Heo +3
cs.CV 2024-11-28 reviewed

Spatially varying 2D Gaussians beat single-color 3D ones on view synthesis
SVGS: Enhancing Gaussian Splatting Using Primitives with Spatially Varying Colors

Rui Xu +9
cs.MM 2024-11-26 reviewed

Both global and shared position IDs align video text and speech
Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis

Akshita Gupta +5
cs.MM 2024-11-08 reviewed

Model predicts video quality curves for adaptive transcoding
Content-Adaptive Rate-Quality Curve Prediction Model in Media Processing System

Shibo Yin +6
cs.SD 2024-11-05 reviewed

Slide text cues extract target speaker from mixed audio
pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues

Ziyang Jiang +6
cs.MM 2024-10-28 reviewed

Survey splits document parsing into pipelines and VLM models
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

Qintong Zhang +7
cs.SD 2024-10-23 reviewed

Equivariant transformer beats prototype on chord accompaniment
Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment

Weiliang Luo
cs.MM 2024-09-26 reviewed

Model predicts live streaming QoE from video features alone
Subjective and Objective Quality-of-Experience Evaluation Study for Live Video Streaming

Zehao Zhu +9
cs.CV 2024-04-15 reviewed

LLM subject boost raises T2I consistency on complex captions
ANCHOR: LLM-driven Subject Conditioning for Text-to-Image Synthesis

Aashish Anantha Ramakrishnan +2
cs.CV 2024-01-08 reviewed

3D Gaussian splatting delivers real-time explicit rendering
A Survey on 3D Gaussian Splatting

Guikun Chen +1
cs.CV 2023-10-09 reviewed

New tokenizer lets LLMs beat diffusion models on visuals
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Lijun Yu +15
cs.CV 2023-06-26 reviewed

Negative instructions reduce hallucinations in multi-modal models
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu +5
cs.AI 2023-06-07 reviewed

Diffusion model rebuilds images from noisy compressed semantics
Generative Semantic Communication: Diffusion Models Beyond Bit Recovery

Eleonora Grassucci +2
cs.CV 2023-05-17 reviewed

LVLMs describe objects missing from the image
Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li +5
cs.CV 2023-04-28 reviewed

LLaMA-Adapter V2 adds visual instructions to LLaMA with 14M parameters
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Peng Gao +11
cs.CV 2023-03-28 reviewed

LLaMA-Adapter tunes 7B model with 1.2M parameters
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Renrui Zhang +9
cs.CV 2023-02-16 reviewed

Small adapters add precise control to frozen text-to-image models
T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

Chong Mou +7
cs.CV 2023-02-10 reviewed

ControlNet adds edge, pose and depth controls to diffusion image models
Adding Conditional Control to Text-to-Image Diffusion Models

Lvmin Zhang +2
cs.CV 2022-05-04 reviewed

CoCa hits 91% ImageNet by joint contrastive and caption training
CoCa: Contrastive Captioners are Image-Text Foundation Models

Jiahui Yu +5
cs.LG 2019-07-27 reviewed

Instance-specific modality selection beats full set for multi-label tasks
Many could be better than all: A novel instance-oriented algorithm for Multi-modal Multi-label problem

Yi Zhang +4
eess.IV 2019-07-26 reviewed

Inverse CRF corrects color in multi-exposure fused images
A Color Compensation Method Using Inverse Camera Response Function for Multi-exposure Image Fusion

Artit Visavakitcharoen +2