archive
Every paper Pith has read. Search by title, abstract, or pith.
378 papers in cs.MM · page 7
-
Bioart clusters into four patterns across 13 dimensions
BioArtlas: Computational Clustering of Multi-Dimensional Complexity in Bioart
-
VLM generates music from images with no training or fine-tuning
Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach
-
Dual-path diffusion sharpens lip sync and head poses
KSDiff: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation
-
Text Slider makes concept control 5x faster in diffusion models
Text Slider: Efficient and Plug-and-Play Continuous Concept Control for Image/Video Synthesis via LoRA Adapters
-
Model turns video into object-aware stereo sound
StereoFoley: Object-Aware Stereo Audio Generation from Video
-
Medical LLM benchmarks fail clinical and safety checks
Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
-
Dataset supplies 14,187 lifelog Q&A pairs from personal data
OpenLifelogQA: An Open-Ended Multi-Modal Lifelog Question-Answering Dataset
-
One model restores and masters music from text instructions
SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering
-
Reasoning benchmark lets models explain time series anomalies
Time-RA: Towards Time Series Reasoning for Anomaly Diagnosis with LLM Feedback
-
Agentic framework cuts missing modality errors by 14 percent
How Far Are We from Generating Missing Modalities with Foundation Models?
-
17B sparse DiT generates SOTA images in seconds
HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer
-
Fixed decoder raises audio steganography quality by over 10 dB
FGAS: Fixed Decoder Network-Based Audio Steganography with Adversarial Perturbation Generation
-
Confidence signals guide attention to cut hallucinations
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration
-
Tailored designs succeed on music AVQA where general models struggle
Music Audio-Visual Question Answering Requires Specialized Multimodal Designs
-
RL unlocks autonomous reasoning in text-to-image models
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
-
Dual encoders let LLM score video quality with context plus pixel detail
Context and Pixel Aware Large Language Model for Video Quality Assessment
-
Simulated intent data trains models to spot news deception
Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models
-
Drifted CLIP concepts improve meme metaphor detection at lower cost
Concept Drift Guided LayerNorm Tuning for Efficient Multimodal Metaphor Identification
-
Random linear map turns audio embeddings into dynamic visuals
LAV: Audio-Driven Dynamic Visual Generation with Neural Compression and StyleGAN2
-
Dynamic snapshots raise CRLM survival prediction accuracy
A Dynamic Prognostic Prediction Method for Colorectal Cancer Liver Metastasis
-
Open audio model hits state-of-the-art on speech and conversation benchmarks
Kimi-Audio Technical Report
-
VLM agreement entropy flags OCR errors without labels
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
-
Clustering evidence into narratives improves fact-checking
Fact-Checking with Contextual Narratives: Leveraging Retrieval-Augmented LLMs for Social Media Analysis
-
Collaborative calibration boosts KB-VQA accuracy by 4.7%
Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering
-
Edge criteria halve MACs for 8K super-resolution at 30 FPS
ESSR: An 8K@30FPS Super-Resolution Accelerator With Edge Selective Network
-
One model turns text, video or audio prompts into sound
AudioX: A Unified Framework for Anything-to-Audio Generation
-
Video-to-IMU distillation matches supervised HAR without labels
COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition
-
Reinforcement learning induces reasoning for image segmentation
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
-
Cross-attention generates context-aware human poses in scenes
Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation
-
Cloud gaming system serves twice as many users at higher quality
Stimpack: An Adaptive Rendering Optimization System for Scalable Cloud Gaming
-
Spatially varying 2D Gaussians beat single-color 3D ones on view synthesis
SVGS: Enhancing Gaussian Splatting Using Primitives with Spatially Varying Colors
-
Both global and shared position IDs align video, text and speech
Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis
-
Model predicts video quality curves for adaptive transcoding
Content-Adaptive Rate-Quality Curve Prediction Model in Media Processing System
-
Slide text cues extract target speaker from mixed audio
pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues
-
Survey splits document parsing into pipelines and VLMs
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction
-
Equivariant transformer beats prototype on chord accompaniment
Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment
-
Model predicts live streaming QoE from video features alone
Subjective and Objective Quality-of-Experience Evaluation Study for Live Video Streaming
-
LLM subject boost raises T2I consistency on complex captions
ANCHOR: LLM-driven Subject Conditioning for Text-to-Image Synthesis
-
3D Gaussian splatting delivers real-time explicit rendering
A Survey on 3D Gaussian Splatting
-
New tokenizer lets LLMs beat diffusion models on visuals
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
-
Negative instructions reduce hallucinations in multi-modal models
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
-
Diffusion model rebuilds images from noisy compressed semantics
Generative Semantic Communication: Diffusion Models Beyond Bit Recovery
-
LVLMs describe objects missing from the image
Evaluating Object Hallucination in Large Vision-Language Models
-
LLaMA-Adapter V2 adds visual instructions to LLaMA with 14M parameters
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
-
LLaMA-Adapter tunes 7B model with 1.2M parameters
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
-
Small adapters add precise control to frozen text-to-image models
T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models
-
ControlNet adds edge, pose and depth controls to diffusion image models
Adding Conditional Control to Text-to-Image Diffusion Models
-
CoCa hits 91% ImageNet by joint contrastive and caption training
CoCa: Contrastive Captioners are Image-Text Foundation Models
-
Instance-specific modality selection beats full set for multi-label tasks
Many could be better than all: A novel instance-oriented algorithm for Multi-modal Multi-label problem
-
Inverse CRF corrects color in multi-exposure fused images
A Color Compensation Method Using Inverse Camera Response Function for Multi-exposure Image Fusion