archive

Every paper Pith has read.

378 papers in cs.MM · page 1

cs.MM 2026-05-22 reviewed

Swarical localizes flying light specks twice as fast
Swarical: An Integrated Hierarchical Approach to Localizing Flying Light Specks

Hamed Alimohammadzadeh +1
cs.CV 2026-05-22 reviewed

Adaptive search fixes blind spots in high-res image perception for LLMs
CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception

Liupeng Li +6
cs.GR 2026-05-22 reviewed

Sketches control long video generation via independent shots
DrawVideo: Generating Long Video from Storyboard Keyframe Sketches

Chuanzhi Xu +9
cs.CV 2026-05-22 reviewed

Semantic scores trigger early stops in video motion search
FAST-ME: Foundation-aware Adaptive Stopping for Motion Estimation for Efficient IoT Video Analysis

Kakia Panagidi +1
cs.SD 2026-05-22 reviewed

Multi-stream prompts cut deepfake errors in mixed audio
MixFake: Benchmarking and Enhancing Audio Deepfake Detection in Diverse Real-world Mixed Audio

Qingcao Li +5
cs.SD 2026-05-21 reviewed

Diffusion models match discrete models for live music
Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

Zachary Novack +10
cs.CV 2026-05-21 reviewed

Sparse autoencoder links reasoning steps to image masks
SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation

Zhenyu Lu +6
cs.CV 2026-05-21 reviewed

Unified model handles many fashion search types at once
FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning

Haokun Wen +5
cs.CV 2026-05-21 reviewed

MLLM planner in ViT space guides DiT to SOTA video generation and edits
Bernini: Latent Semantic Planning for Video Diffusion

Bernini Team: Chenchen Liu +10
cs.CV 2026-05-21 reviewed

Multi-grained compression lifts long video QA accuracy
MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering

Junbin Xiao +4
cs.CR 2026-05-21 reviewed

Proxy reorders API keys to embed traceable watermarks
PEMark: Watermarking API Responses Based on Proxy Gateways and Position Encoding

Yifei Zhou +4
cs.MM 2026-05-20 reviewed

Review groups LLM multimodal emotion studies into three directions
Multimodal Emotion Recognition with Large Language Models

Hongrui Zhang +6
cs.CR 2026-05-20 reviewed

Framework turns AI detection metrics into legal evidence thresholds
Verifiable Provenance and Watermarking for Generative AI: An Evidentiary Framework for International Operational Law and Domestic Courts

Gustav Olaf Yunus Laitinen-Fredriksson Lundström-Imanov +1
cs.MM 2026-05-19 reviewed

AI turns I-Ching coin casts into meaning-driven music
Music of Changing Lines: Toward a Culturally Situated Approach to the I-Ching

Ling Qi +2
cs.LG 2026-05-19 reviewed

Mixture of experts catches text-camouflaged fraud in graphs
CAMERA: Adapting to Semantic Camouflage in Unsupervised Text-Attributed Graph Fraud Detection

Junjun Pan +5
eess.IV 2026-05-19 reviewed

VVC techniques speed up partitioning but adapt to each VTM update
Partition Tree Search Acceleration for VVC: Survey and Evaluation with VTM Evolution

M.E.A. Kherchouche +4
eess.IV 2026-05-19 reviewed

Set shaping cuts steganography KL divergence by 25 percent
Set Shaping Theory as a Complementary Payload-Shaping Layer for Steganography

Aida Koch +3
cs.SD 2026-05-19 reviewed

Scaled simulations cut speech recognition errors over 30 percent
Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

Zhifei Xie +6
eess.IV 2026-05-19 reviewed

TADA adapts steganalysis to unknown JPEG pipelines
Tackle CSM in JPEG Steganalysis with Data Adaptation

Rony Abecidan (CRIStAL) +5
eess.IV 2026-05-19 reviewed

Semantic system cuts wireless video bandwidth by up to 75%
Perception-Aware Video Semantic Communication

Yinhuan Huang +1
cs.CV 2026-05-19 reviewed

Post-training lifts video models' physical consistency
PhyWorld: Physics-Faithful World Model for Video Generation

Pu Zhao +12
cs.CV 2026-05-18 reviewed

Self-supervised backbones boost artwork classification
Harnessing Self-Supervised Features for Art Classification

Federico Melis +4
cs.MM 2026-05-18 reviewed

Open-web context needed to forecast micro-video virality
Will It Go Viral? Grounding Micro-Video Popularity Prediction on the Open Web

Ryang Heo +1
eess.IV 2026-05-18 reviewed

Unpredictable motion destabilizes compressed video more than high motion
Evaluating the Effect of Compression on Video Temporal Consistency Using Objective Quality Metrics

Peter Zsoldos
eess.IV 2026-05-18 reviewed

Triplane features adapt to standard codecs for better volumetric compression
CATRF: Codec-Adaptive TriPlane Radiance Fields for Volumetric Content Delivery

Tung-I Chen +3
cs.IR 2026-05-18 reviewed

Dynamic modulation replaces static IDs in multimodal recommendations
Modality-Aware Identity Construction and Counterfactual Structure Learning for ID-Free Multimodal Recommendation

Hongjian Ma +4
eess.IV 2026-05-18 reviewed

Inter-frame learning cuts bits for LiDAR geometry
Inter-LPCM: Learning-based Inter-Frame Predictive Coding for LiDAR Point Cloud Compression

Chang Sun +5
cs.MM 2026-05-18 reviewed

Two-phase sampling matches contradictory audio prompts to video
CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation

Gyubin Lee +2
cs.CV 2026-05-17 reviewed

Framework binds faces and voices for consistent audio-video generation
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

Yuheng Chen +6
cs.CV 2026-05-17 reviewed

EchoSR runs lightweight super-resolution twice as fast with better quality
EchoSR: Efficient Context Harnessing for Lightweight Image Super-Resolution

Hanli Zhao +5
cs.SD 2026-05-17 reviewed

Optimal transport matches note distributions for piano transcription
A Distribution Matching Approach to Neural Piano Transcription with Optimal Transport

Weixing Wei +3
cs.IR 2026-05-17 reviewed

Dual model generates fashion images with text explanations
Dual-Diffusional Generative Fashion Recommendation

Mingzhe Yu +3
cs.GR 2026-05-16 reviewed

Single compressed atlas drives better immersive video
A Single Atlas is All You Need: Decoder-Side Gaussian Splatting for Immersive Video

Dawid Mieloch +1
cs.GR 2026-05-16 reviewed

Multi-agent loop lifts brand video yield to 89%
Genflow Ad Studio: A Compound AI Architecture for Brand-Aligned, Self-Correcting Video Generation

Debanshu Das +3
eess.IV 2026-05-16 reviewed

Legacy GPUs power real-time 8K60 for connected vehicles
Sustainable Real-Time 8K60 HEVC Encoding for V2X: Repurposing Legacy NVENC Hardware at the Vehicular Edge

Kasidis Arunruangsirilert +1
cs.CR 2026-05-15 reviewed

Logistic-map encryption plus Huffman compression handles large videos in one step
A Method for Securely Transmitting Large Video Files Using Chaotic Compression and Encryption

Shiladitya Bhattacharjee +4
eess.IV 2026-05-15 reviewed

AV2 cuts video bitrates nearly 30 percent vs AV1
Video Quality Evaluation Methodology and Result of AV2 Compression Performance

Zhijun Lei +4
eess.IV 2026-05-15 reviewed

Live streams switch resolutions on the fly to save 9% bitrate
Dynamic resolution switching for live streaming

Xin Xiong +4
cs.CV 2026-05-14 reviewed

t-FCW graphs classify point clouds in 7 seconds on GPU
A Unified Non-Parametric and Interpretable Point Cloud Analysis via t-FCW Graph Representation

Haijian Lai +6
cs.GR 2026-05-14 reviewed

Audio and text tuning enables motion edits in video models
Sound Sparks Motion: Audio and Text Tuning for Video Editing

AmirHossein Naghi Razlighi +4
cs.SD 2026-05-14 reviewed

SpeakerLLM turns speaker verification into natural-language reasoning
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

KiHyun Nam +4
cs.CV 2026-05-14 reviewed

Two-stage model fuses radar and satellite for sharper rain forecasts
VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting

Chunlei Shi +8
cs.CV 2026-05-14 reviewed

RC metrics align object removal scores with human perception
PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media

Fuhao Li +8
cs.MM 2026-05-14 reviewed

Multi-agent system resolves multimedia claims into editable reports
Contestable Multi-Agent Debate with Arena-based Argumentative Computation for Multimedia Verification

Truong Thanh Hung Nguyen +5
cs.CV 2026-05-14 reviewed

Delta Forcing curbs drift in interactive video generation
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

Yuheng Wu +6
cs.CV 2026-05-14 reviewed

Trust region limits teacher bias in autoregressive video generation
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

Yuheng Wu +6
cs.CV 2026-05-14 reviewed

Delta Forcing steers video generators to stay consistent after events
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

Yuheng Wu +6
cs.CV 2026-05-13 reviewed

Few channels control entire DiT image generation
Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers

Evelyn Turri +4
cs.CV 2026-05-13 reviewed

Backbone knowledge alone fools frozen deepfake detectors
Backbone is All You Need: Assessing Vulnerabilities of Frozen Foundation Models in Synthetic Image Forensics

Chiara Musso +3
cs.MA 2026-05-12 reviewed

Synthetic dataset benchmarks AI for swim coaching
Synthesizing the Expert: A Validated Multimodal Dataset for Trustworthy AI-Assisted Swimming Coaching

Ahmad Al-Kabbany +1