archive
Every paper Pith has read.
378 papers in cs.MM
-
Swarical localizes flying light specks twice as fast
Swarical: An Integrated Hierarchical Approach to Localizing Flying Light Specks
-
Adaptive search fixes blind spots in high-res image perception for LLMs
CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception
-
Sketches control long video generation via independent shots
DrawVideo: Generating Long Video from Storyboard Keyframe Sketches
-
Semantic scores trigger early stops in video motion search
FAST-ME: Foundation-aware Adaptive Stopping for Motion Estimation for Efficient IoT Video Analysis
-
Multi-stream prompts cut deepfake errors in mixed audio
MixFake: Benchmarking and Enhancing Audio Deepfake Detection in Diverse Real-world Mixed Audio
-
Diffusion models match discrete models for live music
Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators
-
Sparse autoencoder links reasoning steps to image masks
SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation
-
Unified model handles many fashion search types at once
FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning
-
MLLM planner in ViT space guides DiT to SOTA video generation and edits
Bernini: Latent Semantic Planning for Video Diffusion
-
Multi-grained compression lifts long video QA accuracy
MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
-
Proxy reorders API keys to embed traceable watermarks
PEMark: Watermarking API Responses Based on Proxy Gateways and Position Encoding
-
Review groups LLM multimodal emotion studies into three directions
Multimodal Emotion Recognition with Large Language Models
-
Framework turns AI detection metrics into legal evidence thresholds
Verifiable Provenance and Watermarking for Generative AI: An Evidentiary Framework for International Operational Law and Domestic Courts
-
AI turns I-Ching coin casts into meaning-driven music
Music of Changing Lines: Toward a Culturally Situated Approach to the I-Ching
-
Mixture of experts catches text-camouflaged fraud in graphs
CAMERA: Adapting to Semantic Camouflage in Unsupervised Text-Attributed Graph Fraud Detection
-
VVC techniques speed up partitioning but adapt to each VTM update
Partition Tree Search Acceleration for VVC: Survey and Evaluation with VTM Evolution
-
Set shaping cuts steganography KL divergence by 25 percent
Set Shaping Theory as a Complementary Payload-Shaping Layer for Steganography
-
Scaled simulations cut speech recognition errors over 30 percent
Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation
-
TADA adapts steganalysis to unknown JPEG pipelines
Tackle CSM in JPEG Steganalysis with Data Adaptation
-
Semantic system cuts wireless video bandwidth by up to 75%
Perception-Aware Video Semantic Communication
-
Post-training lifts video models' physical consistency
PhyWorld: Physics-Faithful World Model for Video Generation
-
Self-supervised backbones boost artwork classification
Harnessing Self-Supervised Features for Art Classification
-
Open-web context needed to forecast micro-video virality
Will It Go Viral? Grounding Micro-Video Popularity Prediction on the Open Web
-
Unpredictable motion destabilizes compressed video more than high motion
Evaluating the Effect of Compression on Video Temporal Consistency Using Objective Quality Metrics
-
Triplane features adapt to standard codecs for better volumetric compression
CATRF: Codec-Adaptive TriPlane Radiance Fields for Volumetric Content Delivery
-
Dynamic modulation replaces static IDs in multimodal recommendations
Modality-Aware Identity Construction and Counterfactual Structure Learning for ID-Free Multimodal Recommendation
-
Inter-frame learning cuts bits for LiDAR geometry
Inter-LPCM: Learning-based Inter-Frame Predictive Coding for LiDAR Point Cloud Compression
-
Two-phase sampling matches contradictory audio prompts to video
CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation
-
Framework binds faces and voices for consistent audio-video generation
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
-
EchoSR runs lightweight super-resolution twice as fast with better quality
EchoSR: Efficient Context Harnessing for Lightweight Image Super-Resolution
-
Optimal transport matches note distributions for piano transcription
A Distribution Matching Approach to Neural Piano Transcription with Optimal Transport
-
Dual model generates fashion images with text explanations
Dual-Diffusional Generative Fashion Recommendation
-
Single compressed atlas drives better immersive video
A Single Atlas is All You Need: Decoder-Side Gaussian Splatting for Immersive Video
-
Multi-agent loop lifts brand video yield to 89%
Genflow Ad Studio: A Compound AI Architecture for Brand-Aligned, Self-Correcting Video Generation
-
Legacy GPUs power real-time 8K60 for connected vehicles
Sustainable Real-Time 8K60 HEVC Encoding for V2X: Repurposing Legacy NVENC Hardware at the Vehicular Edge
-
Logistic-map encryption plus Huffman compression handles large videos in one step
A Method for Securely Transmitting Large Video Files Using Chaotic Compression and Encryption
-
AV2 cuts video bitrates nearly 30 percent vs AV1
Video Quality Evaluation Methodology and Result of AV2 Compression Performance
-
Live streams switch resolutions on the fly to save 9% bitrate
Dynamic resolution switching for live streaming
-
t-FCW graphs classify point clouds in 7 seconds on GPU
A Unified Non-Parametric and Interpretable Point Cloud Analysis via t-FCW Graph Representation
-
Audio and text tuning enables motion edits in video models
Sound Sparks Motion: Audio and Text Tuning for Video Editing
-
SpeakerLLM turns speaker verification into natural-language reasoning
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
-
Two-stage model fuses radar and satellite for sharper rain forecasts
VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting
-
RC metrics align object removal scores with human perception
PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media
-
Multi-agent system resolves multimedia claims into editable reports
Contestable Multi-Agent Debate with Arena-based Argumentative Computation for Multimedia Verification
-
Delta Forcing curbs drift in interactive video generation
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
-
Trust region limits teacher bias in autoregressive video generation
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
-
Delta Forcing steers video generators to stay consistent after events
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
-
Few channels control entire DiT image generation
Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers
-
Backbone knowledge alone fools frozen deepfake detectors
Backbone is All You Need: Assessing Vulnerabilities of Frozen Foundation Models in Synthetic Image Forensics
-
Synthetic dataset benchmarks AI for swim coaching
Synthesizing the Expert: A Validated Multimodal Dataset for Trustworthy AI-Assisted Swimming Coaching