archive
Every paper Pith has read. Search by title, abstract, or pith.
378 papers in cs.MM · page 3
-
Context map plus recursive sampling improves semantic video over MIMO-OFDM
Contextual Wireless Video Semantic Communication in MIMO-OFDM Systems
-
MOC-3D fixes Janus faces and texture jumps via CLIP order and SPD distances
MOC-3D: Manifold-Order Consistency for Text-to-3D Generation
-
Correcting fusion bottlenecks lifts AV task performance
Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning
-
Multi-turn queries sharpen health video retrieval
Interactive Multi-Turn Retrieval for Health Videos
-
Confidence scores let AV metrics ignore the bad modality
Multimodal Confidence Modeling in Audio-Visual Quality Assessment
-
Transformer turns music into 3D conducting gestures
MG-Former: A Transformer-Based Framework for Music-Driven 3D Conducting Gesture Generation
-
Blackwell NVENC UHQ gains quality at 400% latency cost
Evolution of NVENC Efficiency: A Longitudinal Analysis of HQ and UHQ Tuning Efficiency, Latency and Energy Trade-offs
-
PRISM fixes spurious isolation in federated multimodal learning
PRISM: Exposing and Resolving Spurious Isolation in Federated Multimodal Continual Learning
-
CustomDancer paper introduces TD-Data
CustomDancer: Customized Dance Recommendation by Text-Dance Retrieval
-
EASE severs three anchors to unlearn federated multimodal data
EASE: Federated Multimodal Unlearning via Entanglement-Aware Anchor Closure
-
Stable cross-modal alignment flags AI-generated videos
CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection
-
Two-stage agents curb dominance and coupling in multimodal learning
Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration
-
Two-stage agents fix multimodal fusion by selecting useful exchanges
Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration
-
Interactive visuals help non-experts grasp ML
Towards Interactive Multimodal Representation of ML Functions for Human Understanding of ML
-
Three streaming covariance algorithms agree exactly in exact arithmetic
$2B$ or Not $2B$: A Tale of Three Algorithms for Streaming: Covariance Estimation after Welford and Chan-Golub-LeVeque
-
KAN fusion raises robocall recall and F1 over baselines
RoboKA: KAN Informed Multimodal Learning for RoboCall Surveillance System
-
New benchmark makes AVSR considerably harder than LRS3
LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition
-
ReVo cuts volumetric video freezes by up to 95%
ReVo: A Cross-Layer Reliable Volumetric Videoconferencing System
-
New codec shrinks 3D Gaussian Splatting over 34 times
MesonGS++: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching
-
MesonGS++ compresses 3D Gaussian models over 34 times post-training
MesonGS++: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching
-
5G architecture improves video QoE by 70% via subflow prioritization
StreamGuard: Exploring a 5G Architecture for Efficient, Quality of Experience-Aware Video Conferencing
-
5G system boosts video QoE 70% while preserving background traffic
StreamGuard: Exploring a 5G Architecture for Efficient, Quality of Experience-Aware Video Conferencing
-
5G video calls gain 70% better quality with new controller
StreamGuard: Exploring a 5G Architecture for Efficient, Quality of Experience-Aware Video Conferencing
-
CNN identifies fashion houses at 78% accuracy by texture
FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing
-
Query markers sharpen video model timing without training
MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding
-
Interpretation cue conditions multimodal predictions on dialogue context
Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding
-
Benchmark separates video models on document evidence tasks
FCMBench-Video: Benchmarking Document Video Intelligence
-
Rebalancing shared-private branches lifts multimodal sentiment accuracy
Mitigating Shared-Private Branch Imbalance via Dual-Branch Rebalancing for Multimodal Sentiment Analysis
-
Hierarchical agents fix drift in AI video stories
Co-Director: Agentic Generative Video Storytelling
-
Meta-CoT raises image editing 15.8% by training on five meta-tasks
Meta-CoT: Enhancing Granularity and Generalization in Image Editing
-
Retrieval index alone flags new species or known ones
DeepTaxon: An Interpretable Retrieval-Augmented Multimodal Framework for Unified Species Identification and Discovery
-
Two-stage reasoning hits 80% accuracy on human intention recognition
IntentVLM: Open-Vocabulary Intention Recognition through Forward-Inverse Modeling with Video-Language Models
-
Images reconstruct uniquely from sparse Laplacian fields
Shared-kernel Wavelet Neural Networks for Poisson Image Reconstruction
-
Real-time avatars stream at 20 FPS with teacher-level sync
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
-
Shared high-level tokens plus separate decoders improve talking audio-video
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
-
Coordinated AI agents generate movies with consistent characters
CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration
-
Adaptive SID learning preserves compatible overlaps for better recs
Beyond Static Collision Handling: Adaptive Semantic ID Learning for Multimodal Recommendation at Industrial Scale
-
OceanPile corpus unifies sonar
OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models
-
Meta-model predicts deep network sample errors from task stats
MetaErr: Towards Predicting Error Patterns in Deep Neural Networks
-
Cascade filters skeletons then verifies semantics in anomaly search
Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search
-
Latent probing detects adult content in videos at 97% F1
Latent Space Probing for Adult Content Detection in Video Generative Models
-
Benchmark shows T2V models strong on objects but weak on actions and audio sync
BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios
-
Beat-guided transformer quantizes MIDI rhythms to scores
Transformer-Based Rhythm Quantization of Performance MIDI Using Beat Annotations
-
Pre-speech eye movements predict memory timing in survivor interviews
Looking Into the Past: Eye Movements Characterize Elements of Autobiographical Recall in Interviews with Holocaust Survivors
-
Calibrated encoders track face identity across artistic styles
StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition
-
Uncertainty modeling enhances facial action unit detection
UAU-Net: Uncertainty-aware Representation Learning and Evidential Classification for Facial Action Unit Detection
-
360° videos with footprint-based 3D models create realistic local flood experiences
Realistic Virtual Flood Experience System Using 360° Videos and 3D City Models Constructed from Building Footprints
-
Benchmark reveals AI music models perceive notation but miss theory
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence
-
Semantic transport cuts agent bandwidth 64x for audio
Sema: Semantic Transport for Real-Time Multimodal Agents
-
AttentionBender lets artists apply geometric transforms like rotation and scaling…
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe