archive

Every paper Pith has read. Search by title, abstract, or pith.

378 papers in cs.MM · page 3

  1. cs.MM 2026-05-03 reviewed
    Context map plus recursive sampling improves semantic video over MIMO-OFDM

    Contextual Wireless Video Semantic Communication in MIMO-OFDM Systems

    Bingyan Xie +5

  2. cs.CV 2026-05-03 reviewed
    MOC-3D fixes Janus faces and texture jumps via CLIP order and SPD distances

    MOC-3D: Manifold-Order Consistency for Text-to-3D Generation

    Chenyang Fan +7

  3. cs.SD 2026-05-03 reviewed
    Correcting fusion bottlenecks lifts AV task performance

    Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning

    Xinmeng Xu +5

  4. cs.IR 2026-05-02 reviewed
    Multi-turn queries sharpen health video retrieval

    Interactive Multi-Turn Retrieval for Health Videos

    Chengzheng Wu +5

  5. cs.MM 2026-05-02 reviewed
    Confidence scores let AV metrics ignore the bad modality

    Multimodal Confidence Modeling in Audio-Visual Quality Assessment

    Mayesha Maliha R. Mithila +1

  6. cs.SD 2026-05-02 reviewed
    Transformer turns music into 3D conducting gestures

    MG-Former: A Transformer-Based Framework for Music-Driven 3D Conducting Gesture Generation

    Ke Qiu +5

  7. eess.IV 2026-05-02 reviewed
    Blackwell NVENC UHQ gains quality at 400% latency cost

    Evolution of NVENC Efficiency: A Longitudinal Analysis of HQ and UHQ Tuning Efficiency, Latency and Energy Trade-offs

    Kasidis Arunruangsirilert +1

  8. cs.MM 2026-05-01 reviewed
    PRISM fixes spurious isolation in federated multimodal learning

    PRISM: Exposing and Resolving Spurious Isolation in Federated Multimodal Continual Learning

    Beining Wu +2

  9. cs.MM 2026-05-01 reviewed
    This paper introduces TD-Data

    CustomDancer: Customized Dance Recommendation by Text-Dance Retrieval

    Yawen Qin +2

  10. cs.NI 2026-05-01 reviewed
    EASE severs three anchors to unlearn federated multimodal data

    EASE: Federated Multimodal Unlearning via Entanglement-Aware Anchor Closure

    Zihao Ding +2

  11. cs.CV 2026-05-01 reviewed
    Stable cross-modal alignment flags AI-generated videos

    CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection

    Hang Wang +5

  12. cs.LG 2026-05-01 reviewed
    Two-stage agents curb dominance and coupling in multimodal learning

    Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration

    Chunlei Meng +9

  13. cs.LG 2026-05-01 reviewed
    Two-stage agents fix multimodal fusion by selecting useful exchanges

    Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration

    Chunlei Meng +9

  14. cs.GR 2026-05-01 reviewed
    Interactive visuals help non-experts grasp ML

    Towards Interactive Multimodal Representation of ML Functions for Human Understanding of ML

    Bokang Wang +6

  15. stat.CO 2026-04-30 reviewed
    Three streaming covariance algorithms agree exactly in exact arithmetic

    $2B$ or Not $2B$: A Tale of Three Algorithms for Streaming: Covariance Estimation after Welford and Chan-Golub-LeVeque

    Felix Reichel

  16. cs.MM 2026-04-30 reviewed
    KAN fusion raises robocall recall and F1 over baselines

    RoboKA: KAN Informed Multimodal Learning for RoboCall Surveillance System

    Nitin Choudhury +7

  17. eess.AS 2026-04-30 reviewed
    New benchmark makes AVSR considerably harder than LRS3

    LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition

    Doyeop Kwak +3

  18. cs.NI 2026-04-30 reviewed
    ReVo cuts volumetric video freezes by up to 95%

    ReVo: A Cross-Layer Reliable Volumetric Videoconferencing System

    Ankur Aditya +5

  19. cs.CV 2026-04-29 reviewed
    New codec shrinks 3D Gaussian Splatting over 34 times

    MesonGS++: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching

    Shuzhao Xie +12

  20. cs.CV 2026-04-29 reviewed
    MesonGS++ compresses 3D Gaussian models over 34 times post-training

    MesonGS++: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching

    Shuzhao Xie +12

  21. cs.NI 2026-04-29 reviewed
    5G architecture improves video QoE by 70% via subflow prioritization

    StreamGuard: Exploring a 5G Architecture for Efficient, Quality of Experience-Aware Video Conferencing

    Xuyang Cao +2

  22. cs.NI 2026-04-29 reviewed
    5G system boosts video QoE 70% while preserving background traffic

    StreamGuard: Exploring a 5G Architecture for Efficient, Quality of Experience-Aware Video Conferencing

    Xuyang Cao +2

  23. cs.NI 2026-04-29 reviewed
    5G video calls gain 70% better quality with new controller

    StreamGuard: Exploring a 5G Architecture for Efficient, Quality of Experience-Aware Video Conferencing

    Xuyang Cao +2

  24. cs.CV 2026-04-29 reviewed
    CNN identifies fashion houses at 78% accuracy by texture

    FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing

    Morayo Danielle Adeyemi +2

  25. cs.MM 2026-04-28 reviewed
    Query markers sharpen video model timing without training

    MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding

    Pengcheng Fang +2

  26. cs.MM 2026-04-28 reviewed
    Interpretation cue conditions multimodal predictions on dialogue context

    Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding

    Zhaoyan Pan +7

  27. cs.CV 2026-04-28 reviewed
    Benchmark separates video models on document evidence tasks

    FCMBench-Video: Benchmarking Document Video Intelligence

    Runze Cui +5

  28. cs.MM 2026-04-28 reviewed
    Rebalancing shared-private branches lifts multimodal sentiment accuracy

    Mitigating Shared-Private Branch Imbalance via Dual-Branch Rebalancing for Multimodal Sentiment Analysis

    Chunlei Meng +6

  29. cs.AI 2026-04-27 reviewed
    Hierarchical agents fix drift in AI video stories

    Co-Director: Agentic Generative Video Storytelling

    Yale Song +15

  30. cs.CV 2026-04-27 reviewed
    Meta-CoT raises image editing 15.8% by training on five meta-tasks

    Meta-CoT: Enhancing Granularity and Generalization in Image Editing

    Shiyi Zhang +10

  31. cs.CV 2026-04-27 reviewed
    Retrieval index alone flags new species or known ones

    DeepTaxon: An Interpretable Retrieval-Augmented Multimodal Framework for Unified Species Identification and Discovery

    Jiawei Wang +10

  32. cs.HC 2026-04-27 reviewed
    Two-stage reasoning hits 80% accuracy on human intention recognition

    IntentVLM: Open-Vocabulary Intention Recognition through Forward-Inverse Modeling with Video-Language Models

    Hamed Rahimi +4

  33. eess.IV 2026-04-27 reviewed
    Images reconstruct uniquely from sparse Laplacian fields

    Shared-kernel Wavelet Neural Networks for Poisson Image Reconstruction

    Yuanhao Gong +2

  34. cs.CV 2026-04-26 reviewed
    Real-time avatars stream at 20 FPS with teacher-level sync

    Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

    Chunyu Li +6

  35. cs.CV 2026-04-26 reviewed
    Shared high-level tokens plus separate decoders improve talking audio-video

    Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

    Zhen Ye +10

  36. cs.MM 2026-04-26 reviewed
    Coordinated AI agents generate movies with consistent characters

    CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration

    Tianyidan Xie +6

  37. cs.IR 2026-04-26 reviewed
    Adaptive SID learning preserves compatible overlaps for better recs

    Beyond Static Collision Handling: Adaptive Semantic ID Learning for Multimodal Recommendation at Industrial Scale

    Yongsen Pan +10

  38. cs.MM 2026-04-25 reviewed
    OceanPile corpus unifies sonar

    OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

    Yida Xue +7

  39. cs.CV 2026-04-25 reviewed
    Meta-model predicts deep network sample errors from task stats

    MetaErr: Towards Predicting Error Patterns in Deep Neural Networks

    Varun Totakura +1

  40. cs.CV 2026-04-25 reviewed
    Cascade filters skeletons then verifies semantics in anomaly search

    Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search

    Zequn Xie +6

  41. cs.CV 2026-04-25 reviewed
    Latent probing detects adult content in videos at 97% F1

    Latent Space Probing for Adult Content Detection in Video Generative Models

    Alizishaan Khatri +1

  42. cs.MM 2026-04-24 reviewed
    Benchmark shows T2V models strong on objects but weak on actions and audio sync

    BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios

    Advait Tilak +3

  43. cs.SD 2026-04-24 reviewed
    Beat-guided transformer quantizes MIDI rhythms to scores

    Transformer-Based Rhythm Quantization of Performance MIDI Using Beat Annotations

    Maximilian Wachter +2

  44. cs.MM 2026-04-23 reviewed
    Pre-speech eye movements predict memory timing in survivor interviews

    Looking Into the Past: Eye Movements Characterize Elements of Autobiographical Recall in Interviews with Holocaust Survivors

    Emily Zhou +4

  45. cs.GR 2026-04-23 reviewed
    Calibrated encoders track face identity across artistic styles

    StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition

    Kwan Yun +5

  46. cs.CV 2026-04-23 reviewed
    Uncertainty modeling enhances facial action unit detection

    UAU-Net: Uncertainty-aware Representation Learning and Evidential Classification for Facial Action Unit Detection

    Yuze Li +1

  47. cs.MM 2026-04-22 reviewed
    360 videos with footprint 3D models create realistic local flood experiences

    Realistic Virtual Flood Experience System Using 360° Videos and 3D City Models Constructed from Building Footprints

    Tatsuro Banno +4

  48. cs.SD 2026-04-22 reviewed
    Benchmark reveals AI music models perceive notation but miss theory

    ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence

    Menghe Ma +7

  49. cs.MM 2026-04-22 reviewed
    Semantic transport cuts agent bandwidth 64x for audio

    Sema: Semantic Transport for Real-Time Multimodal Agents

    Jiaying Meng +1

  50. cs.MM 2026-04-22 reviewed
    AttentionBender lets artists apply geometric transforms like rotation and scaling…

    AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe

    Adam Cole +1