archive

Every paper Pith has read. Search by title, abstract, or pith.

378 papers in cs.MM · page 6

  1. cs.CV 2026-04-01 reviewed
    ProCap separates projections from physical scenes in AR

    ProCap: Projection-Aware Captioning for Spatial Augmented Reality

    Zimo Cao +3

  2. cs.CV 2026-04-01 reviewed
    Hypergraph contrastive learning recovers 3D crowd meshes

    Contrastive Multi-Modal Hypergraph Reasoning for 3D Crowd Mesh Recovery

    Minghao Sun +4

  3. cs.MM 2026-03-31 reviewed
    Movie dialogues train AI for finer multimodal control

    From Natural Alignment to Conditional Controllability in Multimodal Dialogue

    Zeyu Jin +7

  4. cs.CR 2026-03-23 reviewed
    Comic stories bypass safety in multimodal AI models

    Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models

    Rui Yang Tan +2

  5. cs.NI 2026-03-22 reviewed
    Semantic fields predict gaze in 360 video streams without training

    Training-Free Adaptive 360-degree Video Streaming via Semantic Potential Fields

    Aizierjiang Aiersilan +1

  6. cs.CL 2026-03-20 reviewed
    OmniTrace traces each output token to supporting input spans

    OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs

    Qianqi Yan +6

  7. cs.CL 2026-03-18 reviewed
    AI models miss key dental referrals that junior dentists catch

    Dental-TriageBench: Benchmarking Multimodal Reasoning for Hierarchical Dental Triage

    Ziyi He +9

  8. cs.CL 2026-03-18 reviewed
    Korean benchmark caps multimodal models at 42 percent accuracy

    KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

    Nahyun Lee +6

  9. cs.CV 2026-03-15 reviewed
    Discrete flow matching improves video dubbing sync

    DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization

    Ngoc-Son Nguyen +5

  10. cs.CV 2026-03-11 reviewed
    Zero-pair model aligns video to music via event curves

    V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

    Yan-Bo Lin +5

  11. cs.IR 2026-03-07 reviewed
    Simple captions lift text-to-video recall rates

    Understanding the Performance Plateau in Text-to-Video Retrieval: A Comprehensive Empirical and Linguistic Analysis

    Maria-Eirini Pegia +6

  12. cs.LG 2026-03-04 reviewed
    VDCook turns natural language into continuously updating video datasets

    VDCook: DIY video data cook your MLLMs

    Chengwei Wu

  13. cs.CR 2026-03-02 reviewed
    Valid C2PA claim can assert human origin for AI-watermarked image

    Authenticated Contradictions from Desynchronized Provenance and Watermarking

    Alexander Nemecek +3

  14. cs.CV 2026-03-02 reviewed
    Pyramidal memory distills long videos into semantic schemas

    From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

    Niu Lian +7

  15. cs.LG 2026-02-18 reviewed
    ModalImmune builds resilience in multimodal models by deliberately collapsing selected…

    ModalImmune: Immunity Driven Unlearning via Self Destructive Training

    Rong Fu +8

  16. cs.MM 2026-02-18 reviewed
    Hyperbolic hypergraphs recover emotions from incomplete multimodal signals

    Emotion Collider: Dual Hyperbolic Mirror Manifolds for Sentiment Recovery via Anti Emotion Reflection

    Rong Fu +7

  17. eess.IV 2026-02-12 reviewed
    Gradient maps detect CU steganography in HEVC videos

    H.265/HEVC Video Steganalysis Based on CU Block Structure Gradients and IPM Mapping

    Xiang Zhang +5

  18. cs.CR 2026-02-09 reviewed
    Edge server filters XR frames to protect privacy in cloud AI queries

    PRISM-XR: Empowering Privacy-Aware XR Collaboration with Multimodal Large Language Models

    Jiangong Chen +2

  19. cs.CV 2026-02-03 reviewed
    Label-free system judges AI images by self-built pairs

    ELIQ: A Label-Free Framework for Quality Assessment of Evolving AI-Generated Images

    Xinyue Li +7

  20. cs.CV 2026-02-02 reviewed
    Unpaired text replaces paired image data for MLLM pretraining

    Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

    Xiaomin Yu +14

  21. cs.MM 2026-01-31 reviewed
    Benchmark diagnoses failures in multi-talker AI videos

    MTAVG-Bench: A Diagnostic Benchmark for Multi-Talker Dialogue-Centric Audio-Video Generation

    Yang-Hao Zhou +14

  22. cs.IR 2026-01-30 reviewed
    Binary codes match dense vectors in wildlife observation retrieval

    Compact Hypercube Embeddings for Fast Text-based Wildlife Observation Retrieval

    Ilyass Moummad +9

  23. cs.CV 2026-01-27 reviewed
    Benchmark tests image editors on bilingual dense documents

    VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

    Hongzhu Yi +20

  24. cs.MM 2026-01-26 reviewed
    Multimodal sensors track developmental shifts in young laying hens

    Multimodal Digital Sensing of Early-Life Laying Hens: A Pilot Study Integrating Thermal, Acoustic, Optical-Flow and Environmental Data

    Yashan Dhaliwal +2

  25. cs.CV 2026-01-15 reviewed
    Multimodal model handles missing data to predict lung cancer survival

    Handling Missing Modalities in Multimodal Survival Prediction for Non-Small Cell Lung Cancer

    Filippo Ruffini +18

  26. cs.IR 2026-01-13 reviewed
    Dynamic benchmark VeriTaS adds claims quarterly to block pretraining leakage

    VeriTaS: The First Dynamic Benchmark for Multimodal Automated Fact-Checking

    Mark Rothermel +3

  27. cs.SD 2026-01-06 reviewed
    Single model hits top scores on video

    Omni2Sound: Towards Unified Video-Text-to-Audio Generation

    Yusheng Dai +6

  28. cs.CV 2026-01-06 reviewed
    View cone sampling improves 3D saliency maps in VR

    Robust Mesh Saliency Ground Truth Acquisition in VR via View Cone Sampling and Manifold Diffusion

    Guoquan Zheng +9

  29. cs.CV 2026-01-04 reviewed
    Global context plus text yields better few-shot fonts

    Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation

    Haonan Cai +2

  30. cs.CV 2026-01-04 reviewed
    LinMU matches VLM accuracy with linear complexity

    LinMU: Multimodal Understanding Made Linear

    Hongjie Wang +1

  31. cs.CV 2026-01-03 reviewed
    Layer masking and subspace split generalize deepfake detection

    Generalizable Deepfake Detection Based on Forgery-aware Layer Masking and Multi-artifact Subspace Decomposition

    Xiang Zhang +6

  32. cs.LG 2025-12-28 reviewed
    Federated clustering uses tensor low-rank to share client structure

    Federated Multi-Task Clustering

    Suyan Dai +5

  33. cs.IR 2025-12-26 reviewed
    Fusion of LVLM hidden states with IDs beats captions for micro-video recs

    Frozen LVLMs for Micro-Video Recommendation: A Systematic Study of Feature Extraction and Fusion

    Huatuan Sun +5

  34. cs.MM 2025-12-22 reviewed
    Decoupled streams lift AV speaker detection to 95.6% accuracy

    Dual-Stream Decoupled Learning for Temporal Consistency and Speaker Interaction in AVSD

    Junhao Xiao +8

  35. cs.CV 2025-12-21 reviewed
    Video moderation cuts communication 28x with on-device privacy

    FedVideoMAE: Efficient Privacy-Preserving Federated Video Moderation

    Ziyuan Tao +6

  36. cs.MM 2025-12-14 reviewed
    Best Omni-LLMs score only 65.3% on joint audio-visual benchmark

    JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation

    Jianghan Chao +5

  37. cs.CL 2025-12-08 reviewed
    This paper describes a method to add selected speech tokens from an ASR tokenizer into…

    A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification

    Nicolas Calbucura +2

  38. cs.CV 2025-11-22 reviewed
    Synthetic pipeline builds balanced video anomaly benchmark

    Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks

    Jie Li +6

  39. cs.CV 2025-11-15 reviewed
    Calibration fixes anchor shift from missing modalities

    Calibrated Multimodal Representation Learning with Missing Modalities

    Xiaohao Liu +6

  40. eess.SP 2025-11-11 reviewed
    Diffusion models let receivers rebuild content from tiny semantic cues

    Generative AI Meets 6G and Beyond: Diffusion Models for Semantic Communications

    Hai-Long Qin +8

  41. cs.CV 2025-11-07 reviewed
    Gaussian split lifts dynamic 3D quality on sparse camera setups

    Splatography: Sparse multi-view dynamic Gaussian Splatting for filmmaking challenges

    Adrian Azzarelli +2

  42. cs.CL 2025-11-04 reviewed
    Bangladesh AI scores 75-80% on bar exams at under 1% cost

    Mina: A Multilingual LLM-Powered Legal Assistant Agent for Bangladesh for Empowering Access to Justice

    Azmine Toushik Wasi +3

  43. cs.MM 2025-11-02 reviewed
    Webcam gestures turn into continuous music at 30 ms latency

    Gesture2Music: A Low-Latency Real-Time Framework for Continuous Gesture-Driven Music Generation

    Rathinaraja Jeyaraj +3

  44. cs.AI 2025-10-30 reviewed
    Visual keys occupy separate subspace from text keys in MLLMs

    MaLoRA: Gated Modality LoRA for Key-Space Alignment in Multimodal LLM Fine-Tuning

    Xinhan Zheng +4

  45. cs.NI 2025-10-29 reviewed
    Google Drive leads in consistent cloud upload performance on Wi-Fi and LTE

    Performance Evaluation of Multimedia Traffic in Cloud Storage Services over Wi-Fi and LTE Networks

    Albert Espinal +2

  46. eess.IV 2025-10-20 reviewed
    AV1 motion vectors speed up optical flow fourfold

    AV1 Motion Vector Fidelity and Application for Efficient Optical Flow

    Julien Zouein +2

  47. eess.IV 2025-10-09 reviewed
    Unified fusion of frames and sources sharpens remote sensing images

    SatFusion: A Unified Framework for Enhancing Remote Sensing Images via Multi-Frame and Multi-Source Images Fusion

    Yufei Tong +5

  48. cs.MM 2025-09-30 reviewed
    Twin DiT modules generate synced audio and video in one pass

    Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

    Chetwin Low +2

  49. cs.CV 2025-09-29 reviewed
    Latent space method edits heart rate in videos

    Editing Physiological Signals in Videos Using Latent Representations

    Tianwen Zhou +3

  50. cs.SD 2025-09-27 reviewed
    TV dialogue dataset raises voice role-play scores 38 percent

    AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models

    Wenyu Li +4