pith.

archive

Every paper Pith has read. Search by title, abstract, or pith.

378 papers in cs.MM · page 2

  1. cs.MM 2026-05-12 reviewed
    3B omni model matches 30B on clean benchmarks

    Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    Che Liu +6

  2. cs.MM 2026-05-12 reviewed
    3B omni-model matches 30B on debiased benchmarks

    Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    Che Liu +6

  3. cs.IR 2026-05-12 reviewed
    ZipRerank matches top multimodal rerankers at 10x lower latency

    Very Efficient Listwise Multimodal Reranking for Long Documents

    Yiqun Sun +2

  4. cs.IR 2026-05-12 reviewed
    Critic and generator agents iteratively refine research outlines

    AgentDisCo: Towards Disentanglement and Collaboration in Open-ended Deep Research Agents

    Jiarui Jin +4

  5. cs.MM 2026-05-12 reviewed
    Adaptive path choice lifts unified multimodal reasoning

    UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

    Hayes Bai +4

  6. cs.CV 2026-05-11 reviewed
    Unified transformer generates images from raw pixels without VAEs

    HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

    Qi Cai +24

  7. cs.MM 2026-05-11 reviewed
    Targeted head boost cuts hallucinations in vision-language models

    Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

    Yangneng Chen +6

  8. cs.MM 2026-05-11 reviewed
    Benchmark shows AI models struggle with evidence in multimodal fact-checking

    RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild

    Danni Xu +3

  9. cs.MM 2026-05-11 reviewed
    New benchmark links social posts to fact-check evidence for model testing

    RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild

    Danni Xu +3

  10. cs.MM 2026-05-11 reviewed
    User queries alter video retrieval model behavior

    FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries

    Qijie You +6

  11. eess.IV 2026-05-11 reviewed
    Tube packages stabilize video recovery faster in semantic HARQ

    Tube-Structured Incremental Semantic HARQ for Generative Video Receivers

    Xuesong Wang +2

  12. eess.IV 2026-05-10 reviewed
    Neural preprocessor lifts H.264 perceptual scores 27 percent on UVG

    Kelvin v1.0: A Neural Pre-Encoder for H.264: A standards-compliant learned preprocessor with -27.62% BD-VMAF on UVG

    Marco Graziano

  13. cs.CV 2026-05-10 reviewed
    Multi-scale supervision cuts pose errors in sign animation

    KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation

    Guanyi Du +3

  14. eess.IV 2026-05-10 reviewed
    Multi-layer CLIP similarities predict machine image preferences

    ML-CLIPSim: Multi-Layer CLIP Similarity for Machine-Oriented Image Quality

    Feng Ding +5

  15. cs.MM 2026-05-10 reviewed
    Dual pathways fix conflicts in text-video-audio intent recognition

    Mitigating Multimodal Inconsistency via Cognitive Dual-Pathway Reasoning for Intent Recognition

    Yifan Wang +4

  16. cs.CV 2026-05-10 reviewed
    Invariant relations to known prototypes turn GCD into reliable pattern matching

    Relational Retrieval: Leveraging Known-Novel Interactions for Generalized Category Discovery

    Yulin Xu +3

  17. cs.AI 2026-05-10 reviewed
    Three agents refine knowledge to lift few-shot time series classification

    Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning

    Lin Li +9

  18. cs.AI 2026-05-10 reviewed
    Three-agent system lifts VLM accuracy on few-shot time series tasks

    Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning

    Lin Li +9

  19. cs.CL 2026-05-10 reviewed
    Home activity benchmark shows AI question-answering gaps

    HOME-KGQA: A Benchmark Dataset for Multimodal Knowledge Graph Question Answering on Household Daily Activities

    Shusaku Egami +7

  20. cs.GR 2026-05-10 reviewed
    Color-adaptive scheme raises 3D Gaussian streaming quality 5-20 dB

    CAGS: Color-Adaptive Volumetric Video Streaming with Dynamic 3D Gaussian Splatting

    Daheng Yin +9

  21. cs.CV 2026-05-09 reviewed
    Gaussian splatting relights VP scenes by sampling LED backgrounds directly

    Relightable Gaussian Splatting for Virtual Production Using Image-Based Illumination

    Adrian Azzarelli +3

  22. eess.IV 2026-05-09 reviewed
    Neural network adapts frame rate and resolution for better streamed graphics

    Streaming of rendered content with adaptive frame rate and resolution

    Yaru Liu +2

  23. cs.MM 2026-05-09 reviewed
    Edge offloading and pruning cut multi-condition T2I latency by 25%

    Accelerating Multi-Condition T2I Generation via Adaptive Condition Offloading and Pruning

    Yuxin Kong +4

  24. cs.CV 2026-05-09 reviewed
    Unison aligns motion, speech and sound in video generation

    Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

    Shihao Cheng +8

  25. cs.CV 2026-05-09 reviewed
    Uni-modal focus sharpens weakly supervised AVVP

    EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

    Huilai Li +5

  26. eess.IV 2026-05-09 reviewed
    Thin clients stream interactive 3D Gaussian Splatting over HTTP/3

    Thin-Client Interactive Gaussian Adaptive Streaming over HTTP/3

    Emanuele Artioli +6

  27. cs.MM 2026-05-08 reviewed
    Anisotropic correction fixes modality gaps for unpaired training

    Anisotropic Modality Align

    Xiaomin Yu +10

  28. cs.MM 2026-05-08 reviewed
    Multimedia benchmark shows access method guides terminal agent workflows

    MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

    Chiyeong Heo +6

  29. cs.SD 2026-05-08 reviewed
    Decomposed stages yield better chord variety and rules compliance

    A Decomposed Retrieval-Edit-Rerank Framework for Chord Generation

    Qiqi He +3

  30. cs.CR 2026-05-08 reviewed
    Deleted videos on Honeywell surveillance systems remain recoverable

    Forensic analysis of video data deletion and recovery in Honeywell surveillance file system

    Jinhee Yoon +1

  31. cs.GR 2026-05-08 reviewed
    Semantic codebook creates style-matched co-speech gestures

    PersonaGest: Personalized Co-Speech Gesture Generation with Semantic-Guided Hierarchical Motion Representation

    Junchuan Zhao +2

  32. cs.SD 2026-05-08 reviewed
    Audio-video models fail to keep physics consistent in transitions

    Do Joint Audio-Video Generation Models Understand Physics?

    Zijun Cui +10

  33. cs.CL 2026-05-07 reviewed
    MIST benchmark shows LLMs lag on voice IoT tasks

    MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

    Maximillian Chen +5

  34. cs.CV 2026-05-07 reviewed
    Benchmark shows little progress in multimodal domain generalization

    Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

    Hao Dong +5

  35. eess.IV 2026-05-07 reviewed
    Neural codec with FFT encoder outperforms tokenizers on sensors

    LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation

    Dan Jacobellis +1

  36. cs.MM 2026-05-07 reviewed
    Contrastive and uncertainty methods improve emotion recognition

    Modality-Aware Contrastive and Uncertainty-Regularized Emotion Recognition

    Yan Zhuang +4

  37. cs.CV 2026-05-07 reviewed
    The paper introduces Holmes, a hierarchical evidential learning method for retrieving…

    Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval

    Jun Li +7

  38. cs.CV 2026-05-07 reviewed
    LLM and RL coupling with VR feedback creates adaptive 3D scenes

    Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling

    Anh H. Vo +4

  39. cs.MM 2026-05-06 reviewed
    Dual paths learn when to fuse or drop modalities in emotion recognition

    To Fuse or to Drop? Dual-Path Learning for Resolving Modality Conflicts in Multimodal Emotion Recognition

    Yangchen Yu +7

  40. cs.SD 2026-05-05 reviewed
    0.1B omni model reaches 0.09 CER in speech-text consistency

    MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

    Jingyao Gong

  41. cs.CV 2026-05-05 reviewed
    Conformal loop self-calibrates multimodal models on noisy data

    Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration

    Xun Jiang +7

  42. cs.MM 2026-05-05 reviewed
    Imitation learning splits music colors across multiple stage lights

    Stage Light is Sequence$^2$: Multi-Light Control via Imitation Learning

    Zijian Zhao +3

  43. cs.SD 2026-05-05 reviewed
    Aesthetic features lift AI music preference prediction on unseen generators

    APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

    Jaavid Aktar Husain +1

  44. cs.CV 2026-05-05 reviewed
    Dual-system refines scores to boost self-supervised forgery detectors

    Enhancing Self-Supervised Talking Head Forgery Detection via a Training-Free Dual-System Framework

    Ke Liu +7

  45. cs.LG 2026-05-05 reviewed
    Dimension-aware quantiles stabilize multimodal graph unlearning

    Stable Multimodal Graph Unlearning via Feature-Dimension Aware Quantile Selection

    Jingjing Zhou +7

  46. cs.MM 2026-05-04 reviewed
    Reservoir of k streams bounds uptime by harmonic number

    The Streaming Reservoir Convergence Theorem: A Prospect-Theoretic Framework for Multi-Provider Adaptive Streaming

    Justice Owusu Agyemang +6

  47. cs.MM 2026-05-04 reviewed
    CPR restores periodic structure from locally private time series

    Period-conscious Time-series Reconstruction under Local Differential Privacy

    Yaxuan Wang +4

  48. cs.SD 2026-05-04 reviewed
    Offline distillation from DP teacher prevents collapse in private speech classifiers

    Private Speech Classification without Collapse: Stabilized DP Training and Offline Distillation

    Yadi Wen +4

  49. cs.CV 2026-05-04 reviewed
    Video search now returns multiple moments or none

    Retrieving Any Relevant Moments: Benchmark and Models for Generalized Moment Retrieval

    Yiming Ding +6

  50. cs.MM 2026-05-03 reviewed
    Nine systems compete in revived expressive piano rendering contest

    RenCon 2025: Revival of the Expressive Performance Rendering Competition

    Huan Zhang +9