archive

Every paper Pith has read. Search by title, abstract, or pith.

378 papers in cs.MM · page 5

  1. cs.MM 2026-04-10 reviewed
    Gaze-matched tuning improves AI simulation of user clicks

    Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation

    Lingfeng Huang +4

  2. cs.MM 2026-04-10 reviewed
    Tri-stage pruning speeds MVLA inference 2.55x by tracking 2D/3D salience

    2D or 3D: Who Governs Salience in VLA Models? -- Tri-Stage Token Pruning Framework with Modality Salience Awareness

    Zihao Zheng +10

  3. cs.MM 2026-04-10 reviewed
    Self-generated pseudo-fakes boost deepfake detector generalization

    Generalizing Video DeepFake Detection by Self-generated Audio-Visual Pseudo-Fakes

    Zihe Wei +1

  4. cs.CV 2026-04-10 reviewed
    Frozen vision models locate image manipulations via a small adapter

    Off-the-shelf Vision Models Benefit Image Manipulation Localization

    Zhengxuan Zhang +4

  5. cs.CV 2026-04-10 reviewed
    Trajectories align motion and sound in AV generation

    Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

    Junchao Liao +7

  6. cs.SD 2026-04-10 reviewed
    Hierarchical model generates vocal accompaniments matching SOTA

    HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation

    Jian Zhu +4

  7. cs.CV 2026-04-09 reviewed
    Compact model beats most VLMs on explainable sensitive content

    SenBen: Sensitive Scene Graphs for Explainable Content Moderation

    Fatih Cagatay Akyon +1

  8. cs.MM 2026-04-09 reviewed
    Fine-tuned LLMs translate QoS to QoE and back with strong accuracy

    QoS-QoE Translation with Large Language Model

    Yingjie Yu +5

  9. cs.CV 2026-04-09 reviewed
    SemJudge judges AI art by symbolic and indexical meaning

    On Semiotic-Grounded Interpretive Evaluation of Generative Art

    Ruixiang Jiang +1

  10. eess.IV 2026-04-09 reviewed
    INR conditioning lifts perceptual quality at under 0.05 bpp

    DiV-INR: Extreme Low-Bitrate Diffusion Video Compression with INR Conditioning

    Eren Çetin +5

  11. cs.CR 2026-04-09 reviewed
    Multimodal model adds readable explanations to encrypted traffic

    Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark

    Longgang Zhang +3

  12. eess.IV 2026-04-09 reviewed
    HEVC ROI encryption reaches exact 8x8 coding-unit precision

    A H.265/HEVC Fine-Grained ROI Video Encryption Algorithm Based on Coding Unit and Prompt Segmentation

    Xiang Zhang +6

  13. cs.CV 2026-04-09 reviewed
    UAV dataset with 6-DoF paths improves world model 3D predictions

    MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models

    Zile Guo +6

  14. cs.CV 2026-04-09 reviewed
    Model turns audio into real-time character videos with stable identity

    LPM 1.0: Video-based Character Performance Model

    Ailing Zeng +24

  15. cs.CV 2026-04-09 reviewed
    Cross-modal attention improves audio-visual deepfake detection

    MSCT: Differential Cross-Modal Attention for Deepfake Detection

    Fangda Wei +5

  16. cs.CV 2026-04-08 reviewed
    Vision models give inconsistent cultural metadata from images

    Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

    Yuechen Jiang +6

  17. cs.IR 2026-04-08 reviewed
    Benchmark tests AI on comparing music across track pairs

    Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering

    Junyoung Koh +7

  18. cs.HC 2026-04-08 reviewed
    Multimodal signals outperform video for predicting driver automation transitions

    BATON: A Multimodal Benchmark for Bidirectional Automation Transition Observation in Naturalistic Driving

    Yuhang Wang +5

  19. cs.CV 2026-04-08 reviewed
    SurFITR dataset shows forgery detectors fail on surveillance scenes

    SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation

    Qizhou Wang +2

  20. cs.MM 2026-04-08 reviewed
    Benchmark of 1000 real cases tests AI lung-cancer reasoning

    LungCURE: Benchmarking Multimodal Real-World Clinical Reasoning for Precision Lung Cancer Diagnosis and Treatment

    Fangyu Hao +16

  21. cs.CY 2026-04-08 reviewed
    AI tools turn course notes into useful videos for EAP students

    AI Slop or AI-enhancement? Student perceptions of AI-generated media for an English for Academic Purposes course

    David James Woo +2

  22. cs.CV 2026-04-08 reviewed
    Uncertainty Gaussians improve text-image sarcasm detection

    URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection

    Zhenyu Wang +5

  23. cs.HC 2026-04-07 reviewed
    Text prompts generate matching haptic and visual textures

    Language-Guided Multimodal Texture Authoring via Generative Models

    Wanli Qian +4

  24. cs.LG 2026-04-07 reviewed
    Graph embeddings flag microservice anomalies missed by load tests

    From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures

    Srinidhi Madabhushi +5

  25. cs.CV 2026-04-07 reviewed
    Paired food photos let vision models estimate exact consumption

    DietDelta: A Vision-Language Approach for Dietary Assessment via Before-and-After Images

    Gautham Vinod +3

  26. cs.CV 2026-04-07 reviewed
    Graph structure improves part-based image coherence

    Graph-PiT: Enhancing Structural Coherence in Part-Based Image Synthesis via Graph Priors

    Junbin Zhang +4

  27. cs.MM 2026-04-07 reviewed
    Shared prototype space refines multimodal sentiment predictions

    Learning Shared Sentiment Prototypes for Adaptive Multimodal Sentiment Analysis

    Chen Su +2

  28. cs.CV 2026-04-07 reviewed
    Benchmark localizes hallucinations at token level in 200-word captions

    DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

    Xinran Wang +9

  29. cs.CV 2026-04-07 reviewed
    Bounding boxes lock composed queries to exact instances

    Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval

    Yuxin Yang +8

  30. cs.MM 2026-04-07 reviewed
    Edge model gates MLLM calls to cut video alert delay 77 percent

    DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems

    Qi Guo +4

  31. eess.IV 2026-04-07 reviewed
    Channel importance boosts machine vision codec performance

    CI-ICM: Channel Importance-driven Learned Image Coding for Machines

    Yun Zhang +5

  32. cs.MM 2026-04-07 reviewed
    LLM pipeline creates STEM animations that raise test scores over slides

    LLM2Manim: Pedagogy-Aware AI Generation of STEM Animations

    Aastha Joshi +5

  33. cs.MA 2026-04-06 reviewed
    Coordination mechanism improves music video edits

    GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing

    Zihao Lin +9

  34. cs.CV 2026-04-06 reviewed
    Multi-agent system creates coherent video mashups

    DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing

    Ke Li +6

  35. cs.CV 2026-04-06 reviewed
    Event overlay lifts robot pick success from 0% to 90% in dark

    E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

    Jiajun Zhai +4

  36. eess.IV 2026-04-06 reviewed
    Semantic priors from transformers reduce depth boundary artifacts

    NAIMA: Semantics Aware RGB Guided Depth Super-Resolution

    Tayyab Nasir +2

  37. cs.CV 2026-04-06 reviewed
    New diffusion model turns music into editable 3D conductor motions

    BiTDiff: Fine-Grained 3D Conducting Motion Generation via BiMamba-Transformer Diffusion

    Tianzhi Jia +6

  38. cs.SD 2026-04-06 reviewed
    Model generates complete audio scenes with speech from video and text

    OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

    Weiguo Pian +6

  39. cs.CL 2026-04-04 reviewed
    LightThinker++ cuts LLM peak tokens by 70% while raising accuracy

    LightThinker++: From Reasoning Compression to Memory Management

    Yuqi Zhu +9

  40. cs.CV 2026-04-04 reviewed
    Denoising-stage filter cuts image check time by up to 79 percent

    EDGE-Shield: Efficient Denoising-staGE Shield for Violative Content Filtering via Scalable Reference-Based Matching

    Takara Taniguchi +5

  41. cs.CV 2026-04-03 reviewed
    Dual-domain edges lift UAV detection to 36.8 AP

    SFFNet: Synergistic Feature Fusion Network With Dual-Domain Edge Enhancement for UAV Image Object Detection

    Wenfeng Zhang +6

  42. eess.IV 2026-04-03 reviewed
    New dataset shows foreground degradations drive AR quality

    ARIQA-3DS: A Stereoscopic Image Quality Assessment Dataset for Realistic Augmented Reality

    Aymen Sekhri +2

  43. cs.CV 2026-04-03 reviewed
    Middle-layer evidence cuts video model hallucinations

    STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models

    Linfeng Fan +3

  44. cs.CV 2026-04-03 reviewed
    SentiAvatar turns speech into real-time 3D avatar gestures and expressions

    SentiAvatar: Towards Expressive and Interactive Digital Humans

    Chuhao Jin +7

  45. eess.IV 2026-04-03 reviewed
    Streaming 3D Gaussians improves viewpoint flexibility over video

    Streaming Real-Time Rendered Scenes as 3D Gaussians

    Matti Siekkinen +1

  46. cs.CV 2026-04-03 reviewed
    PaveBench adds interactive QA to pavement distress benchmarks

    PaveBench: A Versatile Benchmark for Pavement Distress Perception and Interactive Vision-Language Analysis

    Dexiang Li +5

  47. cs.MM 2026-04-03 reviewed
    Psychology stimuli improve distinction of mental disorders

    Differential Mental Disorder Detection with Psychology-Inspired Multimodal Stimuli

    Zhiyuan Zhou +11

  48. cs.CV 2026-04-03 reviewed
    AI now builds video trailers instead of selecting clips

    Generative AI for Video Trailer Synthesis: From Extractive Heuristics to Autoregressive Creativity

    Abhishek Dharmaratnakar +3

  49. cs.CV 2026-04-03 reviewed
    Smart Transfer enables fast earthquake damage maps from satellite images

    Smart Transfer: Leveraging Vision Foundation Model for Rapid Building Damage Mapping with Post-Earthquake VHR Imagery

    Hao Li +6

  50. cs.CV 2026-04-02 reviewed
    Text descriptions locate urban positions on OSM tiles to meter accuracy

    TOL: Textual Localization with OpenStreetMap

    Youqi Liao +8