archive

Every paper Pith has read. Search by title, abstract, or pith.

378 papers in cs.MM · page 7

  1. cs.IR 2025-09-27 reviewed
    Bioart clusters into four patterns across 13 dimensions

    BioArtlas: Computational Clustering of Multi-Dimensional Complexity in Bioart

    Joonhyung Bae

  2. cs.SD 2025-09-26 reviewed
    VLM generates music from images with no training or fine-tuning

    Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach

    Zijian Zhao +2

  3. cs.GR 2025-09-24 reviewed
    Dual-path diffusion sharpens lip sync and head poses

    KSDiff: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation

    Tianle Lyu +2

  4. cs.GR 2025-09-23 reviewed
    Text Slider makes concept control 5x faster in diffusion models

    Text Slider: Efficient and Plug-and-Play Continuous Concept Control for Image/Video Synthesis via LoRA Adapters

    Pin-Yen Chiu +2

  5. cs.SD 2025-09-22 reviewed
    Model turns video into object-aware stereo sound

    StereoFoley: Object-Aware Stereo Audio Generation from Video

    Tornike Karchkhadze +6

  6. cs.CL 2025-08-06 reviewed
    Medical LLM benchmarks fail clinical and safety checks

    Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

    Wenting Chen +7

  7. cs.MM 2025-08-05 reviewed
    Dataset supplies 14,187 lifelog Q&A pairs from personal data

    OpenLifelogQA: An Open-Ended Multi-Modal Lifelog Question-Answering Dataset

    Quang-Linh Tran +5

  8. cs.SD 2025-08-05 reviewed
    One model restores and masters music from text instructions

    SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering

    Jan Melechovsky +3

  9. cs.LG 2025-07-20 reviewed
    Reasoning benchmark lets models explain time series anomalies

    Time-RA: Towards Time Series Reasoning for Anomaly Diagnosis with LLM Feedback

    Yiyuan Yang +8

  10. cs.MM 2025-06-04 reviewed
    Agentic framework cuts missing modality errors by 14 percent

    How Far Are We from Generating Missing Modalities with Foundation Models?

    Guanzhou Ke +4

  11. cs.CV 2025-05-28 reviewed
    17B sparse DiT generates SOTA images in seconds

    HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

    Qi Cai +21

  12. cs.SD 2025-05-28 reviewed
    Fixed decoder raises audio steganography quality by over 10 dB

    FGAS: Fixed Decoder Network-Based Audio Steganography with Adversarial Perturbation Generation

    Jialin Yan +6

  13. cs.CV 2025-05-27 reviewed
    Confidence signals guide attention to cut hallucinations

    Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration

    Mehrdad Fazli +3

  14. cs.SD 2025-05-27 reviewed
    Tailored designs succeed on music AVQA where general models struggle

    Music Audio-Visual Question Answering Requires Specialized Multimodal Designs

    Wenhao You +11

  15. cs.CV 2025-05-22 reviewed
    RL unlocks autonomous reasoning in text-to-image models

    GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

    Chengqi Duan +7

  16. cs.CV 2025-05-21 reviewed
    Dual encoders let LLM score video quality with context plus pixel detail

    Context and Pixel Aware Large Language Model for Video Quality Assessment

    Wen Wen +5

  17. cs.CV 2025-05-21 reviewed
    Simulated intent data trains models to spot news deception

    Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models

    Jiaying Wu +4

  18. cs.MM 2025-05-16 reviewed
    Drifted CLIP concepts improve meme metaphor detection at lower cost

    Concept Drift Guided LayerNorm Tuning for Efficient Multimodal Metaphor Identification

    Wenhao Qian +3

  19. cs.SD 2025-05-15 reviewed
    Random linear map turns audio embeddings into dynamic visuals

    LAV: Audio-Driven Dynamic Visual Generation with Neural Compression and StyleGAN2

    Jongmin Jung +1

  20. eess.IV 2025-05-06 reviewed
    Dynamic snapshots raise CRLM survival prediction accuracy

    A Dynamic Prognostic Prediction Method for Colorectal Cancer Liver Metastasis

    Wei Yang +5

  21. eess.AS 2025-04-25 reviewed

  22. cs.CV 2025-04-15 reviewed
    VLM agreement entropy flags OCR errors without labels

    Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR

    Yulong Zhang +7

  23. cs.MM 2025-04-14 reviewed
    Clustering evidence into narratives improves fact-checking

    Fact-Checking with Contextual Narratives: Leveraging Retrieval-Augmented LLMs for Social Media Analysis

    Arka Ujjal Dey +4

  24. cs.CV 2025-04-05 reviewed
    Collaborative calibration boosts KB-VQA accuracy by 4.7%

    Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering

    Jiaqi Deng +5

  25. cs.AR 2025-03-26 reviewed
    Edge criteria halve MACs for 8K super-resolution at 30 FPS

    ESSR: An 8K@30FPS Super-Resolution Accelerator With Edge Selective Network

    Chih-Chia Hsu +1

  26. cs.MM 2025-03-13 reviewed
    One model turns text, video or audio prompts into sound

    AudioX: A Unified Framework for Anything-to-Audio Generation

    Zeyue Tian +8

  27. cs.CV 2025-03-10 reviewed
    Video-to-IMU distillation matches supervised HAR without labels

    COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition

    Baiyu Chen +5

  28. cs.CV 2025-03-09 reviewed
    Reinforcement learning induces reasoning for image segmentation

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Yuqi Liu +6

  29. cs.CV 2025-02-19 reviewed
    Cross-attention generates context-aware human poses in scenes

    Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation

    Prasun Roy +4

  30. cs.DC 2024-12-27 reviewed
    Cloud gaming system serves twice as many users at higher quality

    Stimpack: An Adaptive Rendering Optimization System for Scalable Cloud Gaming

    Jin Heo +3

  31. cs.CV 2024-11-28 reviewed
    Spatially varying 2D Gaussians beat single-color 3D ones on view synthesis

    SVGS: Enhancing Gaussian Splatting Using Primitives with Spatially Varying Colors

    Rui Xu +9

  32. cs.MM 2024-11-26 reviewed
    Both global and shared position IDs align video text and speech

    Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis

    Akshita Gupta +5

  33. cs.MM 2024-11-08 reviewed
    Model predicts video quality curves for adaptive transcoding

    Content-Adaptive Rate-Quality Curve Prediction Model in Media Processing System

    Shibo Yin +6

  34. cs.SD 2024-11-05 reviewed
    Slide text cues extract target speaker from mixed audio

    pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues

    Ziyang Jiang +6

  35. cs.MM 2024-10-28 reviewed
    Survey splits document parsing into pipelines and VLM models

    Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

    Qintong Zhang +7

  36. cs.SD 2024-10-23 reviewed
    Equivariant transformer beats prototype on chord accompaniment

    Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment

    Weiliang Luo

  37. cs.MM 2024-09-26 reviewed
    Model predicts live streaming QoE from video features alone

    Subjective and Objective Quality-of-Experience Evaluation Study for Live Video Streaming

    Zehao Zhu +9

  38. cs.CV 2024-04-15 reviewed
    LLM subject boost raises T2I consistency on complex captions

    ANCHOR: LLM-driven Subject Conditioning for Text-to-Image Synthesis

    Aashish Anantha Ramakrishnan +2

  39. cs.CV 2024-01-08 reviewed
    3D Gaussian splatting delivers real-time explicit rendering

    A Survey on 3D Gaussian Splatting

    Guikun Chen +1

  40. cs.CV 2023-10-09 reviewed
    New tokenizer lets LLMs beat diffusion models on visuals

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Lijun Yu +15

  41. cs.CV 2023-06-26 reviewed
    Negative instructions reduce hallucinations in multi-modal models

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Fuxiao Liu +5

  42. cs.AI 2023-06-07 reviewed
    Diffusion model rebuilds images from noisy compressed semantics

    Generative Semantic Communication: Diffusion Models Beyond Bit Recovery

    Eleonora Grassucci +2

  43. cs.CV 2023-05-17 reviewed
    LVLMs describe objects missing from the image

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li +5

  44. cs.CV 2023-04-28 reviewed
    LLaMA-Adapter V2 adds visual instructions to LLaMA with 14M parameters

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    Peng Gao +11

  45. cs.CV 2023-03-28 reviewed
    LLaMA-Adapter tunes 7B model with 1.2M parameters

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Renrui Zhang +9

  46. cs.CV 2023-02-16 reviewed
    Small adapters add precise control to frozen text-to-image models

    T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

    Chong Mou +7

  47. cs.CV 2023-02-10 reviewed
    ControlNet adds edge, pose and depth controls to diffusion image models

    Adding Conditional Control to Text-to-Image Diffusion Models

    Lvmin Zhang +2

  48. cs.CV 2022-05-04 reviewed
    CoCa hits 91% ImageNet by joint contrastive and caption training

    CoCa: Contrastive Captioners are Image-Text Foundation Models

    Jiahui Yu +5

  49. cs.LG 2019-07-27 reviewed
    Instance-specific modality selection beats full set for multi-label tasks

    Many could be better than all: A novel instance-oriented algorithm for Multi-modal Multi-label problem

    Yi Zhang +4

  50. eess.IV 2019-07-26 reviewed
    Inverse CRF corrects color in multi-exposure fused images

    A Color Compensation Method Using Inverse Camera Response Function for Multi-exposure Image Fusion

    Artit Visavakitcharoen +2