archive
Every paper Pith has read.
378 papers in cs.MM · page 5
-
Gaze-matched tuning improves AI simulation of user clicks
Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation
-
Tri-stage pruning speeds MVLA inference 2.55x by tracking 2D/3D salience
2D or 3D: Who Governs Salience in VLA Models? -- Tri-Stage Token Pruning Framework with Modality Salience Awareness
-
Self-generated pseudo-fakes boost deepfake detector generalization
Generalizing Video DeepFake Detection by Self-generated Audio-Visual Pseudo-Fakes
-
Frozen vision models locate image manipulations via a small adapter
Off-the-shelf Vision Models Benefit Image Manipulation Localization
-
Trajectories align motion and sound in AV generation
Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence
-
Hierarchical model generates vocal accompaniments matching SOTA
HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation
-
Compact model beats most VLMs on explainable sensitive content
SenBen: Sensitive Scene Graphs for Explainable Content Moderation
-
Fine-tuned LLMs translate QoS to QoE and back with strong accuracy
QoS-QoE Translation with Large Language Model
-
SemJudge judges AI art by symbolic and indexical meaning
On Semiotic-Grounded Interpretive Evaluation of Generative Art
-
INR conditioning lifts perceptual quality at under 0.05 bpp
DiV-INR: Extreme Low-Bitrate Diffusion Video Compression with INR Conditioning
-
Multimodal model adds readable explanations to encrypted traffic
Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark
-
HEVC ROI encryption reaches exact 8x8 coding-unit precision
A H.265/HEVC Fine-Grained ROI Video Encryption Algorithm Based on Coding Unit and Prompt Segmentation
-
UAV dataset with 6-DoF paths improves world model 3D predictions
MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models
-
Model turns audio into real-time character videos with stable identity
LPM 1.0: Video-based Character Performance Model
-
Cross-modal attention improves audio-visual deepfake detection
MSCT: Differential Cross-Modal Attention for Deepfake Detection
-
Vision models give inconsistent cultural metadata from images
Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
-
Benchmark tests AI on comparing music across track pairs
Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering
-
Multimodal signals outperform video for predicting driver automation transitions
BATON: A Multimodal Benchmark for Bidirectional Automation Transition Observation in Naturalistic Driving
-
SurFITR dataset shows forgery detectors fail on surveillance scenes
SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation
-
Benchmark of 1000 real cases tests AI lung-cancer reasoning
LungCURE: Benchmarking Multimodal Real-World Clinical Reasoning for Precision Lung Cancer Diagnosis and Treatment
-
AI tools turn course notes into useful videos for EAP students
AI Slop or AI-enhancement? Student perceptions of AI-generated media for an English for Academic Purposes course
-
Uncertainty Gaussians improve text-image sarcasm detection
URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection
-
Text prompts generate matching haptic and visual textures
Language-Guided Multimodal Texture Authoring via Generative Models
-
Graph embeddings flag microservice anomalies missed by load tests
From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures
-
Paired food photos let vision models estimate exact consumption
DietDelta: A Vision-Language Approach for Dietary Assessment via Before-and-After Images
-
Graph structure improves part-based image coherence
Graph-PiT: Enhancing Structural Coherence in Part-Based Image Synthesis via Graph Priors
-
Shared prototype space refines multimodal sentiment predictions
Learning Shared Sentiment Prototypes for Adaptive Multimodal Sentiment Analysis
-
Benchmark localizes hallucinations at token level in 200-word captions
DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions
-
Bounding boxes lock composed queries to exact instances
Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval
-
Edge model gates MLLM calls to cut video alert delay 77 percent
DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems
-
Channel importance boosts machine vision codec performance
CI-ICM: Channel Importance-driven Learned Image Coding for Machines
-
LLM pipeline creates STEM animations that raise test scores over slides
LLM2Manim: Pedagogy-Aware AI Generation of STEM Animations
-
Coordination mechanism improves music video edits
GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing
-
Multi-agent system creates coherent video mashups
DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing
-
Event overlay lifts robot pick success from 0% to 90% in dark
E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
-
Semantic priors from transformers reduce depth boundary artifacts
NAIMA: Semantics Aware RGB Guided Depth Super-Resolution
-
New diffusion model turns music into editable 3D conductor motions
BiTDiff: Fine-Grained 3D Conducting Motion Generation via BiMamba-Transformer Diffusion
-
Model generates complete audio scenes with speech from video and text
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
-
LightThinker++ cuts LLM peak tokens by 70% while raising accuracy
LightThinker++: From Reasoning Compression to Memory Management
-
Denoising-stage filter cuts image check time by up to 79 percent
EDGE-Shield: Efficient Denoising-staGE Shield for Violative Content Filtering via Scalable Reference-Based Matching
-
Dual-domain edges lift UAV detection to 36.8 AP
SFFNet: Synergistic Feature Fusion Network With Dual-Domain Edge Enhancement for UAV Image Object Detection
-
New dataset shows foreground degradations drive AR quality
ARIQA-3DS: A Stereoscopic Image Quality Assessment Dataset for Realistic Augmented Reality
-
Middle-layer evidence cuts video model hallucinations
STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models
-
SentiAvatar turns speech into real-time 3D avatar gestures and expressions
SentiAvatar: Towards Expressive and Interactive Digital Humans
-
Streaming 3D Gaussians improves viewpoint flexibility over video
Streaming Real-Time Rendered Scenes as 3D Gaussians
-
PaveBench adds interactive QA to pavement distress benchmarks
PaveBench: A Versatile Benchmark for Pavement Distress Perception and Interactive Vision-Language Analysis
-
Psychology stimuli improve distinction of mental disorders
Differential Mental Disorder Detection with Psychology-Inspired Multimodal Stimuli
-
AI now builds video trailers instead of selecting clips
Generative AI for Video Trailer Synthesis: From Extractive Heuristics to Autoregressive Creativity
-
Smart Transfer enables fast earthquake damage maps from satellite images
Smart Transfer: Leveraging Vision Foundation Model for Rapid Building Damage Mapping with Post-Earthquake VHR Imagery
-
Text descriptions locate urban positions on OSM tiles to meter accuracy
TOL: Textual Localization with OpenStreetMap