archive
Every paper Pith has read.
378 papers in cs.MM · page 6
-
ProCap separates projections from physical scenes in AR
ProCap: Projection-Aware Captioning for Spatial Augmented Reality
-
Hypergraph contrastive learning recovers 3D crowd meshes
Contrastive Multi-Modal Hypergraph Reasoning for 3D Crowd Mesh Recovery
-
Movie dialogues train AI for finer multimodal control
From Natural Alignment to Conditional Controllability in Multimodal Dialogue
-
Comic stories bypass safety in multimodal AI models
Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models
-
Semantic fields predict gaze in 360 video streams without training
Training-Free Adaptive 360-degree Video Streaming via Semantic Potential Fields
-
OmniTrace traces each output token to supporting input spans
OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs
-
AI models miss key dental referrals that junior dentists catch
Dental-TriageBench: Benchmarking Multimodal Reasoning for Hierarchical Dental Triage
-
Korean benchmark caps multimodal models at 42 percent accuracy
KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context
-
Discrete flow matching improves video dubbing sync
DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization
-
Zero-pair model aligns video to music via event curves
V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation
-
Simple captions lift text-to-video recall rates
Understanding the Performance Plateau in Text-to-Video Retrieval: A Comprehensive Empirical and Linguistic Analysis
-
VDCook turns natural language into continuously updating video datasets
VDCook: DIY video data cook your MLLMs
-
Valid C2PA claim can assert human origin for AI-watermarked image
Authenticated Contradictions from Desynchronized Provenance and Watermarking
-
Pyramidal memory distills long videos into semantic schemas
From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
-
ModalImmune builds resilience in multimodal models by deliberately collapsing selected…
ModalImmune: Immunity Driven Unlearning via Self Destructive Training
-
Hyperbolic hypergraphs recover emotions from incomplete multimodal signals
Emotion Collider: Dual Hyperbolic Mirror Manifolds for Sentiment Recovery via Anti Emotion Reflection
-
Gradient maps detect CU steganography in HEVC videos
H.265/HEVC Video Steganalysis Based on CU Block Structure Gradients and IPM Mapping
-
Edge server filters XR frames to protect privacy in cloud AI queries
PRISM-XR: Empowering Privacy-Aware XR Collaboration with Multimodal Large Language Models
-
Label-free system judges AI images by self-built pairs
ELIQ: A Label-Free Framework for Quality Assessment of Evolving AI-Generated Images
-
Unpaired text replaces paired image data for MLLM pretraining
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
-
Benchmark diagnoses failures in multi-talker AI videos
MTAVG-Bench: A Diagnostic Benchmark for Multi-Talker Dialogue-Centric Audio-Video Generation
-
Binary codes match dense vectors in wildlife observation retrieval
Compact Hypercube Embeddings for Fast Text-based Wildlife Observation Retrieval
-
Benchmark tests image editors on bilingual dense documents
VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents
-
Multimodal sensors track developmental shifts in young laying hens
Multimodal Digital Sensing of Early-Life Laying Hens: A Pilot Study Integrating Thermal, Acoustic, Optical-Flow and Environmental Data
-
Multimodal model handles missing data to predict lung cancer survival
Handling Missing Modalities in Multimodal Survival Prediction for Non-Small Cell Lung Cancer
-
Dynamic benchmark VeriTaS adds claims quarterly to block pretraining leakage
VeriTaS: The First Dynamic Benchmark for Multimodal Automated Fact-Checking
-
Single model hits top scores on video-text-to-audio generation
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
-
View cone sampling improves 3D saliency maps in VR
Robust Mesh Saliency Ground Truth Acquisition in VR via View Cone Sampling and Manifold Diffusion
-
Global context plus text yields better few-shot fonts
Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
-
LinMU matches VLM accuracy with linear complexity
LinMU: Multimodal Understanding Made Linear
-
Layer masking and subspace split generalize deepfake detection
Generalizable Deepfake Detection Based on Forgery-aware Layer Masking and Multi-artifact Subspace Decomposition
-
Federated clustering uses tensor low-rank to share client structure
Federated Multi-Task Clustering
-
Fusion of LVLM hidden states with IDs beats captions for micro-video recs
Frozen LVLMs for Micro-Video Recommendation: A Systematic Study of Feature Extraction and Fusion
-
Decoupled streams lift AV speaker detection to 95.6% accuracy
Dual-Stream Decoupled Learning for Temporal Consistency and Speaker Interaction in AVSD
-
Video moderation cuts communication 28x with on-device privacy
FedVideoMAE: Efficient Privacy-Preserving Federated Video Moderation
-
Best Omni-LLMs score only 65.3% on joint audio-visual benchmark
JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation
-
This paper describes a method to add selected speech tokens from an ASR tokenizer into…
A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification
-
Synthetic pipeline builds balanced video anomaly benchmark
Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks
-
Calibration fixes anchor shift from missing modalities
Calibrated Multimodal Representation Learning with Missing Modalities
-
Diffusion models let receivers rebuild content from tiny semantic cues
Generative AI Meets 6G and Beyond: Diffusion Models for Semantic Communications
-
Gaussian split lifts dynamic 3D quality on sparse camera setups
Splatography: Sparse multi-view dynamic Gaussian Splatting for filmmaking challenges
-
Bangladesh AI scores 75-80% on bar exams at under 1% cost
Mina: A Multilingual LLM-Powered Legal Assistant Agent for Bangladesh for Empowering Access to Justice
-
Webcam gestures turn into continuous music at 30 ms latency
Gesture2Music: A Low-Latency Real-Time Framework for Continuous Gesture-Driven Music Generation
-
Visual keys occupy separate subspace from text keys in MLLMs
MaLoRA: Gated Modality LoRA for Key-Space Alignment in Multimodal LLM Fine-Tuning
-
Google Drive leads in consistent cloud upload performance on Wi-Fi and LTE
Performance Evaluation of Multimedia Traffic in Cloud Storage Services over Wi-Fi and LTE Networks
-
AV1 motion vectors speed up optical flow fourfold
AV1 Motion Vector Fidelity and Application for Efficient Optical Flow
-
Unified fusion of frames and sources sharpens remote sensing images
SatFusion: A Unified Framework for Enhancing Remote Sensing Images via Multi-Frame and Multi-Source Images Fusion
-
Twin DiT modules generate synced audio and video in one pass
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
-
Latent space method edits heart rate in videos
Editing Physiological Signals in Videos Using Latent Representations
-
TV dialogue dataset raises voice role-play scores 38 percent
AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models