The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
Mixed citations
Title resolution pending
Mixed citation behavior. Most common role is background (60%).
citation-role summary
citation-polarity summary
representative citing papers
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
The first benchmark for AI-generated scientific figure detection shows existing detectors fail in zero-shot transfer, overfit to specific generators, and break under common image corruptions.
Image meanings grow more context-dependent with semantic abstraction, requiring narrative grounding for accurate retrieval at higher levels.
SpatialGrammar provides a grid-based DSL and compiler that lets LLMs generate collision-free 3D indoor scenes more reliably than raw-coordinate or code-based approaches.
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
CFSR reframes shadow removal as a physics-constrained process using geometric and semantic priors from depth, DINO, CLIP, and frequency decoupling to achieve claimed state-of-the-art results.
RefVQA uses a query-centered reference graph and graph-guided difference aggregation to improve AI-generated video quality assessment by incorporating inter-video comparisons.
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.
DIRECT uses a three-level multi-agent framework to solve video mashup creation as a multimodal coherency problem, outperforming baselines on a new benchmark.
A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.
A feature supervision approach using SigLIP 2 extracts multi-granularity vision-aligned text representations to supervise MM-DiT image branches, pushing the Pareto frontier for portrait generation across alignment, realism, and aesthetics.
OCCAM discovers open-set visual concepts, estimates causal contributions via object-level interventions on black-box vision models, and induces a global concept ontology from aggregated dataset evidence.
FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.
FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.
Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.
ArtifactWorld restores artifacts in 3D Gaussian Splatting by training a video diffusion backbone on 107.5K paired clips with an isomorphic predictor for artifact heatmaps and an Artifact-Aware Triplet Fusion mechanism to achieve better sparse-view novel synthesis.
VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.
ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.
LAION-Aesthetics Predictor reinforces Western and male biases by preferentially selecting images associated with women and realistic Western/Japanese art while excluding men, LGBTQ+ references, and other styles.
citing papers explorer
-
Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
-
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
-
SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection
The first benchmark for AI-generated scientific figure detection shows existing detectors fail in zero-shot transfer, overfit to specific generators, and break under common image corruptions.
-
Same Image, Different Meanings: Toward Retrieval of Context-Dependent Meanings
Image meanings grow more context-dependent with semantic abstraction, requiring narrative grounding for accurate retrieval at higher levels.
-
SpatialGrammar: A Domain-Specific Language for LLM-Based 3D Indoor Scene Generation
SpatialGrammar provides a grid-based DSL and compiler that lets LLMs generate collision-free 3D indoor scenes more reliably than raw-coordinate or code-based approaches.
-
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
-
CFSR: Geometry-Conditioned Shadow Removal via Physical Disentanglement
CFSR reframes shadow removal as a physics-constrained process using geometric and semantic priors from depth, DINO, CLIP, and frequency decoupling to achieve claimed state-of-the-art results.
-
Comparison Drives Preference: Reference-Aware Modeling for AI-Generated Video Quality Assessment
RefVQA uses a query-centered reference graph and graph-guided difference aggregation to improve AI-generated video quality assessment by incorporating inter-video comparisons.
-
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
-
Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection
Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
-
VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.
-
DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing
DIRECT uses a three-level multi-agent framework to solve video mashup creation as a multimodal coherency problem, outperforming baselines on a new benchmark.
-
When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.
-
Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics
A feature supervision approach using SigLIP 2 extracts multi-granularity vision-aligned text representations to supervise MM-DiT image branches, pushing the Pareto frontier for portrait generation across alignment, realism, and aesthetics.
-
OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models
OCCAM discovers open-set visual concepts, estimates causal contributions via object-level interventions on black-box vision models, and induces a global concept ontology from aggregated dataset evidence.
-
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.
-
FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching
FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.
-
Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding
Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.
-
ArtifactWorld: Scaling 3D Gaussian Splatting Artifact Restoration via Video Generation Models
ArtifactWorld restores artifacts in 3D Gaussian Splatting by training a video diffusion backbone on 107.5K paired clips with an isomorphic predictor for artifact heatmaps and an Artifact-Aware Triplet Fusion mechanism to achieve better sparse-view novel synthesis.
-
VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis
VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.
-
ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment
ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.
-
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
-
STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models
STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.
-
The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor
LAION-Aesthetics Predictor reinforces Western and male biases by preferentially selecting images associated with women and realistic Western/Japanese art while excluding men, LGBTQ+ references, and other styles.
-
ScaleDoc: Scaling LLM-based Predicates over Large Document Collections
ScaleDoc achieves over 2x end-to-end speedup and up to 85% fewer LLM invocations for semantic predicates on large document collections via offline LLM representations, contrastive-trained proxy filtering, and adaptive cascades.
-
A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset
An empirical audit of one web-scraped ML training dataset reveals persistent PII after sanitization, which the authors combine with legal analysis to highlight privacy risks and advocate redefining 'publicly available' data for AI training.
-
MINT: Multi-Vector Search Index Tuning
MINT defines multi-vector search index tuning and provides algorithms that achieve 2.1X to 8.3X latency speedup over baselines under storage and recall constraints.
-
Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models
DEX is a modular network using dynamically activated experts and a group-EMA director to learn emergent modular representations for multi-modality medical vision foundation models, evaluated on a new 4M-image benchmark across 10 modalities and 26 downstream tasks.
-
DealMaTe: Multi-Dimensional Material Transfer via Diffusion Transformer
DealMaTe proposes a simplified diffusion framework for material transfer that injects multi-dimensional 3D conditions via Multi-Dim 3D Shader LoRA and Shader Causal Mutual Attention with KV caching.
-
Observe Less, Understand More: Cost-aware Cross-scale Observation for Remote Sensing Understanding
A unified cost-aware formulation couples fine-grained high-resolution sampling decisions with cross-patch representation prediction to achieve superior performance-cost trade-offs on remote sensing recognition and retrieval tasks using a new 10M-image benchmark.
-
Bridging the RGB-IR Gap: Consensus and Discrepancy Modeling for Text-Guided Multispectral Detection
A text-guided fusion method for RGB-IR object detection aligns modalities via semantic bridging and incorporates both consensus and discrepancy cues through dynamic recalibration.
-
On Semiotic-Grounded Interpretive Evaluation of Generative Art
SemJudge uses a Hierarchical Semiosis Graph based on Peircean theory to evaluate deeper artistic meaning in generative art and aligns better with human judgments than prior metrics.
-
WRF4CIR: Weight-Regularized Fine-Tuning Network for Composed Image Retrieval
WRF4CIR uses weight-regularized fine-tuning with adversarial perturbations to mitigate overfitting in composed image retrieval and narrows the generalization gap on benchmarks.
-
EcoAssist: Embedding Sustainability into AI-Assisted Frontend Development
EcoAssist embeds energy estimation and optimization into AI-assisted frontend coding, reducing website energy use by 13-16% in benchmarks while preserving developer productivity.
-
Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation
HaNoRec dynamically weights harder preference samples and applies Gaussian perturbations to output distributions to improve multimodal LLM performance on sequential recommendation tasks.
-
Zoom In, Reason Out: Efficient Far-field Anomaly Detection in Expressway Surveillance Videos via Focused VLM Reasoning Guided by Bayesian Inference
VIBES uses Bayesian inference to trigger focused VLM reasoning on localized far-field regions in expressway videos, improving anomaly detection accuracy and efficiency.
- Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning
- Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs
- Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency
- Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings