The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.
Mixed citation behavior. Most common role is background (60%).
representative citing papers
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
The first benchmark for AI-generated scientific figure detection shows existing detectors fail in zero-shot transfer, overfit to specific generators, and break under common image corruptions.
MetaEvaluator meta-learns an initialization from reference models to enable accurate, label-free performance estimation for unseen models across architectures and modalities.
Image meanings grow more context-dependent with semantic abstraction, requiring narrative grounding for accurate retrieval at higher levels.
SpatialGrammar provides a grid-based DSL and compiler that lets LLMs generate collision-free 3D indoor scenes more reliably than raw-coordinate or code-based approaches.
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
CFSR reframes shadow removal as a physics-constrained process, using geometric and semantic priors from depth, DINO, CLIP, and frequency decoupling, and reports state-of-the-art results.
RefVQA uses a query-centered reference graph and graph-guided difference aggregation to improve AI-generated video quality assessment by incorporating inter-video comparisons.
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.
DIRECT uses a three-level multi-agent framework to solve video mashup creation as a multimodal coherency problem, outperforming baselines on a new benchmark.
A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.
A feature supervision approach using SigLIP 2 extracts multi-granularity vision-aligned text representations to supervise MM-DiT image branches, pushing the Pareto frontier for portrait generation across alignment, realism, and aesthetics.
OCCAM discovers open-set visual concepts, estimates causal contributions via object-level interventions on black-box vision models, and induces a global concept ontology from aggregated dataset evidence.
Generative Visual Grounding creates instance-specific visual proxy images from EEG signals to enhance MLLM understanding of brain activity beyond text-only alignment.
Introduces Eulerian motion guidance with bidirectional geometric consistency to improve training speed and temporal quality in diffusion-based image animation.
FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.
FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.
Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.
ArtifactWorld restores artifacts in 3D Gaussian Splatting by training a video diffusion backbone on 107.5K paired clips, using an isomorphic predictor for artifact heatmaps and an Artifact-Aware Triplet Fusion mechanism to improve sparse-view novel view synthesis.
VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.
ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.
citing papers explorer
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency