DreamSim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344
10 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (10)
verdicts: UNVERDICTED (10)
roles: dataset (1)
polarities: background (1)
citing papers explorer
- Evaluating Remote Sensing Image Captions Beyond Metric Biases
  Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves state-of-the-art performance.
- Novel View Synthesis as Video Completion
  Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
- PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space
  PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them with a genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.
- ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction
  ProDiG progressively transforms aerial Gaussian splats into coherent ground-level 3D reconstructions via diffusion guidance and specialized attention modules.
- Stylistic Attribute Control in Latent Diffusion Models
  A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.
- (1D) Ordered Tokens Enable Efficient Test-Time Search
  Coarse-to-fine 1D token sequences in autoregressive models enable stronger test-time search and even training-free text-to-image generation guided by verifiers, outperforming traditional 2D grid tokenization.
- RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation
  RealDiffusion uses heat diffusion as a dissipative prior and a region-aware stochastic process inside a training-free physics-informed attention mechanism to improve multi-character coherence while preserving narrative dynamism in sequential image generation.
- SyncFix: Fixing 3D Reconstructions via Multi-View Synchronization
  SyncFix improves 3D reconstructions by synchronizing multi-view latent representations in a diffusion refinement process, generalizing from pairwise training to arbitrary view counts at inference.
- ID-Sim: An Identity-Focused Similarity Metric
  ID-Sim is a new similarity metric that aims to capture human selective sensitivity to identities by training on curated real and generative synthetic data and validating against human annotations on recognition, retrieval, and generative tasks.
- World Action Models: The Next Frontier in Embodied AI
  The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.