{"total":20,"items":[{"citing_arxiv_id":"2605.20777","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-20T06:17:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AttriStory adds a benchmark and AttriLoss-based latent optimization to improve faithful rendering of fine-grained attributes such as clothing color and texture in diffusion-model visual storytelling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20337","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Capability $\\neq$ Interpretability: Human Interpretability of Vision Foundation Models","primary_cat":"cs.CV","submitted_at":"2026-05-19T18:00:06+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19032","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Personalized Face Privacy Protection From a Single Image","primary_cat":"cs.CV","submitted_at":"2026-05-18T18:56:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FaceCloak learns a lightweight identity-specific cloaking mask from a single image via synthetic face generation and iterative embedding perturbation to evade multiple recognition 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18324","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Improved Baselines with Representation Autoencoders","primary_cat":"cs.CV","submitted_at":"2026-05-18T12:42:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12090","ref_index":209,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"World Action Models: The Next Frontier in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"Visual Fidelity PSNR, SSIM [ 193], LPIPS [ 194], DreamSim [ 195], DINO [196], FVD [197] Physical Commonsense VideoPhy [198], PhyGenBench [199], VBench-2.0 [ 200], WorldModelBench [201] Physics-IQ [202], WorldScore [203], EWMBench [ 204] Action Plausibility WorldSimBench [205], Wow , wo, val! 
[206] Action Policy General MetaWorld [207], RLBench [ 208], Robomimic [209], Franka Kitchen [ 210], ManiSkill [ 211] ManiSkill2 [151], ManiSkill3 [ 212], RoboCasa [152], CALVIN [213], VIMAbench [214] VLMbench [215], LIBERO [216], Libero-plus [ 4], Libero-pro [ 217], Libero-X [ 218] COLOSSEUM [219], AGNOSTOS [220], RoboEval [221], RoboVerse [222], PolaRiS [223] RoboMME [224], GenManip [ 225], VLABench [ 226], RoboSuite [227], RoboLab [228]"},{"citing_arxiv_id":"2605.11927","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation","primary_cat":"cs.CV","submitted_at":"2026-05-12T10:39:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RealDiffusion uses heat diffusion as a dissipative prior and a region-aware stochastic process inside a training-free physics-informed attention mechanism to improve multi-character coherence while preserving narrative dynamism in sequential image generation.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"5, and the base coefficients are set to c_id = 1, c_heat = 2, c_s = 0.1, and c_b = 0.1. Metrics. CLIP-T measures text alignment using the average CLIPScore [8]. For identity preservation, IDSim computes the average CLIP image similarity between each frame and the reference ID image. For temporal smoothness, CLIP-I measures adjacent-frame cosine similarity, while DreamSim [5] reports the perceptual distance. To capture the trade-off central to our work, we introduce two sequence-aware measures on the sequence of per-frame CLIP features {f_t}_{t=1}^{T}: Temporal Regularity (R_t) and Storytelling Quality (S_t). R_t (lower is better) is designed to measure coherence by quantifying the smoothness of the feature trajectory. 
It penalizes temporal fluctu-"},{"citing_arxiv_id":"2605.02583","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Stylistic Attribute Control in Latent Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-04T13:34:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22855","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Evaluating Remote Sensing Image Captions Beyond Metric Biases","primary_cat":"cs.CV","submitted_at":"2026-04-22T12:28:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15453","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"(1D) Ordered Tokens Enable Efficient Test-Time Search","primary_cat":"cs.CV","submitted_at":"2026-04-16T18:13:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Coarse-to-fine 1D token sequences in autoregressive models enable stronger test-time search and even training-free text-to-image generation guided by verifiers, outperforming traditional 2D grid 
tokenization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11797","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SyncFix: Fixing 3D Reconstructions via Multi-View Synchronization","primary_cat":"cs.CV","submitted_at":"2026-04-13T17:58:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SyncFix improves 3D reconstructions by synchronizing multi-view latent representations in a diffusion refinement process, generalizing from pair-wise training to arbitrary view counts at inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08500","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Novel View Synthesis as Video Completion","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:44:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"We evaluate our method on DL3DV-Benchmark [22] and Mip-NeRF 360 [2], which contain diverse real-world scenes with calibrated camera poses and multi-view images. For fair comparison, we follow the train/test split from SEVA [55]. We report performance using standard image reconstruction and perceptual metrics, including PSNR, SSIM, LPIPS [51], and DreamSim [9]. PSNR measures pixel-level fidelity, SSIM structural similarity, while LPIPS and DreamSim quantify perceptual similarity in learned feature spaces. Baselines. We compare against representative generative NVS approaches. 
(1) EscherNet [20], a scalable view synthesis model built upon image diffusion priors, which we fine-tune on the 10K scene-level multi-view dataset (DL3DV [22]) fol-"},{"citing_arxiv_id":"2604.05039","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ID-Sim: An Identity-Focused Similarity Metric","primary_cat":"cs.CV","submitted_at":"2026-04-06T18:00:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ID-Sim is a new similarity metric that aims to capture human selective sensitivity to identities by training on curated real and generative synthetic data and validating against human annotations on recognition, retrieval, and generative tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"quantifying its identity consistency. 2.2. Visual similarity metrics Perceptual metrics. SSIM [89], PSNR [27], and other classical perceptual metrics [60, 88, 102] are hand-designed, and often fail to capture the complex nuances of human perceptual similarity [103]. Alternatively, learning-based methods (e.g., LPIPS [103], PieAPP [54], DreamSim [18], DISTS [11]) show that embeddings from deep networks [37, 70] can be calibrated or trained on perceptual judgments, and even align well with human perceptual judgments [103]. This observation extends to other modalities, such as stereo [79] and audio [43]. 
DiffSim [75] has also found that diffusion model features align well with human judgments of perceptual similarity."},{"citing_arxiv_id":"2604.06061","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space","primary_cat":"cs.LG","submitted_at":"2026-04-03T17:00:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"several CLIP models. We note that the baselines optimize the CLIP similarity score directly. However, as is common practice, the CLIP model used for evaluation is different from the model used for optimization. Additionally, we report the BLIP [22] score, which measures text-image similarity via the normalized L2 distance. Finally, we consider DreamSim [10], a learned perceptual similarity score that embeds both the target and generated images using a fused representation from pretrained vision models (e.g., CLIP [34], OpenCLIP [18], and DINO [6]) and compares the resulting embeddings to produce a similarity score aligned with human judgments. 
Table 1: Quantitative comparison across datasets."},{"citing_arxiv_id":"2604.02003","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-04-02T13:09:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ProDiG progressively transforms aerial Gaussian splats into coherent ground-level 3D reconstructions via diffusion guidance and specialized attention modules.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"ages, with large changes in viewpoint from aerial to ground. We also evaluate our approach on a synthetic dataset: Matrix City [21]. For Matrix City, we focus specifically on the small city, as it includes both aerial and street views. Evaluation Metrics: We evaluate our method using two structural metrics, PSNR and SSIM [37] and two perceptual metrics, LPIPS [48] and DreamSim [7]. Among these, DreamSim is specifically designed to align quantitative evaluation more closely with human visual perception, making it our primary metric for assessing perceptual quality. In addition to reconstruction accuracy, we also report the total number of Gaussians used in each method to quantify memory efficiency and scalability. 
Baselines:We compare our approach against several state-"},{"citing_arxiv_id":"2601.00090","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models","primary_cat":"cs.CV","submitted_at":"2025-12-31T19:47:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Noise optimization during sampling recovers diversity in mode-collapsed diffusion models while preserving output fidelity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.19115","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?","primary_cat":"cs.CV","submitted_at":"2025-12-22T07:36:20+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MLLM representation spaces are dominated by textual semantics that reduce discriminative power for multimodal retrieval; a whitening transformation called ReAlign corrects the geometry and boosts zero-shot performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.12598","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Setting the Stage: Text-Driven Scene-Consistent Image Generation","primary_cat":"cs.CV","submitted_at":"2025-12-14T08:35:04+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved 
alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.09547","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning","primary_cat":"cs.CV","submitted_at":"2025-08-13T07:05:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GoViG decomposes goal-conditioned navigation instruction generation into visual state prediction and instruction synthesis using an autoregressive multimodal LLM with one-pass and interleaved reasoning, showing gains on a new R2R-Goal dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.17726","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM","primary_cat":"cs.CV","submitted_at":"2025-05-23T10:43:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Slot-MLLM introduces a slot-attention-based object-centric visual tokenizer with Q-Former encoder, diffusion decoder, and residual vector quantization for improved local visual comprehension and generation in multimodal LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.05160","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks","primary_cat":"cs.CV","submitted_at":"2024-10-07T16:14:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VLM2Vec converts state-of-the-art vision-language models into universal multimodal embedders 
via contrastive training on the new MMEB benchmark, delivering 10-20% absolute gains over prior models on both in-distribution and out-of-distribution tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}