{"total":17,"items":[{"citing_arxiv_id":"2605.21484","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:59:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Fixed-Point Distillation constructs one-step correction targets for discrete diffusion generators via partial corruption and single teacher refinement, lifted into continuous features with a multi-bandwidth drift loss and straight-through estimation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21272","ref_index":16,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset","primary_cat":"cs.CV","submitted_at":"2026-05-20T15:04:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"and product photography and is dominated by casual photography, a catch-all class for everyday photos. The full per-style distribution, including styles grouped under \"Other\", is reported in Fig. 32. 5 Downstream validation 5.1 Impact of multi-captioning To justify the decision to use multiple captioning models in MONET, we assess the impact of caption types on the performance of a T2I model. 
To do so we re-caption the ImageNet dataset [16] with four captioners of different complexity: BLIP2 [54], CogVLM22, Florence2 [100] and ShareGPT4V [13]. We then train five T2I diffusion models, one per captioner and one with captions uniformly sampled from all four (Mix); other training details are provided in Appendix A.5. We report in Fig. 10 (left) the Long-CLIP alignment score [107] and (right) the Fréchet Inception Distance (FID) [30] computed on"},{"citing_arxiv_id":"2605.16949","ref_index":4,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-16T12:01:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"sREPA enforces structural consistency in relational geometry of pre-trained vision features to accelerate DiT training and improve generation quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18868","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models","primary_cat":"cs.CR","submitted_at":"2026-05-15T12:28:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DarkLLM trains an LLM to generate language-driven adversarial perturbations that unify targeted, untargeted, segmentation, and multi-model attacks on foundation models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14333","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image 
Generation","primary_cat":"cs.CV","submitted_at":"2026-05-14T03:57:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13010","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Amortized Guidance for Image Inpainting with Pretrained Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-13T05:02:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AID amortizes guidance for diffusion inpainting by training a reusable module via an auxiliary Gaussian formulation and continuous-time actor-critic algorithm, improving quality-speed trade-off with under 1% overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10765","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning","primary_cat":"cs.CV","submitted_at":"2026-05-11T15:59:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DRAPE generates query-image conditioned prompts on the fly for multimodal continual instruction tuning and reports SOTA results on MCIT benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"˜e, selects t∗ by prototype similarity, and activates Gt∗ to synthesize Pi =G t∗(wi,u i). The frozen LLM then generates the response from[P i;z]. 
5 Experiments 5.1 Implementation Details Datasets. We conduct experiments on two MCIT benchmarks. The first is CoIN [2], which consists of eight sequential VQA tasks: ScienceQA [29], TextVQA [35], ImageNet [5], GQA [16], VizWiz [12], 7 Table 2: Main results on the UCIT benchmark with LLaVA-v1.5-7B as the backbone (higher is better). The best and second-best values are marked in bold and underline, respectively. Methods ImgNet-R ArxivQA VizWiz IconQA CLEVR Flickr30k Average Zero-shot 16.27 53.73 38.39 19.20 20.63 41.88 - LoRA-FT [14] 58.03 77.63 44.39 67."},{"citing_arxiv_id":"2605.10661","ref_index":6,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition","primary_cat":"cs.CV","submitted_at":"2026-05-11T14:43:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A 12-step single-block recurrent ViT-B reaches accuracy comparable to a standard ViT-B on ImageNet-1K while using an order of magnitude fewer parameters.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"across steps, rather than allocating separate parameters to each depth. This may reduce redundancy in the parameterization and provides a cleaner setting for analyzing how computation evolves over the course of inference. 3.3 Model training We perform image classification on ImageNet-1K, which contains approximately 1.28 million training images and 50,000 validation images from 1000 classes [6]. We trained ViT and bViT under the same training recipe to enable a controlled comparison between the standard and recurrent architectures. The recipe follows the established PyTorch ViT training implementation, with minor modifications for the recurrent setting [27]. 
In particular, we omit training components that introduce layer or step-specific stochasticity, such as stochastic depth and aggressive dropout, since these are not directly"},{"citing_arxiv_id":"2605.04358","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Intermediate Representations are Strong AI-Generated Image Detectors","primary_cat":"cs.CV","submitted_at":"2026-05-05T23:26:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Intermediate layer embedding sensitivity to perturbations distinguishes AI-generated images from real ones, yielding higher AUROC on GenImage and Forensics Small benchmarks than prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.28158","ref_index":9,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists","primary_cat":"cs.AI","submitted_at":"2026-04-30T17:44:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Intern-Atlas constructs a methodological evolution graph with 9.4 million edges from 1.03 million AI papers to capture how methods emerge, adapt, and transition, enabling better idea evaluation and generation for AI-driven research.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15711","ref_index":8,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification","primary_cat":"cs.CV","submitted_at":"2026-04-17T05:32:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SSMamba uses a two-stage 
self-supervised pretraining and fine-tuning pipeline with Mamba-based components to outperform prior pathological foundation models on ROI and WSI classification tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14724","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet","primary_cat":"cs.CV","submitted_at":"2026-04-16T07:33:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HAMSA achieves 85.7% ImageNet-1K top-1 accuracy as a spectral-domain SSM with 2.2x faster inference and lower memory than transformers or scanning-based SSMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.21045","ref_index":56,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LPNSR: Optimal Noise-Guided Diffusion Image Super-Resolution Via Learnable Noise Prediction","primary_cat":"cs.CV","submitted_at":"2026-03-22T03:52:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LPNSR derives optimal intermediate noise for diffusion SR via MLE and implements it with an LR-guided noise predictor, reaching SOTA perceptual quality in 4 steps without text priors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.15956","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors","primary_cat":"cs.RO","submitted_at":"2026-03-16T22:12:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ExpertGen generates high-success expert 
policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.23532","ref_index":7,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution","primary_cat":"cs.CV","submitted_at":"2025-12-29T15:09:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"IAFS is a training-free iterative inference-time scaling framework that uses adaptive frequency-aware particle fusion to resolve the perception-fidelity conflict in diffusion super-resolution models, outperforming prior scaling strategies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.15572","ref_index":44,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers","primary_cat":"cs.CV","submitted_at":"2025-11-19T16:03:21+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.19519","ref_index":58,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Preserve and Personalize: Personalized Text-to-Image Diffusion Models without Distributional Drift","primary_cat":"cs.CV","submitted_at":"2025-05-26T05:03:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proposes Lipschitz regularization during fine-tuning to prevent distributional drift in personalized 
diffusion models, improving subject fidelity and prompt adherence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}