hub Baseline reference

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta · 2021 · cs.CV · arXiv 2111.02114

Baseline reference. 64% of citing Pith papers use this work as a benchmark or comparison.

77 Pith papers citing it

Baseline 64% of classified citations

open full Pith review browse 77 citing papers arXiv PDF

abstract

Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g. CLIP, DALL-E) gained a recent surge, showing remarkable capability to perform zero- or few-shot learning and transfer even in absence of per-sample labels on target image data. Despite this trend, to date there has been no publicly available datasets of sufficient scale for training such models from scratch. To address this issue, in a community effort we build and release for public LAION-400M, a dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings and kNN indices that allow efficient similarity search.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 17 background 6 method 1 other 1

citation-polarity summary

use dataset 16 background 6 unclear 2 use method 1

claims ledger

abstract Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g. CLIP, DALL-E) gained a recent surge, showing remarkable capability to perform zero- or few-shot learning and transfer even in absence of per-sample labels on target image data. Despite this trend, to date there has been no publicly available datasets of sufficient scale for training such models from scratch. To address this issue, in a community effort we build and release for public LAION-400M, a dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings and kNN indices that allow

co-cited works

representative citing papers

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.

Watch Your Step: Information Injection in Diffusion Models via Shadow Timestep Embedding

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

Timestep embeddings in diffusion models function as a separable side channel that can carry dedicated information for adversarial injection or detection.

VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching

cs.CV · 2026-04-30 · unverdicted · novelty 7.0 · 2 refs

VeraRetouch is a 0.5B VLM-based framework with a differentiable Retouch Renderer and a new million-scale AetherRetouch-1M+ dataset that claims state-of-the-art results in reasoning photo retouching while enabling mobile deployment.

EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

cs.CV · 2026-04-21 · unverdicted · novelty 7.0

EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.

DifFoundMAD: Foundation Models meet Differential Morphing Attack Detection

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

DifFoundMAD improves differential morphing attack detection by replacing traditional embeddings with those from vision foundation models and applying class-balanced lightweight fine-tuning, cutting high-security error rates from 6.16% to 2.17%.

InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.

PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.

Distance Comparison Operations Are Not Silver Bullets in Vector Similarity Search: A Benchmark Study on Their Merits and Limits

cs.DB · 2026-04-03 · accept · novelty 7.0

Benchmark study shows DCO methods for vector similarity search are not reliable silver bullets due to high sensitivity to data properties and hardware, making them unsuitable for production deployment.

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

cs.CV · 2026-01-07 · unverdicted · novelty 7.0 · 2 refs

LocalDPO aligns text-to-video diffusion models with human preferences at the spatio-temporal region level by automatically generating localized preference pairs from corrupted real videos and applying a region-aware DPO loss.

Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data

cs.LG · 2025-09-25 · unverdicted · novelty 7.0

Introduces the first active learning framework for unaligned multimodal data that selects alignments using uncertainty and diversity to cut annotation costs by up to 40% on benchmarks while preserving accuracy.

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

cs.CV · 2024-07-10 · unverdicted · novelty 7.0

LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

cs.CV · 2024-03-08 · unverdicted · novelty 7.0

ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.

Learning Interactive Real-World Simulators

cs.AI · 2023-10-09 · conditional · novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

cs.CV · 2023-03-28 · conditional · novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

Language Is Not All You Need: Aligning Perception with Language Models

cs.CL · 2023-02-27 · conditional · novelty 7.0

Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

cs.CV · 2023-01-30 · unverdicted · novelty 7.0

BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero-shot VQAv2.

LAION-5B: An open large-scale dataset for training next generation image-text models

cs.CV · 2022-10-16 · accept · novelty 7.0

LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

Phenaki: Variable Length Video Generation From Open Domain Textual Description

cs.CV · 2022-10-05 · unverdicted · novelty 7.0

Phenaki generates arbitrary-length videos from sequences of text prompts by tokenizing videos with causal temporal attention and generating tokens with a text-conditioned masked transformer, trained jointly on images and videos.

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

cs.CV · 2022-05-23 · accept · novelty 7.0

Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

CPC-VAR:Continual Personalized and Compositional Generation in Visual Autoregressive Models

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

CPC-VAR adds Gradient-based Concept Neuron Selection for continual single-concept learning and a context-aware multi-branch composition strategy to reduce forgetting and entanglement in VAR-based personalized image generation.

PipeANN-Filter: An Efficient Filtered Vector Search System on SSD

cs.OS · 2026-05-18 · unverdicted · novelty 6.0

PipeANN-Filter improves filtered vector search latency and throughput on SSD by exploring a superset of valid vectors identified via probabilistic filters and verifying attributes only after selecting top-k candidates.

citing papers explorer

Showing 50 of 77 citing papers.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence cs.CL · 2026-05-13 · accept · none · ref 33 · internal anchor
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion cs.CV · 2022-08-02 · unverdicted · none · ref 28 · internal anchor
Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media cs.CL · 2026-05-16 · unverdicted · none · ref 172 · internal anchor
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.
Watch Your Step: Information Injection in Diffusion Models via Shadow Timestep Embedding cs.LG · 2026-05-01 · unverdicted · none · ref 26 · internal anchor
Timestep embeddings in diffusion models function as a separable side channel that can carry dedicated information for adversarial injection or detection.
VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching cs.CV · 2026-04-30 · unverdicted · none · ref 3 · 2 links · internal anchor
VeraRetouch is a 0.5B VLM-based framework with a differentiable Retouch Renderer and a new million-scale AetherRetouch-1M+ dataset that claims state-of-the-art results in reasoning photo retouching while enabling mobile deployment.
EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training cs.CV · 2026-04-21 · unverdicted · none · ref 18 · internal anchor
EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.
DifFoundMAD: Foundation Models meet Differential Morphing Attack Detection cs.CV · 2026-04-20 · unverdicted · none · ref 55 · internal anchor
DifFoundMAD improves differential morphing attack detection by replacing traditional embeddings with those from vision foundation models and applying class-balanced lightweight fine-tuning, cutting high-security error rates from 6.16% to 2.17%.
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding cs.CV · 2026-04-09 · unverdicted · none · ref 45 · internal anchor
InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space cs.LG · 2026-04-03 · unverdicted · none · ref 38 · internal anchor
PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.
Distance Comparison Operations Are Not Silver Bullets in Vector Similarity Search: A Benchmark Study on Their Merits and Limits cs.DB · 2026-04-03 · accept · none · ref 31 · internal anchor
Benchmark study shows DCO methods for vector similarity search are not reliable silver bullets due to high sensitivity to data properties and hardware, making them unsuitable for production deployment.
Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models cs.CV · 2026-01-07 · unverdicted · none · ref 49 · 2 links · internal anchor
LocalDPO aligns text-to-video diffusion models with human preferences at the spatio-temporal region level by automatically generating localized preference pairs from corrupted real videos and applying a region-aware DPO loss.
Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data cs.LG · 2025-09-25 · unverdicted · none · ref 7 · internal anchor
Introduces the first active learning framework for unaligned multimodal data that selects alignments using uncertainty and diversity to cut annotation costs by up to 40% on benchmarks while preserving accuracy.
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models cs.CV · 2024-07-10 · unverdicted · none · ref 46 · internal anchor
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment cs.CV · 2024-03-08 · unverdicted · none · ref 48 · internal anchor
ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.
Learning Interactive Real-World Simulators cs.AI · 2023-10-09 · conditional · none · ref 183 · internal anchor
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention cs.CV · 2023-03-28 · conditional · none · ref 72 · internal anchor
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
Language Is Not All You Need: Aligning Perception with Language Models cs.CL · 2023-02-27 · conditional · none · ref 23 · internal anchor
Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models cs.CV · 2023-01-30 · unverdicted · none · ref 10 · internal anchor
BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero-shot VQAv2.
LAION-5B: An open large-scale dataset for training next generation image-text models cs.CV · 2022-10-16 · accept · none · ref 72 · internal anchor
LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
Phenaki: Variable Length Video Generation From Open Domain Textual Description cs.CV · 2022-10-05 · unverdicted · none · ref 41 · internal anchor
Phenaki generates arbitrary-length videos from sequences of text prompts by tokenizing videos with causal temporal attention and generating tokens with a text-conditioned masked transformer, trained jointly on images and videos.
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding cs.CV · 2022-05-23 · accept · none · ref 61 · internal anchor
Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.
Flamingo: a Visual Language Model for Few-Shot Learning cs.CV · 2022-04-29 · unverdicted · none · ref 97 · internal anchor
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
CPC-VAR:Continual Personalized and Compositional Generation in Visual Autoregressive Models cs.CV · 2026-05-19 · unverdicted · none · ref 34 · internal anchor
CPC-VAR adds Gradient-based Concept Neuron Selection for continual single-concept learning and a context-aware multi-branch composition strategy to reduce forgetting and entanglement in VAR-based personalized image generation.
PipeANN-Filter: An Efficient Filtered Vector Search System on SSD cs.OS · 2026-05-18 · unverdicted · none · ref 33 · internal anchor
PipeANN-Filter improves filtered vector search latency and throughput on SSD by exploring a superset of valid vectors identified via probabilistic filters and verifying attributes only after selecting top-k candidates.
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion cs.CV · 2026-05-08 · unverdicted · none · ref 70 · internal anchor
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics cs.CV · 2026-04-27 · conditional · none · ref 11 · internal anchor
CLIP models understand 360-degree textual semantics via explicit identifiers but show limited comprehension of visual semantics under horizontal circular shifts, which a LoRA fine-tuning approach improves with a noted trade-off in original task performance.
Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding cs.CV · 2026-04-21 · unverdicted · none · ref 36 · internal anchor
A minimally modified vanilla Transformer called Volt achieves state-of-the-art 3D semantic and instance segmentation by using volumetric tokens, 3D rotary embeddings, and a data-efficient training recipe that scales better than domain-specific backbones.
CLIP-RD: Relative Distillation for Efficient CLIP Knowledge Distillation cs.CV · 2026-03-26 · unverdicted · none · ref 39 · internal anchor
CLIP-RD adds VRD for cross-modality distillation consistency and XRD for bidirectional cross-modal symmetry to align student embedding geometry more closely with the teacher, yielding a 0.8 percentage point gain over prior distillation methods.
Vision Transformers Need More Than Registers cs.CV · 2026-02-25 · unverdicted · none · ref 28 · internal anchor
ViTs exhibit lazy aggregation by relying on irrelevant background patches for global semantics, and selectively integrating patch features into the CLS token reduces this effect and improves results across label-, text-, and self-supervision.
Evaluating Concept Filtering Defenses against Child Sexual Abuse Material Generation by Text-to-Image Models cs.CR · 2025-12-05 · unverdicted · none · ref 40 · internal anchor
Concept filtering of child images from training data offers only limited protection against CSAM generation in text-to-image models, as prompting strategies and fine-tuning can bypass filters even when most child images are removed.
SkyReels-Text: Fine-Grained Font-Controllable Text Editing for Poster Design cs.CV · 2025-11-17 · unverdicted · none · ref 30 · internal anchor
SkyReels-Text enables simultaneous fine-grained editing of multiple text regions in posters using arbitrary glyph patches for font control without labels or test-time fine-tuning.
The Persistence of Cultural Memory: Investigating Multimodal Iconicity in Diffusion Models cs.CV · 2025-11-14 · unverdicted · none · ref 28 · internal anchor
Diffusion models show distinct patterns of recognizing versus replicating culturally iconic references, with recognition linked to data frequency, textual uniqueness, popularity, and creation date rather than simple copying.
DeepSeek-OCR: Contexts Optical Compression cs.CV · 2025-10-21 · unverdicted · none · ref 31 · internal anchor
DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing cs.CV · 2025-06-23 · unverdicted · none · ref 22 · internal anchor
CPAM proposes a context-preserving adaptive manipulation method for zero-shot real image editing in diffusion models via preservation adaptation and localized extraction modules, outperforming prior techniques on a new IMBA benchmark.
OpenVLA: An Open-Source Vision-Language-Action Model cs.RO · 2024-06-13 · unverdicted · none · ref 83 · internal anchor
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
BLINK: Multimodal Large Language Models Can See but Not Perceive cs.CV · 2024-04-18 · accept · none · ref 65 · internal anchor
BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions cs.CV · 2023-11-21 · conditional · none · ref 48 · internal anchor
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation cs.LG · 2023-10-19 · conditional · none · ref 134 · internal anchor
SalUn uses gradient-based weight saliency to achieve effective machine unlearning of data, classes, or concepts in image classification and generation, narrowing the gap to exact retraining.
Demystifying CLIP Data cs.CV · 2023-09-28 · accept · none · ref 62 · internal anchor
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
Otter: A Multi-Modal Model with In-Context Instruction Tuning cs.CV · 2023-05-05 · unverdicted · none · ref 76 · internal anchor
Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality cs.CL · 2023-04-27 · unverdicted · none · ref 11 · internal anchor
mPLUG-Owl introduces a two-stage modular training paradigm that aligns images with text in LLMs via frozen visual modules followed by LoRA fine-tuning, achieving strong multimodal instruction following.
EVA-CLIP: Improved Training Techniques for CLIP at Scale cs.CV · 2023-03-27 · conditional · none · ref 46 · internal anchor
EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.
SemDeDup: Data-efficient learning at web-scale through semantic deduplication cs.LG · 2023-03-16 · unverdicted · none · ref 35 · internal anchor
SemDeDup removes semantic duplicates from datasets like LAION using pre-trained embeddings, cutting data by 50% with minimal performance loss and efficiency gains on C4.
Aligning Text-to-Image Models using Human Feedback cs.LG · 2023-02-23 · unverdicted · none · ref 18 · internal anchor
A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation cs.CV · 2022-06-22 · unverdicted · none · ref 43 · internal anchor
Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset cs.CV · 2026-05-20 · unverdicted · none · ref 82 · internal anchor
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset cs.CV · 2026-05-19 · unverdicted · none · ref 41 · internal anchor
PixVerve introduces a 95K ultra-high-resolution image-text dataset and training strategies that enable native 100-megapixel text-to-image generation together with a new evaluation benchmark.
DiffMagicFace: Identity Consistent Facial Editing of Real Videos cs.CV · 2026-04-15 · unverdicted · none · ref 43 · internal anchor
DiffMagicFace uses concurrent fine-tuned text and image diffusion models plus a rendered multi-view dataset to achieve identity-consistent text-conditioned editing of real facial videos.
Dynamic Eraser for Guided Concept Erasure in Diffusion Models cs.CV · 2026-04-13 · unverdicted · none · ref 27 · internal anchor
DSS is a lightweight inference-time framework that erases concepts in diffusion models at 91% average rate while preserving image fidelity, outperforming prior methods.
Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning eess.SY · 2026-04-08 · unverdicted · none · ref 39 · internal anchor
High-resolution power profiles for AI workloads on H100 GPUs are measured and scaled to whole-facility energy demand using a bottom-up model, with the dataset made public.

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer