BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi · 2023 · cs.CV · arXiv 2301.12597

Canonical reference. 133 Pith papers cite this work, and 75% of classified citations cite it as background.

abstract

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
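The bridging idea in the abstract can be illustrated with a toy sketch. It assumes the Querying Transformer works by letting a small set of trainable query vectors cross-attend to the frozen image encoder's patch features, then projecting the result into the language model's embedding space; the abstract does not spell out this mechanism, and every name and size below (`cross_attend`, `project`, `DIM`, `LLM_DIM`, `NUM_QUERIES`, `NUM_PATCHES`) is hypothetical. The point is only that the language model receives a fixed number of visual tokens regardless of how many patches the image encoder emits.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, image_feats):
    """Single-head cross-attention: every query attends over all image features."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim)
                  for f in image_feats]
        weights = softmax(scores)
        out.append([sum(w * f[d] for w, f in zip(weights, image_feats))
                    for d in range(dim)])
    return out


def project(tokens, weight):
    """Linear map from the query dimension to the (hypothetical) LLM embedding dimension."""
    return [[sum(t[i] * weight[i][j] for i in range(len(t)))
             for j in range(len(weight[0]))]
            for t in tokens]


rng = random.Random(0)
DIM, LLM_DIM, NUM_QUERIES, NUM_PATCHES = 8, 16, 4, 50  # illustrative sizes only

# Stand-in for the frozen image encoder's output: random patch features, never updated.
image_feats = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

# The only trainable pieces in this sketch: the query vectors and the projection.
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
proj = [[rng.gauss(0, 0.1) for _ in range(LLM_DIM)] for _ in range(DIM)]

# A fixed-length "visual prompt" that would be prepended to the frozen LLM's text embeddings.
visual_prompt = project(cross_attend(queries, image_feats), proj)
print(len(visual_prompt), len(visual_prompt[0]))
```

In this sketch the 50 patch features are compressed to 4 tokens of width 16, so the trainable parameter count depends only on the query and projection sizes, not on the frozen models.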

citation-role summary

background: 31 · method: 6 · baseline: 2 · dataset: 1

citation-polarity summary

background: 30 · use method: 6 · unclear: 2 · baseline: 1 · use dataset: 1

representative citing papers

Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature

cs.CV · 2026-06-29 · accept · novelty 8.0

MatMMExtract pipeline creates MatSciFig dataset of 391k annotated materials science figure panels and MaterialScope detection dataset with high accuracy.

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

cs.CV · 2024-08-23 · conditional · novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

Towards Fully Automated Exam Grading: Fairness-Aware Recognition of Handwritten Answers with Foundation Models

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

Vision-language models reach 98.4% accuracy on 3141 handwritten single-letter exam answers across 61 tests, with false-negative rate reduced to 0.58% via reference-solution prompting.

Towards One-to-Many Temporal Grounding

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Introduces OMTG benchmark with C-Acc and EtF1 metrics, a 56k dataset, and caption/temporal rewards, reaching 43.65% EtF1 SOTA on the new bench.

UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning

cs.CV · 2026-05-09 · unverdicted · novelty 7.0

UniShield introduces a knowledge-graph-informed multimodal framework that improves unified detection of physical and digital face attacks through instruction tuning and consistency-optimized reasoning.

Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

cs.MM · 2026-04-16 · unverdicted · novelty 7.0

Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pair benchmark.

Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

cs.CV · 2026-04-15 · conditional · novelty 7.0

Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

Bottleneck Tokens for Unified Multimodal Retrieval

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.

LooseRoPE: Content-aware Attention Manipulation for Semantic Harmonization

cs.GR · 2026-01-08 · unverdicted · novelty 7.0

LooseRoPE modulates RoPE in diffusion attention maps to continuously trade off between preserving a pasted object's identity and harmonizing it with its new surroundings.

SAM 3: Segment Anything with Concepts

cs.CV · 2025-11-20 · unverdicted · novelty 7.0

SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

cs.CV · 2024-10-22 · accept · novelty 7.0

PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

cs.RO · 2024-09-03 · conditional · novelty 7.0

ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.

Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

cs.CV · 2024-06-14 · unverdicted · novelty 7.0

Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.

3D-VLA: A 3D Vision-Language-Action Generative World Model

cs.CV · 2024-03-14 · unverdicted · novelty 7.0

3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

cs.CV · 2024-01-17 · conditional · novelty 7.0

Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.

LRM: Large Reconstruction Model for Single Image to 3D

cs.CV · 2023-11-08 · conditional · novelty 7.0

LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

cs.CV · 2023-10-23 · unverdicted · novelty 7.0

HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.

DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

cs.CV · 2023-09-28 · unverdicted · novelty 7.0

DreamGaussian creates high-quality textured 3D meshes from single-view images in 2 minutes via generative Gaussian Splatting with mesh extraction and UV refinement.

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

cs.RO · 2023-07-12 · unverdicted · novelty 7.0

VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

Evaluating Object Hallucination in Large Vision-Language Models

cs.CV · 2023-05-17 · accept · novelty 7.0

Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.

VideoChat: Chat-Centric Video Understanding

cs.CV · 2023-05-10 · conditional · novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

Visual Instruction Tuning

cs.CV · 2023-04-17 · unverdicted · novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

cs.CV · 2023-03-28 · conditional · novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

citing papers explorer

Showing 50 of 97 citing papers after filters.

Towards Fully Automated Exam Grading: Fairness-Aware Recognition of Handwritten Answers with Foundation Models

cs.CV · 2026-06-09 · unverdicted · none · ref 9 · internal anchor

Vision-language models reach 98.4% accuracy on 3141 handwritten single-letter exam answers across 61 tests, with false-negative rate reduced to 0.58% via reference-solution prompting.

Towards One-to-Many Temporal Grounding

cs.CV · 2026-06-04 · unverdicted · none · ref 40 · internal anchor

Introduces OMTG benchmark with C-Acc and EtF1 metrics, a 56k dataset, and caption/temporal rewards, reaching 43.65% EtF1 SOTA on the new bench.

UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning

cs.CV · 2026-05-09 · unverdicted · none · ref 11 · internal anchor

UniShield introduces a knowledge-graph-informed multimodal framework that improves unified detection of physical and digital face attacks through instruction tuning and consistency-optimized reasoning.

Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

cs.MM · 2026-04-16 · unverdicted · none · ref 37 · internal anchor

Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pair benchmark.

Bottleneck Tokens for Unified Multimodal Retrieval

cs.LG · 2026-04-13 · unverdicted · none · ref 13 · internal anchor

Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

cs.CV · 2026-04-03 · unverdicted · none · ref 39 · internal anchor

UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.

LooseRoPE: Content-aware Attention Manipulation for Semantic Harmonization

cs.GR · 2026-01-08 · unverdicted · none · ref 23 · internal anchor

LooseRoPE modulates RoPE in diffusion attention maps to continuously trade off between preserving a pasted object's identity and harmonizing it with its new surroundings.

SAM 3: Segment Anything with Concepts

cs.CV · 2025-11-20 · unverdicted · none · ref 69 · internal anchor

SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.

Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

cs.CV · 2024-06-14 · unverdicted · none · ref 12 · internal anchor

Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.

3D-VLA: A 3D Vision-Language-Action Generative World Model

cs.CV · 2024-03-14 · unverdicted · none · ref 32 · internal anchor

3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

cs.CV · 2023-10-23 · unverdicted · none · ref 21 · internal anchor

HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.

DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

cs.CV · 2023-09-28 · unverdicted · none · ref 111 · internal anchor

DreamGaussian creates high-quality textured 3D meshes from single-view images in 2 minutes via generative Gaussian Splatting with mesh extraction and UV refinement.

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

cs.RO · 2023-07-12 · unverdicted · none · ref 127 · internal anchor

VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

Visual Instruction Tuning

cs.CV · 2023-04-17 · unverdicted · none · ref 28 · internal anchor

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

ViperGPT: Visual Inference via Python Execution for Reasoning

cs.CV · 2023-03-14 · unverdicted · none · ref 31 · internal anchor

ViperGPT generates executable Python code to compose pre-trained vision-and-language modules into programs that answer visual queries, reaching state-of-the-art results with no additional training.

The Illusion of High Utility in Safety Alignment of Text-to-Image Diffusion Models

cs.CV · 2026-07-01 · unverdicted · none · ref 14 · internal anchor

Safety-aligned T2I diffusion models exhibit semantic collapse in text embeddings causing TIFA drops; SAGE regularization restores structured utility while retaining safety.

Streamlining Analysis and Design of Two-Dimensional Electronic Spectroscopy using Machine Learning

physics.chem-ph · 2026-06-17 · unverdicted · none · ref 4 · internal anchor

A Gaussian mixture model is used to learn spectral densities from 2DES experiments, enabling extraction of vibronic couplings, spectral extrapolation, and optimized experiment selection across simulated and experimental systems.

Contrastive Action-Image Pre-training for Visuomotor Control

cs.RO · 2026-06-15 · unverdicted · none · ref 8 · internal anchor

CAIP learns action-aligned visual representations via contrastive pre-training on human hand keypoints from egocentric video, outperforming DINOv2, SigLIP, MVP, and R3M with >30% gains on real dexterous manipulation tasks.

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

cs.CV · 2026-06-15 · unverdicted · none · ref 15 · internal anchor

Qwen-RobotWorld is a language-conditioned video world model using Double-Stream MMDiT, an 8.6M-frame embodied corpus, and progressive curriculum training that ranks first on EWMBench and DreamGen Bench.

DeepJEB++: Foundation Model-Driven Large-Scale 3D Engineering Dataset via 2D Latent Space Augmentation

cs.LG · 2026-06-11 · unverdicted · none · ref 36 · internal anchor

DeepJEB++ expands a small seed set of jet engine brackets into 15,360 labeled 3D designs via 2D latent diffusion augmentation, VLM filtering, generative 3D lifting, and automated finite-element labeling.

VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning

cs.CV · 2026-06-04 · unverdicted · none · ref 17 · internal anchor

VTI-CoT proposes a visual-textual interleaved chain-of-thought method for video reasoning, built via automated annotation and OCR compression, claiming SOTA performance and better training efficiency on same-scale models.

Organizational Control Layer: Governance Infrastructure at the Execution Boundary of LLM Agent Systems

cs.MA · 2026-06-03 · unverdicted · none · ref 160 · internal anchor

OCL is a governance layer for LLM agents that cuts unsafe executions from 88% to near-zero and raises valid success from 12% to 96% in adversarial buyer-seller negotiations across frontier LLMs.

Self-Prophetic Decoding to Unlock Visual Search in LVLMs

cs.CV · 2026-05-27 · unverdicted · none · ref 11 · internal anchor

SeProD is a plug-and-play self-prophetic decoding framework that combines pre- and post-training LVLM capabilities via probability-based sampling to improve coherent visual search and multi-step reasoning.

Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs

cs.CV · 2026-05-27 · unverdicted · none · ref 3 · internal anchor

Proposes the first unified incomplete video-language model that processes missing modalities and serves as a plug-and-play module to boost existing VLMs on multi-modal tasks.

UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation

cs.CV · 2026-05-20 · unverdicted · none · ref 14 · internal anchor

UniVL unifies vision and language into one mask-rendered input processed by an OCR backbone to condition diffusion models for spatially grounded image generation without a standalone text encoder.

From Scenes to Elements: Multi-Granularity Evidence Retrieval for Verifiable Multimodal RAG

cs.CL · 2026-05-14 · unverdicted · none · ref 2 · internal anchor

GranuRAG retrieves visual elements as first-class units in multimodal RAG via detection, cross-modal alignment, and attribution-constrained generation, improving performance up to 29.2% on the new GranuVistaVQA benchmark.

StyleTextGen: Style-Conditioned Multilingual Scene Text Generation

cs.CV · 2026-05-14 · unverdicted · none · ref 23 · internal anchor

StyleTextGen proposes a dual-branch style encoder, text style consistency loss, and mask-guided inference to achieve superior style consistency and cross-lingual performance in multilingual scene text generation on a new bilingual benchmark.

Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology

cs.CV · 2026-05-05 · unverdicted · none · ref 11 · internal anchor

MLLMs achieve zero-shot recognition of seizure semiological features better than fine-tuned vision models on most tested features, with signal enhancement and faithful explanations.

MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models

eess.IV · 2026-04-24 · unverdicted · none · ref 25 · internal anchor

Fine-tuned multimodal LLMs predict mouse social dominance from raw tube test videos with high agreement to traditional rankings.

ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams

cs.AI · 2026-04-17 · unverdicted · none · ref 1 · internal anchor

ReactBench benchmark shows MLLMs suffer over 30% performance drop on complex topological reasoning tasks versus basic ones when evaluated on chemical reaction diagrams.

AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning

cs.CV · 2026-04-16 · unverdicted · none · ref 25 · internal anchor

AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to novel compositions.

CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

cs.CV · 2026-04-03 · unverdicted · none · ref 35 · internal anchor

CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.

UniRec: Unified Multimodal Encoding for LLM-Based Recommendations

cs.IR · 2026-01-27 · unverdicted · none · ref 7 · internal anchor

UniRec unifies heterogeneous recommendation modalities via specialized encoders, triplet representations, and hierarchical modeling to outperform prior multimodal LLM recommenders by up to 15% on benchmarks.

Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

cs.CV · 2025-12-11 · unverdicted · none · ref 10 · internal anchor

Visual Funnel resolves contextual blindness in MLLMs by constructing an entropy-scaled portfolio of hierarchically structured image crops that preserves both local detail and global context.

A cross-species neural foundation model for end-to-end speech decoding

cs.CL · 2025-11-21 · unverdicted · none · ref 8 · internal anchor

A cross-species pretrained neural encoder combined with end-to-end training and audio LLMs reduces word error rate in neural speech decoding from 24.69% to 10.22% while aligning attempted and imagined speech.

Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models

cs.CV · 2025-11-13 · unverdicted · none · ref 9 · internal anchor

RUDDER creates a persistent visual anchor by extracting CARD from prefill residuals and modulating its injection via an adaptive Beta Gate, cutting CHAIR_S by 24.4% and CHAIR_i by 23.6% on average across LLaVA, Idefics2, InstructBLIP and Qwen2.5-VL with >96% throughput.

Qwen3-Omni Technical Report

cs.CL · 2025-09-22 · unverdicted · none · ref 16 · internal anchor

Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

cs.CV · 2025-07-02 · unverdicted · none · ref 38 · internal anchor

Multimodal foundation models achieve respectable but sub-specialist performance on semantic vision tasks and weaker results on geometric tasks when evaluated through prompt chaining on established benchmarks.

RedDiffuser: Auditing Multimodal Safety Failures in Vision-Language Models via Reinforced Diffusion

cs.CV · 2025-03-08 · unverdicted · none · ref 20 · internal anchor

RedDiffuser is a reinforced diffusion framework that generates adversarial visual contexts to audit and expose widespread multimodal safety failures in VLMs, increasing unsafe response rates by up to 10.69% on LLaVA with transfer to other models.

MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts

cs.CL · 2024-11-22 · unverdicted · none · ref 22 · internal anchor

MolReFlect introduces a teacher-student framework that automatically creates fine-grained molecule-text alignments to achieve SOTA results on molecule-caption translation.

PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

cs.CV · 2024-11-04 · unverdicted · none · ref 10 · internal anchor

PPLLaVA uses CLIP-based alignment and prompt-guided convolution-style pooling to reduce visual tokens 18x in Video LLMs, achieving SOTA results on captioning, QA, and long-form reasoning benchmarks with higher throughput.

ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval

cs.CV · 2024-10-24 · unverdicted · none · ref 16 · internal anchor

Presents ChatSearch dataset and ChatSearcher generative model for conversational image retrieval on open-domain images, claiming superior performance on the new dataset and competitive results elsewhere.

LLaVA-Video: Video Instruction Tuning With Synthetic Data

cs.CV · 2024-10-03 · unverdicted · none · ref 29 · internal anchor

LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation

cs.CV · 2024-05-29 · unverdicted · none · ref 38 · internal anchor

SketchDeco performs training-free sketch colourisation via diffusion inversion to insert user colors followed by custom self-attention blending for local fidelity and global harmony.

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

cs.CV · 2024-03-14 · unverdicted · none · ref 65 · internal anchor

MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

cs.CV · 2024-02-24 · unverdicted · none · ref 52 · internal anchor

NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.

InstantID: Zero-shot Identity-Preserving Generation in Seconds

cs.CV · 2024-01-15 · unverdicted · none · ref 9 · internal anchor

InstantID enables zero-shot identity-preserving image generation from one facial image via a novel IdentityNet that combines strong semantic and weak spatial conditioning with text prompts in diffusion models.

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

cs.CV · 2023-11-16 · unverdicted · none · ref 105 · internal anchor

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

cs.CV · 2023-08-16 · unverdicted · none · ref 14 · internal anchor

DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

cs.CV · 2023-08-13 · unverdicted · none · ref 31 · internal anchor

IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
