Learning to prompt for vision-language models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu · 2021 · International Journal of Computer Vision · DOI 10.1007/s11263-022-01653-1 · arXiv 2109.01134

14 Pith papers cite this work, alongside 2,607 external citations. Polarity classification is still indexing.

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

PERL: Parameter Efficient Reasoning in CLIP Latent Space

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

PERL augments frozen CLIP with a shared recurrent reasoning module of roughly 6K parameters that iteratively refines representations via latent token injection, delivering strong base-to-novel and transfer performance across 15 benchmarks.

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.

Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model

cs.CV · 2026-05-04 · unverdicted · novelty 7.0

The paper introduces the VODA setting for domain adaptation from scratch using vision-language models and presents TS-DRD, which achieves competitive performance on standard benchmarks without source models.

TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning

cs.CV · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

TB-AVA uses text-mediated gated semantic modulation to enable efficient audio-visual alignment, achieving state-of-the-art results on AVE, AVS, and AVVP benchmarks.

FACTOR: Counterfactual Training-Free Test-Time Adaptation for Open-Vocabulary Object Detection

cs.CV · 2026-05-05 · unverdicted · novelty 6.0

FACTOR uses counterfactual image perturbations to quantify and suppress attribute-dependent predictions in open-vocabulary object detection, improving robustness on corrupted datasets without any training.

Unified Multimodal Brain Decoding via Cross-Subject Soft-ROI Fusion

cs.LG · 2025-12-23 · unverdicted · novelty 6.0

BrainROI achieves leading cross-subject brain-captioning results on NSD by combining multi-atlas soft-ROI fusion with interpretable prompt optimization.

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

SAGE adds duality consistency as an auxiliary reward in GRPO training with a dynamic operation pool to improve spatial reasoning robustness and generalization in VLMs.

GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs

cs.CV · 2026-05-07 · unverdicted · novelty 5.0

GeoStack composes multiple domain experts into VLMs with preserved base knowledge and O(1) inference time via geometric stacking and a weight-folding property.

Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction

cs.MM · 2026-04-22 · unverdicted · novelty 5.0

A new joint spatio-temporal enlargement model for micro-video popularity prediction using frame scoring for long sequences and a topology-aware memory bank for unbounded historical associations.

Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models

cs.CV · 2026-04-07 · unverdicted · novelty 5.0

VANGUARD is a staged-training VLM framework that reports 94% ROC-AUC and 84% F1 on UCF-Crime while adding chain-of-thought reasoning and spatial grounding to video anomaly detection.

DetailCLIP: Injecting Image Details into CLIP's Feature Space

cs.CV · 2022-08-31 · unverdicted · novelty 5.0

A patch-based fusion method extends CLIP to high-resolution images by retaining multi-scale details for improved class-prompted retrieval.

Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation

cs.CV · 2026-05-13 · unverdicted · novelty 4.0

Grad-ECLIP is an equivalent but flawed variant of attention-based interpretation, with two principles proposed to ensure model explanations reflect the original model.

ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification

cs.LG · 2026-04-20 · unverdicted · novelty 4.0

ProtoCLIP improves zero-shot chest X-ray classification in CLIP models by 2-10 AUC points via curated data and prototype-aligned distillation, reaching 0.94 AUC for pneumothorax on VinDr-CXR.

citing papers explorer

Showing 14 of 14 citing papers.

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
cs.CV · 2022-08-02 · unverdicted · none · ref 34
Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

PERL: Parameter Efficient Reasoning in CLIP Latent Space
cs.CV · 2026-05-18 · unverdicted · none · ref 38
PERL augments frozen CLIP with a shared recurrent reasoning module of roughly 6K parameters that iteratively refines representations via latent token injection, delivering strong base-to-novel and transfer performance across 15 benchmarks.

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
cs.CL · 2026-05-16 · unverdicted · none · ref 178
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.

Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model
cs.CV · 2026-05-04 · unverdicted · none · ref 16
The paper introduces the VODA setting for domain adaptation from scratch using vision-language models and presents TS-DRD, which achieves competitive performance on standard benchmarks without source models.

TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning
cs.CV · 2026-05-12 · unverdicted · none · ref 33 · 2 links
TB-AVA uses text-mediated gated semantic modulation to enable efficient audio-visual alignment, achieving state-of-the-art results on AVE, AVS, and AVVP benchmarks.

FACTOR: Counterfactual Training-Free Test-Time Adaptation for Open-Vocabulary Object Detection
cs.CV · 2026-05-05 · unverdicted · none · ref 38
FACTOR uses counterfactual image perturbations to quantify and suppress attribute-dependent predictions in open-vocabulary object detection, improving robustness on corrupted datasets without any training.

Unified Multimodal Brain Decoding via Cross-Subject Soft-ROI Fusion
cs.LG · 2025-12-23 · unverdicted · none · ref 12
BrainROI achieves leading cross-subject brain-captioning results on NSD by combining multi-atlas soft-ROI fusion with interpretable prompt optimization.

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency
cs.CV · 2026-05-18 · unverdicted · none · ref 73
SAGE adds duality consistency as an auxiliary reward in GRPO training with a dynamic operation pool to improve spatial reasoning robustness and generalization in VLMs.

GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs
cs.CV · 2026-05-07 · unverdicted · none · ref 14
GeoStack composes multiple domain experts into VLMs with preserved base knowledge and O(1) inference time via geometric stacking and a weight-folding property.

Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction
cs.MM · 2026-04-22 · unverdicted · none · ref 81
A new joint spatio-temporal enlargement model for micro-video popularity prediction using frame scoring for long sequences and a topology-aware memory bank for unbounded historical associations.

Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models
cs.CV · 2026-04-07 · unverdicted · none · ref 6
VANGUARD is a staged-training VLM framework that reports 94% ROC-AUC and 84% F1 on UCF-Crime while adding chain-of-thought reasoning and spatial grounding to video anomaly detection.

DetailCLIP: Injecting Image Details into CLIP's Feature Space
cs.CV · 2022-08-31 · unverdicted · none · ref 32
A patch-based fusion method extends CLIP to high-resolution images by retaining multi-scale details for improved class-prompted retrieval.

Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation
cs.CV · 2026-05-13 · unverdicted · none · ref 43
Grad-ECLIP is an equivalent but flawed variant of attention-based interpretation, with two principles proposed to ensure model explanations reflect the original model.

ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification
cs.LG · 2026-04-20 · unverdicted · none · ref 12
ProtoCLIP improves zero-shot chest X-ray classification in CLIP models by 2-10 AUC points via curated data and prototype-aligned distillation, reaching 0.94 AUC for pneumothorax on VinDr-CXR.
