Multi-grained vi- sion language pre-training: Aligning texts with visual con- cepts

Yan Zeng, Xinsong Zhang, Hang Li · 2021 · arXiv 2111.08276

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

representative citing papers

InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.

WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

cs.CV · 2026-03-10 · unverdicted · novelty 7.0

WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.

Stitch-a-Demo: Video Demonstrations from Multistep Descriptions

cs.CV · 2025-03-18 · unverdicted · novelty 7.0

Stitch-a-Demo is a retrieval-based method that assembles visually coherent video demonstrations from multistep textual descriptions by training on weakly supervised procedural data with hard negatives.

The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding

cs.CV · 2024-12-11 · unverdicted · novelty 7.0

CompART adds a composition loss on decomposed captions to regularize attention sums and improves multi-object grounding plus VQA across four VLM types and six benchmarks.

ViperGPT: Visual Inference via Python Execution for Reasoning

cs.CV · 2023-03-14 · unverdicted · novelty 7.0

ViperGPT generates executable Python code to compose pre-trained vision-and-language modules into programs that answer visual queries, reaching state-of-the-art results with no additional training.

Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

cs.LG · 2023-10-01 · conditional · novelty 6.0

LURE reduces object hallucination in LVLMs by 23% via post-hoc revision informed by co-occurrence, uncertainty, and text position analysis.

citing papers explorer

Showing 6 of 6 citing papers.

InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding cs.CV · 2026-04-09 · unverdicted · none · ref 67
InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition cs.CV · 2026-03-10 · unverdicted · none · ref 46
WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.
Stitch-a-Demo: Video Demonstrations from Multistep Descriptions cs.CV · 2025-03-18 · unverdicted · none · ref 74
Stitch-a-Demo is a retrieval-based method that assembles visually coherent video demonstrations from multistep textual descriptions by training on weakly supervised procedural data with hard negatives.
The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding cs.CV · 2024-12-11 · unverdicted · none · ref 59
CompART adds a composition loss on decomposed captions to regularize attention sums and improves multi-object grounding plus VQA across four VLM types and six benchmarks.
ViperGPT: Visual Inference via Python Execution for Reasoning cs.CV · 2023-03-14 · unverdicted · none · ref 65
ViperGPT generates executable Python code to compose pre-trained vision-and-language modules into programs that answer visual queries, reaching state-of-the-art results with no additional training.
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models cs.LG · 2023-10-01 · conditional · none · ref 20
LURE reduces object hallucination in LVLMs by 23% via post-hoc revision informed by co-occurrence, uncertainty, and text position analysis.

Multi-grained vi- sion language pre-training: Aligning texts with visual con- cepts

fields

years

verdicts

representative citing papers

citing papers explorer