EVA: Exploring the limits of masked visual representation learning at scale

Fang, Y · 2022 · arXiv 2211.07636

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method 2 · baseline 1

citation-polarity summary

use method 2 · baseline 1

representative citing papers

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

cs.CV · 2023-05-11 · conditional · novelty 7.0

Instruction tuning of BLIP-2 with an instruction-aware Query Transformer delivers state-of-the-art zero-shot performance on held-out vision-language datasets and strong finetuned results on downstream tasks.

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

cs.CV · 2023-01-30 · unverdicted · novelty 7.0

BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero-shot VQAv2.

XAMI -- A Benchmark Dataset for Artefact Detection in XMM-Newton Optical Images

cs.CV · 2024-06-25 · conditional · novelty 6.0

Introduces the XAMI benchmark dataset of 1000 annotated XMM-Newton images for artefact detection together with a hybrid CNN-transformer instance segmentation demonstration.

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

cs.CV · 2023-04-20 · conditional · novelty 6.0

MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, creative writing, and instruction following.

EVA-CLIP: Improved Training Techniques for CLIP at Scale

cs.CV · 2023-03-27 · conditional · novelty 6.0

EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

cs.CV · 2023-12-21 · unverdicted · novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

cs.CV · 2023-10-14 · unverdicted · novelty 5.0

MiniGPT-v2 adds unique task identifiers to a large language model so one system can perform image description, visual question answering, and visual grounding after three-stage training.

Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors?

cs.CV · 2025-08-14

citing papers explorer

Showing 8 of 8 citing papers.

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning · cs.CV · 2023-05-11 · conditional · none · ref 10
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models · cs.CV · 2023-01-30 · unverdicted · none · ref 5
XAMI -- A Benchmark Dataset for Artefact Detection in XMM-Newton Optical Images · cs.CV · 2024-06-25 · conditional · none · ref 24
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models · cs.CV · 2023-04-20 · conditional · none · ref 7
EVA-CLIP: Improved Training Techniques for CLIP at Scale · cs.CV · 2023-03-27 · conditional · none · ref 20
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks · cs.CV · 2023-12-21 · unverdicted · none · ref 47
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning · cs.CV · 2023-10-14 · unverdicted · none · ref 12
Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors? · cs.CV · 2025-08-14 · unreviewed · ref 12
