Gpt-4v (ision) as a general- ist evaluator for vision-language tasks

Xinlu Zhang, Yujie Lu, Weizhi Wang, An Yan, Jun Yan, Lianke Qin, Heng Wang, Xifeng Yan, William Yang Wang, Linda Ruth Petzold · 2023 · arXiv 2311.01361

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

Masked Logit Nudging aligns visual autoregressive model logits with source token maps under target prompts inside cross-attention masks, delivering top image editing results on PIE benchmarks and strong reconstructions on COCO and OpenImages while running faster than diffusion approaches.

T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts

cs.CV · 2024-12-05 · unverdicted · novelty 7.0

T2I-FactualBench is a new three-tier benchmark for factuality of knowledge-intensive concepts in T2I models, using multi-round VQA evaluation to show SOTA models need improvement.

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

cs.CV · 2024-12-30 · unverdicted · novelty 6.0

VisionReward learns multi-dimensional human preferences for image and video generation via hierarchical assessment and linear weighting, outperforming VideoScore by 17.2% in prediction accuracy and yielding 31.6% higher win rates in text-to-video models.

GPT-4V(ision) is a Generalist Web Agent, if Grounded

cs.IR · 2024-01-03 · conditional · novelty 6.0

GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.

Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics

cs.CY · 2026-04-02 · unverdicted · novelty 5.0 · 2 refs

Case studies with blind UK residents and people from Kerala and Tamil Nadu demonstrate that community input at the systematization stage produces culturally grounded definitions of appropriateness for text-to-image model outputs.

citing papers explorer

Showing 5 of 5 citing papers.

Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models cs.CV · 2026-04-16 · unverdicted · none · ref 51
Masked Logit Nudging aligns visual autoregressive model logits with source token maps under target prompts inside cross-attention masks, delivering top image editing results on PIE benchmarks and strong reconstructions on COCO and OpenImages while running faster than diffusion approaches.
T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts cs.CV · 2024-12-05 · unverdicted · none · ref 62
T2I-FactualBench is a new three-tier benchmark for factuality of knowledge-intensive concepts in T2I models, using multi-round VQA evaluation to show SOTA models need improvement.
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation cs.CV · 2024-12-30 · unverdicted · none · ref 34
VisionReward learns multi-dimensional human preferences for image and video generation via hierarchical assessment and linear weighting, outperforming VideoScore by 17.2% in prediction accuracy and yielding 31.6% higher win rates in text-to-video models.
GPT-4V(ision) is a Generalist Web Agent, if Grounded cs.IR · 2024-01-03 · conditional · none · ref 27
GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.
Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics cs.CY · 2026-04-02 · unverdicted · none · ref 132 · 2 links
Case studies with blind UK residents and people from Kerala and Tamil Nadu demonstrate that community input at the systematization stage produces culturally grounded definitions of appropriateness for text-to-image model outputs.

Gpt-4v (ision) as a general- ist evaluator for vision-language tasks

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer