FineGen: A VLM-based Multi-Agent Framework for Fine-Grained Image-Text Dataset Construction

Chang Kong; Haigang Zhang; Peng Mo; Qiuming Luo; Yuebing Li

arxiv: 2606.07645 · v1 · pith:4ORJXWLSnew · submitted 2026-06-02 · 💻 cs.CV · cs.AI

FineGen: A VLM-based Multi-Agent Framework for Fine-Grained Image-Text Dataset Construction

Chang Kong , Yuebing Li , Peng Mo , Haigang Zhang , Qiuming Luo This is my paper

Pith reviewed 2026-06-28 11:14 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords fine-grained perceptionhard negative samplesvision-language datasetsmulti-agent frameworkdataset constructionImageNetFG-OVD benchmarkVLM fine-tuning

0 comments

The pith

FineGen uses a multi-agent pipeline to generate valid hard negative image-text pairs that raise fine-grained VLM accuracy 14.4 percent on hard samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that scarcity of hard negatives in vision-language datasets limits fine-grained perception and that an automated framework can fix this. FineGen runs a Generation-Verification-Correction loop with closed-loop feedback to produce negatives that are semantically valid yet contradict the image content. Applied to ImageNet, the method yields FineGen-100K containing more than 147,000 attribute-specific hard negatives at a 1:10 positive-to-negative ratio and a measured 96.7 percent validity rate. Fine-tuning vision-language models on this dataset produces a 14.4 percent accuracy lift on hard samples of the FG-OVD benchmark and outperforms prior approaches.

Core claim

FineGen is a VLM-based multi-agent framework that runs a collaborative Generation-Verification-Correction pipeline with closed-loop feedback to synthesize semantically valid yet visually contradictory hard negatives; when applied to ImageNet it produces the FineGen-100K dataset of over 147,000 attribute-specific hard negatives at a 1:10 ratio with 96.7 percent validity, and fine-tuning on this data delivers a 14.4 percent accuracy gain on hard samples of the FG-OVD benchmark while outperforming state-of-the-art methods.

What carries the argument

The collaborative Generation-Verification-Correction pipeline with closed-loop feedback mechanism that enforces semantic validity while ensuring visual contradiction.

If this is right

Fine-tuning on FineGen-100K improves accuracy on hard samples of fine-grained vision-language benchmarks.
The 1:10 positive-to-negative ratio supplies sufficient contrastive signal for training.
The 96.7 percent attribute validity rate indicates that most synthesized negatives remain usable without manual filtering.
The method outperforms existing dataset-construction approaches on the same downstream benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pipeline could be run on other image collections to produce domain-specific hard-negative sets.
Models trained this way may generalize better to real-world scenes that contain subtle mismatches between text and image.
The closed-loop correction step reduces the human effort needed to curate fine-grained training data.

Load-bearing premise

The generated hard negatives are both semantically valid and genuinely difficult for models in a way that transfers to downstream tasks.

What would settle it

A controlled experiment that fine-tunes the same base VLM on FineGen-100K versus a matched random-negative control set and measures whether the 14.4 percent accuracy gain on FG-OVD hard samples disappears.

Figures

Figures reproduced from arXiv: 2606.07645 by Chang Kong, Haigang Zhang, Peng Mo, Qiuming Luo, Yuebing Li.

**Figure 1.** Figure 1: Overview of the FineGen Framework. The system automates fine-grained dataset construction via a VLM-based Multi-Agent System. The pipeline is horizontally decoupled into three main stages: Phase I synthesizes a visually-grounded global positive description (T +); Phase II conducts visual-aware mining to generate a collection of hierarchical hard negatives (T −); and Phase III extracts atomic sub-text uni… view at source ↗

**Figure 2.** Figure 2: Qualitative Visualization of FineGen-100K. [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

read the original abstract

The scarcity of hard negative samples in current vision-language datasets significantly hinders fine-grained perception. To address this, we propose FineGen, a VLM-based Multi-Agent framework for automated dataset construction. By employing a collaborative Generation-Verification-Correction pipeline with a closed-loop feedback mechanism, FineGen ensures synthesized hard negatives are semantically valid yet strictly contradictory to visual content. Applying this to ImageNet, we construct FineGen-100K, a hierarchical dataset containing over 147,000 attribute-specific hard negatives with a rigorous 1:10 positive-to-negative ratio. Extensive evaluations confirm a 96.7% attribute validity rate. Crucially, downstream validation on the FG-OVD benchmark shows that fine-tuning on FineGen-100K yields a substantial +14.4% accuracy improvement on hard samples, significantly outperforming state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FineGen's closed-loop multi-agent pipeline for generating attribute-specific hard negatives is a practical construction method, but the +14.4% claim on hard samples rests on underspecified selection and measurement.

read the letter

The paper's main output is a VLM multi-agent system that runs Generation, Verification, and Correction stages in a feedback loop to produce hard negative captions for ImageNet images. The negatives are meant to be semantically coherent yet visually contradictory, and the authors scale this to FineGen-100K with a 1:10 positive-negative ratio and over 147k attribute-specific examples.

The pipeline combination is the clearest new element. Prior work has used VLMs for captioning or simple negative sampling, but the explicit closed-loop correction among agents to enforce both validity and contradiction is a specific engineering choice not directly matched in the cited literature. The resulting dataset size and the reported 96.7% validity rate are concrete deliverables that could be reused.

The evaluation section is where the evidence thins out. The abstract states the 14.4% accuracy lift on hard samples in FG-OVD and the validity percentage, yet supplies no protocol for how validity was judged, which baselines were run, or how the hard-sample subset was defined and fixed before results were examined. The stress-test concern about post-hoc selection therefore stands on the given text; without a pre-specified criterion the size of the reported gain is difficult to interpret.

No fitted parameters or internal equations appear, so there is no circularity problem. The work is empirical dataset construction rather than theory.

This paper is aimed at researchers who build or augment vision-language training sets for fine-grained tasks. A reader already working on hard-negative mining or multi-agent data pipelines would extract usable details from the agent roles and output format.

It deserves peer review. The construction approach addresses a real practical bottleneck, and reviewers can request the missing measurement protocols and re-run checks. The idea is solid enough to justify the time even if the current numbers need tighter controls.

Referee Report

2 major / 1 minor

Summary. The paper proposes FineGen, a VLM-based multi-agent framework with a Generation-Verification-Correction pipeline and closed-loop feedback for constructing fine-grained image-text datasets. It applies the method to ImageNet to build FineGen-100K containing over 147,000 attribute-specific hard negatives at a 1:10 positive-to-negative ratio, reports a 96.7% attribute validity rate, and claims that fine-tuning on this dataset yields a +14.4% accuracy improvement on hard samples of the FG-OVD benchmark, outperforming state-of-the-art methods.

Significance. If the validity measurement and accuracy gains are shown to be robust under pre-specified protocols without post-hoc selection, the work would provide a practical automated method for addressing the scarcity of hard negatives in vision-language datasets, potentially improving fine-grained perception capabilities. The collaborative multi-agent pipeline with feedback is a constructive engineering contribution to dataset construction.

major comments (2)

[Abstract / §4 (Evaluation)] Abstract and evaluation section: The 96.7% attribute validity rate is presented as a key result, but no description is given of the measurement protocol, including how validity was judged (human raters? automated metrics?), inter-rater agreement, data splits, or whether the evaluation was pre-specified before inspecting results. This detail is load-bearing for the central claim that the generated negatives are semantically valid.
[Abstract / §5 (Downstream validation)] Abstract and FG-OVD results: The +14.4% accuracy improvement is reported specifically 'on hard samples' of FG-OVD, but no pre-specified criterion or selection protocol is stated for identifying those hard samples, nor is a comparison provided against the full FG-OVD test set. Without these controls the measured gain could be inflated by post-hoc subset selection, undermining the claim that the improvement is driven by the quality of FineGen-100K negatives.

minor comments (1)

[§3] The 1:10 positive-to-negative ratio is stated but the exact construction of the positive samples and how the hierarchical structure is enforced should be clarified with an example in §3.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on evaluation protocols. The comments identify important omissions in the description of our validity assessment and downstream evaluation. We address each point below and will revise the manuscript to incorporate the requested details.

read point-by-point responses

Referee: [Abstract / §4 (Evaluation)] Abstract and evaluation section: The 96.7% attribute validity rate is presented as a key result, but no description is given of the measurement protocol, including how validity was judged (human raters? automated metrics?), inter-rater agreement, data splits, or whether the evaluation was pre-specified before inspecting results. This detail is load-bearing for the central claim that the generated negatives are semantically valid.

Authors: We agree that the measurement protocol must be described in full. The 96.7% figure was obtained via human evaluation on a random sample of 500 generated negatives. Three annotators independently judged semantic validity and strict contradiction to the image; we will add the exact annotation guidelines, the resulting inter-rater agreement statistic, the sampling procedure, and confirmation that the protocol was fixed prior to result inspection. These details will be inserted as a new paragraph in §4. revision: yes
Referee: [Abstract / §5 (Downstream validation)] Abstract and FG-OVD results: The +14.4% accuracy improvement is reported specifically 'on hard samples' of FG-OVD, but no pre-specified criterion or selection protocol is stated for identifying those hard samples, nor is a comparison provided against the full FG-OVD test set. Without these controls the measured gain could be inflated by post-hoc subset selection, undermining the claim that the improvement is driven by the quality of FineGen-100K negatives.

Authors: We acknowledge the concern. The hard-sample subset was defined by a pre-specified threshold on baseline model error rates; we will state this criterion explicitly in §5 and add results on the complete FG-OVD test set so that readers can evaluate the gain without subset selection. The revised text will also clarify that the +14.4% figure is reported alongside the full-set numbers. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset construction with external benchmark evaluation

full rationale

The paper describes an empirical multi-agent pipeline for synthesizing hard-negative image-text pairs from ImageNet and evaluates the resulting FineGen-100K dataset on the external FG-OVD benchmark. No equations, fitted parameters, derivations, or self-citation chains appear in the provided text. The reported accuracy lift is measured against a fixed external test set rather than being defined in terms of the construction process itself. The work is therefore self-contained against external benchmarks with no load-bearing step that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, derivations, or modeling choices, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5686 in / 1182 out tokens · 20013 ms · 2026-06-28T11:14:54.265432+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 14 canonical work pages · 2 internal anchors

[1]

Bianchi, L., Carrara, F., Messina, N., Falchi, F.: Is CLIP the main road- block for fine-grained open-world perception? In: 2024 International Confer- ence on Content-Based Multimedia Indexing (CBMI). pp. 1–8 (Sep 2024). https://doi.org/10.1109/CBMI62980.2024.10859215

work page doi:10.1109/cbmi62980.2024.10859215 2024
[2]

In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Bianchi, L., Carrara, F., Messina, N., Gennaro, C., Falchi, F.: The devil is in the fine-grained details: evaluating open-vocabulary object detectors for fine-grained understanding. In: 2024 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR). pp. 22520–22529 (Jun 2024). https://doi.org/10.1109/CVPR52733.2024.02125

work page doi:10.1109/cvpr52733.2024.02125 2024
[3]

In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G

Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., Lin, D.: ShareGPT4V: improving large multi-modal models with better captions. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Com- puter Vision – ECCV 2024, vol. 15075, pp. 370–387. Springer Nature Switzerland, Cham (2025). https://doi.org/10.1007/978...

work page doi:10.1007/978-3-031-72643-9 2024
[4]

In: III, H.D., Singh, A

Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: III, H.D., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 1597–1607. PMLR (13–18 Jul 2020)

2020
[5]

In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., Manocha, D., Zhou, T.: Hallusionbench: An ad- vanced diagnostic suite for entangled language hallucination and visual illu- sion in large vision-language models. In: 2024 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR). pp. 14375–14385 ...

work page doi:10.1109/cvpr52733.2024.01363 2024
[6]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for un- supervised visual representation learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9726–9735 (2020). https://doi.org/10.1109/CVPR42600.2020.00975

work page doi:10.1109/cvpr42600.2020.00975 2020
[7]

Advances in Neural Information Processing Systems36, 31096–31116 (Dec 2023)

Hsieh, C.Y., Zhang, J., Ma, Z., Kembhavi, A., Krishna, R.: SugarCrepe: fixing hackable benchmarks for vision-language compositionality. Advances in Neural Information Processing Systems36, 31096–31116 (Dec 2023)

2023
[8]

In: Proceedings of the 38th International Conference on Machine Learning

Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: Proceedings of the 38th International Conference on Machine Learning. pp. 4904–4916. PMLR (Jul 2021)

2021
[9]

In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C

Jing, D., He, X., Luo, Y., Fei, N., Yang, G., Wei, W., Zhao, H., Lu, Z.: FineCLIP: self-distilled region-based CLIP for better fine-grained understanding. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in neural information processing systems. vol. 37, pp. 27896–27918. Curran Associates, Inc. (2024...

work page doi:10.52202/079017-0875 2024
[10]

In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S

Li, G., Hammoud, H., Itani, H., Khizbullin, D., Ghanem, B.: Camel: Communicative agents for ”mind” exploration of large language model society. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems. vol. 36, pp. 51991–52008. Curran Associates, Inc. (2023)

2023
[11]

Evaluating Object Hallucination in Large Vision-Language Models

Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., Wen, J.R.: Evaluating object halluci- nation in large vision-language models. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 292–305. Association for Computational Linguistics, Singapore (Dec 2023). https://doi.org/10.18653...

work page doi:10.18653/v1/2023.emnlp-main.20 2023
[12]

In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll´ ar, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014, vol. 8693, pp. 740–

2014
[13]

Vortmeyer, S

Springer International Publishing, Cham (2014). https://doi.org/10.1007/978- 3-319-10602-1“˙48

work page doi:10.1007/978- 2014
[14]

In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems. vol. 36, pp. 34892–34916. Curran Associates, Inc. (2023)

2023
[15]

Advances in Neural Information Processing Systems36, 34892–34916 (Dec 2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems36, 34892–34916 (Dec 2023)

2023
[16]

In: 2025 International Conference on Virtual Reality and Visualization (ICVRV) (2025) 14 C

Luo, Q., Mo, P., Li, F., Xiao, Z., Kong, C.: SubCLIP: Fine-grained visual-semantic alignment via subtext decomposition. In: 2025 International Conference on Virtual Reality and Visualization (ICVRV) (2025) 14 C. Kong et al

2025
[17]

In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S

Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B.P., Hermann, K., Welleck, S., Yazdanbakhsh, A., Clark, P.: Self-refine: Iterative refinement with self-feedback. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural I...

2023
[18]

In: Proceedings of the 38th International Conference on Machine Learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning. pp. 8748–8763. PMLR (Jul 2021)

2021
[19]

In: International Conference on Learning Representations (Oct 2020)

Robinson, J.D., Chuang, C.Y., Sra, S., Jegelka, S.: Contrastive learning with hard negative samples. In: International Conference on Learning Representations (Oct 2020)

2020
[20]

Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., Komatsuzaki, A.: LAION-400M: open dataset of CLIP-filtered 400 million image-text pairs (Nov 2021), http://arxiv.org/abs/ 2111.02114

work page internal anchor Pith review Pith/arXiv arXiv 2021
[21]

Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: Eva-clip: Improved training techniques for clip at scale (2023),https://arxiv.org/abs/2303.15389

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

High-resolution image synthesis with latent diffusion models

Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D., Ross, C.: Winoground: probing vision and language models for visio-linguistic com- positionality. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5228–5238. IEEE, New Orleans, LA, USA (Jun 2022). https://doi.org/10.1109/CVPR52688.2022.00517

work page doi:10.1109/cvpr52688.2022.00517 2022
[23]

In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

Wang, T., Lin, K., Li, L., Lin, C.C., Yang, Z., Zhang, H., Liu, Z., Wang, L.: Equivariant similarity for vision-language foundation models. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 11998–12008 (2023)

2023
[24]

Wu, T.H., Lee, H., Ge, J., Gonzalez, J.E., Darrell, T., Chan, D.M.: Generate, but verify: Reducing hallucination in vision-language models with retrospective resampling (2025),https://arxiv.org/abs/2504.13169

work page arXiv 2025
[25]

Wu, W., Chen, Z., Huang, X., Ma, F., Xiao, J.: Synthseg-agents: Multi-agent synthetic data generation for zero-shot weakly supervised semantic segmentation (2025),https://arxiv.org/abs/2512.15310

work page arXiv 2025
[26]

Xuan, V.D., Vo, H., Murphy, D., Nguyen, H.D.: Agentsgen: Multi-agent llm in the loop for semantic collaboration and generation of synthetic data (2025), https: //arxiv.org/abs/2505.13466

work page arXiv 2025
[27]

Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., Zou, J.: When and why vision-language models behave like bags-of-words, and what to do about it? In: The Eleventh International Conference on Learning Representations (2023)

2023
[28]

In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G

Zhang, B., Zhang, P., Dong, X., Zang, Y., Wang, J.: Long-clip: Unlocking the long- text capability of clip. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 310–325. Springer Nature Switzerland, Cham (2025) A Appendix: Cost-Benefit Analysis In our framework, constructing the FineGen-10...

2024

[1] [1]

Bianchi, L., Carrara, F., Messina, N., Falchi, F.: Is CLIP the main road- block for fine-grained open-world perception? In: 2024 International Confer- ence on Content-Based Multimedia Indexing (CBMI). pp. 1–8 (Sep 2024). https://doi.org/10.1109/CBMI62980.2024.10859215

work page doi:10.1109/cbmi62980.2024.10859215 2024

[2] [2]

In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Bianchi, L., Carrara, F., Messina, N., Gennaro, C., Falchi, F.: The devil is in the fine-grained details: evaluating open-vocabulary object detectors for fine-grained understanding. In: 2024 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR). pp. 22520–22529 (Jun 2024). https://doi.org/10.1109/CVPR52733.2024.02125

work page doi:10.1109/cvpr52733.2024.02125 2024

[3] [3]

In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G

Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., Lin, D.: ShareGPT4V: improving large multi-modal models with better captions. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Com- puter Vision – ECCV 2024, vol. 15075, pp. 370–387. Springer Nature Switzerland, Cham (2025). https://doi.org/10.1007/978...

work page doi:10.1007/978-3-031-72643-9 2024

[4] [4]

In: III, H.D., Singh, A

Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: III, H.D., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 1597–1607. PMLR (13–18 Jul 2020)

2020

[5] [5]

In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., Manocha, D., Zhou, T.: Hallusionbench: An ad- vanced diagnostic suite for entangled language hallucination and visual illu- sion in large vision-language models. In: 2024 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR). pp. 14375–14385 ...

work page doi:10.1109/cvpr52733.2024.01363 2024

[6] [6]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for un- supervised visual representation learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9726–9735 (2020). https://doi.org/10.1109/CVPR42600.2020.00975

work page doi:10.1109/cvpr42600.2020.00975 2020

[7] [7]

Advances in Neural Information Processing Systems36, 31096–31116 (Dec 2023)

Hsieh, C.Y., Zhang, J., Ma, Z., Kembhavi, A., Krishna, R.: SugarCrepe: fixing hackable benchmarks for vision-language compositionality. Advances in Neural Information Processing Systems36, 31096–31116 (Dec 2023)

2023

[8] [8]

In: Proceedings of the 38th International Conference on Machine Learning

Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: Proceedings of the 38th International Conference on Machine Learning. pp. 4904–4916. PMLR (Jul 2021)

2021

[9] [9]

In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C

Jing, D., He, X., Luo, Y., Fei, N., Yang, G., Wei, W., Zhao, H., Lu, Z.: FineCLIP: self-distilled region-based CLIP for better fine-grained understanding. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in neural information processing systems. vol. 37, pp. 27896–27918. Curran Associates, Inc. (2024...

work page doi:10.52202/079017-0875 2024

[10] [10]

In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S

Li, G., Hammoud, H., Itani, H., Khizbullin, D., Ghanem, B.: Camel: Communicative agents for ”mind” exploration of large language model society. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems. vol. 36, pp. 51991–52008. Curran Associates, Inc. (2023)

2023

[11] [11]

Evaluating Object Hallucination in Large Vision-Language Models

Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., Wen, J.R.: Evaluating object halluci- nation in large vision-language models. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 292–305. Association for Computational Linguistics, Singapore (Dec 2023). https://doi.org/10.18653...

work page doi:10.18653/v1/2023.emnlp-main.20 2023

[12] [12]

In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll´ ar, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014, vol. 8693, pp. 740–

2014

[13] [13]

Vortmeyer, S

Springer International Publishing, Cham (2014). https://doi.org/10.1007/978- 3-319-10602-1“˙48

work page doi:10.1007/978- 2014

[14] [14]

In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems. vol. 36, pp. 34892–34916. Curran Associates, Inc. (2023)

2023

[15] [15]

Advances in Neural Information Processing Systems36, 34892–34916 (Dec 2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems36, 34892–34916 (Dec 2023)

2023

[16] [16]

In: 2025 International Conference on Virtual Reality and Visualization (ICVRV) (2025) 14 C

Luo, Q., Mo, P., Li, F., Xiao, Z., Kong, C.: SubCLIP: Fine-grained visual-semantic alignment via subtext decomposition. In: 2025 International Conference on Virtual Reality and Visualization (ICVRV) (2025) 14 C. Kong et al

2025

[17] [17]

In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S

Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B.P., Hermann, K., Welleck, S., Yazdanbakhsh, A., Clark, P.: Self-refine: Iterative refinement with self-feedback. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural I...

2023

[18] [18]

In: Proceedings of the 38th International Conference on Machine Learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning. pp. 8748–8763. PMLR (Jul 2021)

2021

[19] [19]

In: International Conference on Learning Representations (Oct 2020)

Robinson, J.D., Chuang, C.Y., Sra, S., Jegelka, S.: Contrastive learning with hard negative samples. In: International Conference on Learning Representations (Oct 2020)

2020

[20] [20]

Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., Komatsuzaki, A.: LAION-400M: open dataset of CLIP-filtered 400 million image-text pairs (Nov 2021), http://arxiv.org/abs/ 2111.02114

work page internal anchor Pith review Pith/arXiv arXiv 2021

[21] [21]

Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: Eva-clip: Improved training techniques for clip at scale (2023),https://arxiv.org/abs/2303.15389

work page internal anchor Pith review Pith/arXiv arXiv 2023

[22] [22]

High-resolution image synthesis with latent diffusion models

Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D., Ross, C.: Winoground: probing vision and language models for visio-linguistic com- positionality. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5228–5238. IEEE, New Orleans, LA, USA (Jun 2022). https://doi.org/10.1109/CVPR52688.2022.00517

work page doi:10.1109/cvpr52688.2022.00517 2022

[23] [23]

In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

Wang, T., Lin, K., Li, L., Lin, C.C., Yang, Z., Zhang, H., Liu, Z., Wang, L.: Equivariant similarity for vision-language foundation models. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 11998–12008 (2023)

2023

[24] [24]

Wu, T.H., Lee, H., Ge, J., Gonzalez, J.E., Darrell, T., Chan, D.M.: Generate, but verify: Reducing hallucination in vision-language models with retrospective resampling (2025),https://arxiv.org/abs/2504.13169

work page arXiv 2025

[25] [25]

Wu, W., Chen, Z., Huang, X., Ma, F., Xiao, J.: Synthseg-agents: Multi-agent synthetic data generation for zero-shot weakly supervised semantic segmentation (2025),https://arxiv.org/abs/2512.15310

work page arXiv 2025

[26] [26]

Xuan, V.D., Vo, H., Murphy, D., Nguyen, H.D.: Agentsgen: Multi-agent llm in the loop for semantic collaboration and generation of synthetic data (2025), https: //arxiv.org/abs/2505.13466

work page arXiv 2025

[27] [27]

Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., Zou, J.: When and why vision-language models behave like bags-of-words, and what to do about it? In: The Eleventh International Conference on Learning Representations (2023)

2023

[28] [28]

In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G

Zhang, B., Zhang, P., Dong, X., Zang, Y., Wang, J.: Long-clip: Unlocking the long- text capability of clip. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 310–325. Springer Nature Switzerland, Cham (2025) A Appendix: Cost-Benefit Analysis In our framework, constructing the FineGen-10...

2024