pith. machine review for the scientific record.

arxiv: 2604.12357 · v1 · submitted 2026-04-14 · 💻 cs.AI · cs.CV

Recognition: unknown

ReflectCAP: Detailed Image Captioning with Reflective Memory

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:05 UTC · model grok-4.3

classification 💻 cs.AI cs.CV
keywords detailed image captioning · reflective notes · large vision-language models · factuality · coverage · multi-agent analysis · structured reflection · CapArena-Auto

The pith

ReflectCAP distills patterns of what vision-language models hallucinate or overlook into reusable notes that steer them toward more factual and complete image captions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Reflective Note-Guided Captioning (ReflectCAP) to resolve the tension between factual accuracy and comprehensive coverage in detailed image captions. A multi-agent analysis first identifies what a target large vision-language model tends to invent and what it tends to miss, then condenses those findings into structured reflection notes. At generation time the notes tell the model both what to avoid and what to emphasize, improving factuality and coverage across the tested models. This matters because detailed captions support applications like accessibility tools and image search, yet prior approaches either sacrifice accuracy for detail or require heavy extra computation.

Core claim

ReflectCAP uses a multi-agent pipeline to analyze consistent hallucination and oversight patterns in a target LVLM, then distills those patterns into reusable Structured Reflection Notes. Applied during inference, the notes direct the model along both the avoidance and attention axes, producing captions that jointly advance factuality and coverage while lowering compute overhead relative to model scaling or other multi-agent baselines.
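
To make the offline phase concrete, here is a minimal sketch of the note-distillation loop under stated assumptions: generate_caption, find_hallucinations, and find_omissions are hypothetical stand-ins for the target LVLM call and the analysis agents, and the merge step keeps at most K items per category, matching the note-size ablation reported in Figure 5.

```python
# Minimal sketch of the offline note-distillation loop described above.
# `generate_caption`, `find_hallucinations`, and `find_omissions` are
# hypothetical stand-ins for the target LVLM call and the analysis agents.

def summarize(issues):
    """Placeholder merge: deduplicate while preserving order. The paper
    describes an agent that merges similar issues into general one-line rules."""
    return list(dict.fromkeys(issues))

def distill_notes(exemplars, generate_caption, find_hallucinations,
                  find_omissions, max_items=5):
    """Analyze a small exemplar set and condense recurring issues into notes."""
    avoid, include = [], []
    for image, reference in exemplars:
        candidate = generate_caption(image)
        avoid.extend(find_hallucinations(candidate, image))   # invented details
        include.extend(find_omissions(candidate, reference))  # overlooked details
    # Keep at most max_items per category (K in the paper's ablation;
    # even K = 1 is reported to yield strong gains).
    return {
        "avoid": summarize(avoid)[:max_items],
        "include": summarize(include)[:max_items],
    }
```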

What carries the argument

Structured Reflection Notes: reusable guidelines distilled from multi-agent analysis of a given LVLM's hallucination and oversight patterns that tell the model what to avoid and what to attend to during caption generation.
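
As one plausible concrete shape for such a note, the sketch below shows a two-field record and its injection into the captioning prompt. The dataclass fields and template wording are assumptions; the bracketed headers and example items echo the note format excerpted in the paper's appendix.

```python
# Illustrative schema for Structured Reflection Notes and their injection
# into the captioning prompt. Field names and template wording are assumptions.
from dataclasses import dataclass, field

@dataclass
class ReflectionNotes:
    avoid: list[str] = field(default_factory=list)    # hallucination patterns to suppress
    include: list[str] = field(default_factory=list)  # overlooked details to attend to

    def to_prompt(self) -> str:
        avoid_block = "\n".join(f"- {item}" for item in self.avoid)
        include_block = "\n".join(f"- {item}" for item in self.include)
        return (
            "Describe the image in detail.\n"
            f"[Hallucination - Avoid These]:\n{avoid_block}\n"
            f"[Missing Detail - Include These]:\n{include_block}"
        )

notes = ReflectionNotes(
    avoid=["Avoid speculative or inferred details about materials, styles, or dates"],
    include=["Include key garment construction details (collars, closures, pockets, cuffs)"],
)
print(notes.to_prompt())
```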

If this is right

  • ReflectCAP reaches the Pareto frontier of the factuality-coverage trade-off across eight tested LVLMs.
  • It produces substantial gains on head-to-head CapArena-Auto evaluations against strong reference models.
  • The method achieves higher caption quality at lower compute cost than model scaling.
  • It avoids the 21 to 36 percent extra overhead incurred by existing multi-agent pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same error-pattern distillation process could be applied to other vision-language tasks where models repeatedly hallucinate or omit elements.
  • Once created, the notes might transfer across different images or even different base models without retraining.
  • This points toward a general strategy of converting observed model weaknesses into lightweight, reusable instructions rather than increasing model size or inference rounds.

Load-bearing premise

The hallucination and oversight patterns found by the multi-agent analysis remain consistent enough across images to be captured in reusable notes that improve results without creating new errors or biases.

What would settle it

Run the notes on a fresh held-out image set. If captions produced with the notes show no gain, or a loss, in combined factuality-coverage scores compared with the same model without the notes, the core claim fails.
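
A hedged sketch of that comparison, with caption, factuality, and coverage as hypothetical stand-ins for the model call and the two metrics:

```python
# Sketch of the settling experiment: the same model captions a held-out set
# with and without the notes, and the combined factuality-coverage F1 is
# compared. `caption`, `factuality`, and `coverage` are hypothetical helpers.

def f1(factual: float, covered: float) -> float:
    """Harmonic mean of factuality and coverage, both in [0, 1]."""
    return 0.0 if factual + covered == 0 else 2 * factual * covered / (factual + covered)

def note_gain(model, notes, held_out, caption, factuality, coverage) -> float:
    """Mean per-image change in combined score when notes are applied.
    A result <= 0 on a fresh held-out set would undercut the core claim."""
    deltas = []
    for image in held_out:
        base = caption(model, image)                 # zero-shot caption
        guided = caption(model, image, notes=notes)  # note-guided caption
        deltas.append(
            f1(factuality(guided, image), coverage(guided, image))
            - f1(factuality(base, image), coverage(base, image))
        )
    return sum(deltas) / len(deltas)
```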

Figures

Figures reproduced from arXiv: 2604.12357 by Kang-il Lee, Kyomin Jung, Kyungmin Min, Minbeom Kim, Seunghyun Yoon.

Figure 1
Overview of ReflectCAP. In the offline phase, a multi-agent reflective learning pipeline distills a target LVLM's recurring captioning errors and omissions into Structured Reflection Notes. In the online phase, these notes guide caption generation for new images, producing captions that better balance factuality and coverage. view at source ↗
Figure 2
ReflectCAP framework. In the offline phase, a multi-agent pipeline analyzes a small exemplar set to distill recurring errors and omissions of the target LVLM into Structured Reflection Notes. In the online phase, these notes guide caption generation: Avoid Notes suppress hallucinations, Include Notes encourage missing details, and a final merge integrates grounded and detail-focused captions into the final… view at source ↗
Figure 3
Solid and dash-dotted lines denote improvements from zero-shot to ReflectCAP and CapMAS, respectively. ReflectCAP achieves higher F1 scores while requiring 21–36% less compute than CapMAS. Light dashed lines denote performance gains from model parameter scaling. Compared to simply increasing model size, ReflectCAP achieves comparable quality at up to 8× lower compute cost, enabling high-quality, detailed… view at source ↗
Figure 4
Factuality comparison between zero-shot and Grounded Base Caption across all models. Models with stronger instruction-following capabilities show larger gains. view at source ↗
Figure 5
Ablation on note construction parameters. (a) F1 vs. the number of exemplar images N. Performance saturates at N ≈ 30, indicating that systematic error patterns can be surfaced from a modest exemplar set. (b) F1 vs. the maximum number of note items per category K. Even K = 1 already yields strong gains, with performance improving slightly further at K = 5. view at source ↗
Figure 6
Case study of our pipeline. Top: zero-shot caption. Middle: ReflectCAP-Base suppresses hallucinations via Avoid notes. Bottom: ReflectCAP-Full recovers embossed text details guided by Include notes. Red denotes hallucinated expressions, blue denotes hallucination-corrected descriptions, and green denotes recovered fine-grained details. view at source ↗
Figure 7
Success case. Error notes correct zero-shot hallucinations (red → green), and the extract-merge step successfully adds verifiable details (blue, ✓). view at source ↗
Figure 8
Limitation case. Error notes correct zero-shot hallucinations (red → green), but the extract-merge step introduces a new spatial error (blue, ✗) when following a missing-detail note that exceeds the target model's perceptual competence. view at source ↗
Figure 9
Fashion domain qualitative examples. Zero-shot captions produce generic descriptions (e.g., "multiple zippers and buttons," "a classic lapel"), while ReflectCAP generates domain-appropriate captions with precise garment construction vocabulary. Green denotes fashion-specific details recovered by the Structured Reflection Notes. view at source ↗
read the original abstract

Detailed image captioning demands both factual grounding and fine-grained coverage, yet existing methods have struggled to achieve them simultaneously. We address this tension with Reflective Note-Guided Captioning (ReflectCAP), where a multi-agent pipeline analyzes what the target large vision-language model (LVLM) consistently hallucinates and what it systematically overlooks, distilling these patterns into reusable guidelines called Structured Reflection Notes. At inference time, these notes steer the captioning model along both axes, what to avoid and what to attend to, yielding detailed captions that jointly improve factuality and coverage. Applying this method to 8 LVLMs spanning the GPT-4.1 family, Qwen series, and InternVL variants, ReflectCAP reaches the Pareto frontier of the trade-off between factuality and coverage, and delivers substantial gains on CapArena-Auto, where generated captions are judged head-to-head against strong reference models. Moreover, ReflectCAP offers a more favorable trade-off between caption quality and compute cost than model scaling or existing multi-agent pipelines, which incur 21–36% greater overhead. This makes high-quality detailed captioning viable under real-world cost and latency constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ReflectCAP, a method for detailed image captioning that uses a multi-agent pipeline to identify consistent hallucinations and systematic oversights in target LVLMs, distills these patterns into reusable Structured Reflection Notes, and applies the notes at inference time to steer the model toward improved factuality and coverage. Experiments apply the approach to 8 LVLMs across GPT-4.1, Qwen, and InternVL families, claiming Pareto-frontier performance on the factuality-coverage trade-off, substantial gains on the CapArena-Auto head-to-head benchmark, and a superior quality-compute trade-off relative to model scaling or prior multi-agent pipelines, which incur 21–36% greater overhead.

Significance. If the generalizability of the Structured Reflection Notes holds and the evaluation details are supplied, the work could offer a practical, low-overhead route to higher-quality detailed captions without relying on larger models. The reported compute advantage over scaling and multi-agent baselines would be a meaningful contribution for resource-constrained deployment. At present, however, the absence of metric definitions, statistical controls, and transfer evidence limits the assessed impact.

major comments (2)
  1. [Abstract] The claims of Pareto-frontier performance and substantial gains on CapArena-Auto rest on unstated details: how factuality and coverage are measured, which baselines are used, whether statistical significance was assessed, the diversity of the image set, and any controls for bias in the multi-agent analysis that produced the notes.
  2. [Method] The central claim requires that hallucination/oversight patterns distilled into Structured Reflection Notes are consistent and reusable across images and models. No cross-validation, held-out image sets, or ablation on note specificity is described, leaving open the possibility that the notes encode analysis-set artifacts rather than model-invariant behaviors and that reported gains would not transfer.

minor comments (2)
  1. [Abstract] The relationship between the title's "Reflective Memory" and the body term "Structured Reflection Notes" is not clarified on first use.
  2. [Throughout] Ensure CapArena-Auto is defined or cited at first mention, and that all quantitative claims (e.g., the 21–36% overhead) are accompanied by the exact experimental conditions under which they were measured.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional clarity and evidence are needed to support our claims. We agree that the abstract and experimental sections would benefit from explicit metric definitions, statistical controls, and direct tests of note transferability. We outline our responses and planned revisions below.

read point-by-point responses
  1. Referee: [Abstract] The claims of Pareto-frontier performance and substantial gains on CapArena-Auto rest on unstated details: how factuality and coverage are measured, which baselines are used, whether statistical significance was assessed, the diversity of the image set, and any controls for bias in the multi-agent analysis that produced the notes.

    Authors: We will revise the abstract to include concise definitions: factuality is quantified by an automated hallucination detector against human-annotated ground truth, and coverage is measured by recall of a predefined set of salient visual elements. We will name the main baselines (model scaling variants and prior multi-agent pipelines), report statistical significance via paired t-tests with p-values, characterize the image set as 1,000 images spanning COCO, Flickr30K, and domain-specific high-detail scenes, and note that the multi-agent pipeline uses independent agents with majority voting to mitigate bias. These details will also be expanded in a new 'Evaluation Protocol' subsection. revision: yes

  2. Referee: [Method] The central claim requires that hallucination/oversight patterns distilled into Structured Reflection Notes are consistent and reusable across images and models. No cross-validation, held-out image sets, or ablation on note specificity is described, leaving open the possibility that the notes encode analysis-set artifacts rather than model-invariant behaviors and that reported gains would not transfer.

    Authors: The referee is correct that the current manuscript lacks explicit transfer evidence. We will add a cross-validation protocol in which Structured Reflection Notes are derived from a 500-image analysis subset and evaluated on a disjoint 500-image held-out set across all eight LVLMs. We will further include an ablation varying note specificity (model-general, model-specific, and image-specific variants) and report resulting changes in factuality and coverage. These additions will directly test reusability and rule out analysis-set artifacts; a minimal sketch of this protocol, together with the paired t-test promised above, follows this list. revision: yes
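
A minimal sketch of the proposed check, under stated assumptions: build_notes and score are hypothetical stand-ins for the offline note-distillation pipeline and a per-image combined factuality-coverage score, while the paired significance test uses scipy.stats.ttest_rel, which exists with this signature.

```python
# Sketch of the proposed transfer check: notes are distilled from a 500-image
# analysis subset and evaluated on a disjoint 500-image held-out subset, with
# the promised paired t-test on per-image scores. `build_notes` and `score`
# are hypothetical stand-ins for the offline pipeline and the combined metric.
import random
from scipy import stats

def transfer_check(images, model, build_notes, score, seed=0, alpha=0.05):
    rng = random.Random(seed)
    pool = images[:]
    rng.shuffle(pool)
    analysis, held_out = pool[:500], pool[500:1000]  # disjoint subsets

    notes = build_notes(model, analysis)  # offline phase sees only `analysis`
    with_notes = [score(model, img, notes=notes) for img in held_out]
    without_notes = [score(model, img, notes=None) for img in held_out]

    t_stat, p_value = stats.ttest_rel(with_notes, without_notes)  # paired t-test
    return {"t": t_stat, "p": p_value, "significant": p_value < alpha}
```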

Circularity Check

0 steps flagged

No significant circularity in ReflectCAP's empirical pipeline

full rationale

The paper proposes an empirical method: a multi-agent analysis identifies consistent hallucination/oversight patterns in a target LVLM, distills them into Structured Reflection Notes, and applies the notes at inference to steer caption generation. Factuality and coverage improvements are measured on external benchmarks (CapArena-Auto head-to-head judgments) rather than on quantities defined by the notes themselves. No equations, fitted parameters, or self-citation chains reduce the reported Pareto gains or compute advantages to the input analysis set by construction. The derivation chain is a standard pipeline with independent evaluation, yielding no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on a domain assumption about consistent model error patterns and introduces one new entity. No explicit free parameters are described in the abstract.

axioms (1)
  • domain assumption: Large vision-language models exhibit consistent and identifiable patterns of hallucinations and systematic oversights across images that can be analyzed and distilled into reusable guidelines.
    This assumption is required for the multi-agent analysis to produce notes that generalize beyond the analyzed examples.
invented entities (1)
  • Structured Reflection Notes (no independent evidence)
    purpose: Reusable guidelines distilled from model analysis to steer captioning on what to avoid and what to attend to.
    Newly postulated construct that forms the core of the inference-time guidance mechanism.

pith-pipeline@v0.9.0 · 5510 in / 1474 out tokens · 68137 ms · 2026-05-10T16:05:45.993502+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

    cs.LG · 2026-05 · unverdicted · novelty 5.0

    LiSA adapts AI guardrails lifelong by inducing conservative policies from sparse, noisy failure reports via structured memory, conflict-aware rules, and posterior lower-bound gating.

Reference graph

Works this paper leans on

53 extracted references · 22 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1] Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et al.: Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf 2(3), 8 (2023)

  2. [2] Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al.: Video generation models as world simulators. OpenAI Blog 1(8), 1 (2024)

  3. [3] Cheng, K., Song, W., Fan, J., Ma, Z., Sun, Q., Xu, F., Yan, C., Chen, N., Zhang, J., Chen, J.: CapArena: Benchmarking and analyzing detailed image captioning in the LLM era. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Findings of the Association for Computational Linguistics: ACL 2025. pp. 14077–14094. Association for Computational Lingui...

  4. [4] Chung, J., Kim, J., Kim, S., Lee, J., Kim, M.S., Yu, Y.: v1: Learning to point visual tokens for multimodal grounded reasoning (2026), https://arxiv.org/abs/2505.18842

  5. [5] Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems 36, 49250–49267 (2023)

  6. [6] Favero, A., Zancato, L., Trager, M., Choudhary, S., Perera, P., Achille, A., Swaminathan, A., Soatto, S.: Multi-modal hallucination control by visual information grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14303–14312 (2024)

  7. [7] Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N.A., Ma, W.C., Krishna, R.: BLINK: Multimodal large language models can see but not perceive. In: European Conference on Computer Vision. pp. 148–166. Springer (2024)

  8. [8] Garg, R., Burns, A., Karagol Ayan, B., Bitton, Y., Montgomery, C., Onoe, Y., Bunner, A., Krishna, R., Baldridge, J.M., Soricut, R.: ImageInWords: Unlocking hyper-detailed image descriptions. In: Al-Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 93–127. Association fo...

  9. [9] Gutflaish, E., Kachlon, E., Zisman, H., Hacham, T., Sarid, N., Visheratin, A., Huberman, S., Davidi, G., Bukchin, G., Goldberg, K., et al.: Generating an image from 1,000 words: Enhancing text-to-image with structured captions. arXiv preprint arXiv:2511.06876 (2025)

  10. [10] He, J., Lin, H., Wang, Q., Fung, Y.R., Ji, H.: Self-correction is more than refinement: A learning framework for visual and language reasoning tasks. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 6405–6421 (2025)

  11. [11] Huang, Q., Dong, X., Zhang, P., Wang, B., He, C., Wang, J., Lin, D., Zhang, W., Yu, N.: OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13418–13427 (2024)

  12. [12] Ju, X., Gao, Y., Zhang, Z., Yuan, Z., Wang, X., Zeng, A., Xiong, Y., Xu, Q., Shan, Y.: MiraData: A large-scale video dataset with long durations and structured captions. Advances in Neural Information Processing Systems 37, 48955–48970 (2024)

  13. [13] Kamoi, R., Zhang, Y., Zhang, N., Han, J., Zhang, R.: When can LLMs actually correct their own mistakes? A critical survey of self-correction of LLMs. Transactions of the Association for Computational Linguistics 12, 1417–1440 (2024). https://doi.org/10.1162/tacl_a_00713, https://aclanthology.org/2024.tacl-1.78/

  14. [14] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D.: Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)

  15. [15] Lee, K.i., Kim, M., Yoon, S., Kim, M., Lee, D., Koh, H., Jung, K.: VLind-Bench: Measuring language priors in large vision-language models. In: Chiruzzo, L., Ritter, A., Wang, L. (eds.) Findings of the Association for Computational Linguistics: NAACL 2025. pp. 4129–4144. Association for Computational Linguistics, Albuquerque, New Mexico (Apr 2025). http...

  16. [16] Lee, S., Yoon, S., Bui, T., Shi, J., Yoon, S.: Toward robust hyper-detailed image captioning: A multiagent approach and dual evaluation metrics for factuality and coverage. In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=REnIf3dCsI

  17. [17] Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., Bing, L.: Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13872–13882 (2024)

  18. [18] Li, Z., Shi, H., Gao, Y., Liu, D., Wang, Z., Chen, Y., Liu, T., Zhao, L., Wang, H., Metaxas, D.N.: The hidden life of tokens: Reducing hallucination of large vision-language models via visual information steering. In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=7BKcLeHQsm

  19. [19] Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., Wang, L.: Mitigating hallucination in large multi-modal models via robust instruction tuning. arXiv preprint arXiv:2306.14565 (2023)

  20. [20] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26296–26306 (2024)

  21. [21] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36, 34892–34916 (2023)

  22. [22] Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B.P., Hermann, K., Welleck, S., Yazdanbakhsh, A., Clark, P.: Self-Refine: Iterative refinement with self-feedback (2023), https://arxiv.org/abs/2303.17651

  23. [23] Marsili, D., Mehta, A., Lin, R.Y., Gkioxari, G.: Same or not? Enhancing visual perception in vision-language models. arXiv preprint arXiv:2512.23592 (2025)

  24. [24] Merchant, N., de Ocáriz Borde, H.S., Popescu, A.C., Suarez, C.G.J.: Structured captions improve prompt adherence in text-to-image models (Re-LAION-Caption19M) (2025), https://arxiv.org/abs/2507.05300

  25. [25] Min, K., Kim, M., Lee, K.i., Lee, D., Jung, K.: Mitigating hallucinations in large vision-language models via summary-guided decoding. In: Chiruzzo, L., Ritter, A., Wang, L. (eds.) Findings of the Association for Computational Linguistics: NAACL 2025. pp. 4183–4198. Association for Computational Linguistics, Albuquerque, New Mexico (Apr 2025). https://doi.org/10.18653/v1/2025.findings-naacl.235, https://aclanthology.org/2025.findings-naacl.235/

  27. [27] Onoe, Y., Rane, S., Berger, Z., Bitton, Y., Cho, J., Garg, R., Ku, A., Parekh, Z., Pont-Tuset, J., Tanzer, G., Wang, S., Baldridge, J.: DOCCI: Descriptions of connected and contrasting images (2024), https://arxiv.org/abs/2404.19753

  28. [28] Ouyang, S., Yan, J., Hsu, I.H., Chen, Y., Jiang, K., Wang, Z., Han, R., Le, L.T., Daruki, S., Tang, X., Tirumalashetty, V., Lee, G., Rofouei, M., Lin, H., Han, J., Lee, C.Y., Pfister, T.: ReasoningBank: Scaling agent self-evolving with reasoning memory (2025), https://arxiv.org/abs/2509.25140

  29. [29] Rahmanzadehgervi, P., Bolton, L., Taesiri, M.R., Nguyen, A.T.: Vision language models are blind: Failing to translate detailed visual features into words. arXiv preprint arXiv:2407.06581 (2024)

  30. [30] Rostamzadeh, N., Hosseini, S., Boquet, T., Stokowiec, W., Zhang, Y., Jauvin, C., Pal, C.: Fashion-Gen: The generative fashion dataset and challenge (2018), https://arxiv.org/abs/1806.08317

  31. [31] Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., Yao, S.: Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36, 8634–8652 (2023)

  32. [32] Sun, H.L., Sun, Z., Peng, H., Ye, H.J.: Mitigating visual forgetting via take-along visual conditioning for multi-modal long CoT reasoning. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 5158–5171. Association for Computationa...

  33. [33] Tan, Z., Yan, J., Hsu, I.H., Han, R., Wang, Z., Le, L., Song, Y., Chen, Y., Palangi, H., Lee, G., et al.: In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 8416–8439 (2025)

  34. [34] Urbanek, J., Bordes, F., Astolfi, P., Williamson, M., Sharma, V., Romero-Soriano, A.: A picture is worth more than 77 text tokens: Evaluating CLIP-style models on dense captions (2024), https://arxiv.org/abs/2312.08578

  35. [35] Wan, G., Ling, M., Ren, X., Han, R., Li, S., Zhang, Z.: Compass: Enhancing agent long-horizon reasoning with evolving context (2025), https://arxiv.org/abs/2510.08790

  36. [36] Wang, X., Pan, J., Ding, L., Biemann, C.: Mitigating hallucinations in large vision-language models with instruction contrastive decoding. In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 15840–15853 (2024)

  37. [37] Yanuka, M., Ben-Kish, A., Bitton, Y., Szpektor, I., Giryes, R.: Bridging the visual gap: Fine-tuning multimodal models with knowledge-adapted captions. In: Chiruzzo, L., Ritter, A., Wang, L. (eds.) Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Vo...

  38. [38] Yue, Z., Zhang, L., Jin, Q.: Less is more: Mitigating multimodal hallucination from an EOS decision perspective. In: Ku, L.W., Martins, A., Srikumar, V. (eds.) Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 11766–11781. Association for Computational Linguistics, Bangkok, Thailand (Aug...

  39. [39] Zhang, L., Zeng, X., Li, K., Yu, G., Chen, T.: SC-Captioner: Improving image captioning with self-correction by reinforcement learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23145–23155 (2025)

  40. [40] Zheng, Y., Zhang, R., Zhang, J., Ye, Y., Luo, Z.: LlamaFactory: Unified efficient fine-tuning of 100+ language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). pp. 400–410 (2024)

  41. [41] Zhou, Y., Cui, C., Yoon, J., Zhang, L., Deng, Z., Finn, C., Bansal, M., Yao, H.: Analyzing and mitigating object hallucination in large vision-language models. arXiv preprint arXiv:2310.00754 (2023)

  42. [42] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)

  43. [43] Zhu, L., Ji, D., Chen, T., Xu, P., Ye, J., Liu, J.: IBD: Alleviating hallucinations in large vision-language models via image-biased decoding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1624–1633 (2025)

  44. [44] Zhu, X., Cai, Y., Liu, Z., Zheng, B., Wang, C., Ye, R., Chen, J., Wang, H., Wang, W.C., Zhang, Y., et al.: Toward ultra-long-horizon agentic science: Cognitive accumulation for machine learning engineering. arXiv preprint arXiv:2601.10402 (2026)
