Recognition: unknown
ReflectCAP: Detailed Image Captioning with Reflective Memory
Pith reviewed 2026-05-10 16:05 UTC · model grok-4.3
The pith
ReflectCAP distills patterns of what vision-language models hallucinate or overlook into reusable notes that steer them toward more factual and complete image captions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReflectCAP uses a multi-agent pipeline to analyze consistent hallucination and oversight patterns in a target LVLM, distills the patterns into reusable Structured Reflection Notes, and applies those notes during inference to direct the model along both the avoidance and attention axes, resulting in captions that jointly advance factuality and coverage while lowering compute overhead relative to scaling or other multi-agent baselines.
What carries the argument
Structured Reflection Notes: reusable guidelines distilled from multi-agent analysis of a given LVLM's hallucination and oversight patterns that tell the model what to avoid and what to attend to during caption generation.
If this is right
- ReflectCAP reaches the Pareto frontier of the factuality-coverage trade-off across eight tested LVLMs.
- It produces substantial gains on head-to-head CapArena-Auto evaluations against strong reference models.
- The method achieves higher caption quality at lower compute cost than model scaling.
- It avoids the 21 to 36 percent extra overhead incurred by existing multi-agent pipelines.
Where Pith is reading between the lines
- The same error-pattern distillation process could be applied to other vision-language tasks where models repeatedly hallucinate or omit elements.
- Once created, the notes might transfer across different images or even different base models without retraining.
- This points toward a general strategy of converting observed model weaknesses into lightweight, reusable instructions rather than increasing model size or inference rounds.
Load-bearing premise
The hallucination and oversight patterns found by the multi-agent analysis remain consistent enough across images to be captured in reusable notes that improve results without creating new errors or biases.
What would settle it
Run the notes on a fresh held-out image set and measure that captions produced with the notes show no gain or a loss in combined factuality-coverage scores compared with the same model without the notes.
Figures
read the original abstract
Detailed image captioning demands both factual grounding and fine-grained coverage, yet existing methods have struggled to achieve them simultaneously. We address this tension with Reflective Note-Guided Captioning (ReflectCAP), where a multi-agent pipeline analyzes what the target large vision-language model (LVLM) consistently hallucinates and what it systematically overlooks, distilling these patterns into reusable guidelines called Structured Reflection Notes. At inference time, these notes steer the captioning model along both axes -- what to avoid and what to attend to -- yielding detailed captions that jointly improve factuality and coverage. Applying this method to 8 LVLMs spanning the GPT-4.1 family, Qwen series, and InternVL variants, ReflectCAP reaches the Pareto frontier of the trade-off between factuality and coverage, and delivers substantial gains on CapArena-Auto, where generated captions are judged head-to-head against strong reference models. Moreover, ReflectCAP offers a more favorable trade-off between caption quality and compute cost than model scaling or existing multi-agent pipelines, which incur 21--36\% greater overhead. This makes high-quality detailed captioning viable under real-world cost and latency constraints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ReflectCAP, a method for detailed image captioning that uses a multi-agent pipeline to identify consistent hallucinations and systematic oversights in target LVLMs, distills these patterns into reusable Structured Reflection Notes, and applies the notes at inference time to steer the model toward improved factuality and coverage. Experiments apply the approach to 8 LVLMs across GPT-4.1, Qwen, and InternVL families, claiming Pareto-frontier performance on the factuality-coverage trade-off, substantial gains on the CapArena-Auto head-to-head benchmark, and a superior quality-compute trade-off relative to model scaling or prior multi-agent pipelines (with 21-36% lower overhead).
Significance. If the generalizability of the Structured Reflection Notes holds and the evaluation details are supplied, the work could offer a practical, low-overhead route to higher-quality detailed captions without relying on larger models. The reported compute advantage over scaling and multi-agent baselines would be a meaningful contribution for resource-constrained deployment. At present, however, the absence of metric definitions, statistical controls, and transfer evidence limits the assessed impact.
major comments (2)
- [Abstract] Abstract: the claims of Pareto-frontier performance and substantial gains on CapArena-Auto rest on unstated details of how factuality and coverage are measured, which baselines are used, whether statistical significance was assessed, the diversity of the image set, and any controls for bias in the multi-agent analysis that produced the notes.
- [Method] Method and experimental sections: the central claim requires that hallucination/oversight patterns distilled into Structured Reflection Notes are consistent and reusable across images and models. No cross-validation, held-out image sets, or ablation on note specificity is described, leaving open the possibility that the notes encode analysis-set artifacts rather than model-invariant behaviors and that reported gains would not transfer.
minor comments (2)
- [Abstract] Abstract: the relationship between the title's 'Reflective Memory' and the body term 'Structured Reflection Notes' is not clarified on first use.
- [Throughout] Throughout: ensure CapArena-Auto is defined or cited at first mention and that all quantitative claims (e.g., 21-36% overhead) are accompanied by the exact experimental conditions under which they were measured.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas where additional clarity and evidence are needed to support our claims. We agree that the abstract and experimental sections would benefit from explicit metric definitions, statistical controls, and direct tests of note transferability. We outline our responses and planned revisions below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claims of Pareto-frontier performance and substantial gains on CapArena-Auto rest on unstated details of how factuality and coverage are measured, which baselines are used, whether statistical significance was assessed, the diversity of the image set, and any controls for bias in the multi-agent analysis that produced the notes.
Authors: We will revise the abstract to include concise definitions: factuality is quantified by an automated hallucination detector against human-annotated ground truth, and coverage is measured by recall of a predefined set of salient visual elements. We will name the main baselines (model scaling variants and prior multi-agent pipelines), report statistical significance via paired t-tests with p-values, characterize the image set as 1,000 images spanning COCO, Flickr30K, and domain-specific high-detail scenes, and note that the multi-agent pipeline uses independent agents with majority voting to mitigate bias. These details will also be expanded in a new 'Evaluation Protocol' subsection. revision: yes
-
Referee: [Method] Method and experimental sections: the central claim requires that hallucination/oversight patterns distilled into Structured Reflection Notes are consistent and reusable across images and models. No cross-validation, held-out image sets, or ablation on note specificity is described, leaving open the possibility that the notes encode analysis-set artifacts rather than model-invariant behaviors and that reported gains would not transfer.
Authors: The referee is correct that the current manuscript lacks explicit transfer evidence. We will add a cross-validation protocol in which Structured Reflection Notes are derived from a 500-image analysis subset and evaluated on a disjoint 500-image held-out set across all eight LVLMs. We will further include an ablation varying note specificity (model-general, model-specific, and image-specific variants) and report resulting changes in factuality and coverage. These additions will directly test reusability and rule out analysis-set artifacts. revision: yes
Circularity Check
No significant circularity in ReflectCAP's empirical pipeline
full rationale
The paper proposes an empirical method: a multi-agent analysis identifies consistent hallucination/oversight patterns in a target LVLM, distills them into Structured Reflection Notes, and applies the notes at inference to steer caption generation. Factuality and coverage improvements are measured on external benchmarks (CapArena-Auto head-to-head judgments) rather than on quantities defined by the notes themselves. No equations, fitted parameters, or self-citation chains reduce the reported Pareto gains or compute advantages to the input analysis set by construction. The derivation chain is a standard pipeline with independent evaluation, yielding no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Large vision-language models exhibit consistent and identifiable patterns of hallucinations and systematic oversights across images that can be analyzed and distilled into reusable guidelines.
invented entities (1)
-
Structured Reflection Notes
no independent evidence
Forward citations
Cited by 1 Pith paper
-
LiSA: Lifelong Safety Adaptation via Conservative Policy Induction
LiSA improves AI guardrails lifelong by inducing conservative policies from sparse noisy failure reports via structured memory, conflict-aware rules, and posterior lower-bound gating.
Reference graph
Works this paper leans on
-
[1]
https://cdn
Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee,J.,Guo,Y.,etal.:Improvingimagegenerationwithbettercaptions.Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf2(3), 8 (2023)
2023
-
[2]
OpenAI Blog1(8), 1 (2024)
Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al.: Video generation models as world simulators. OpenAI Blog1(8), 1 (2024)
2024
-
[3]
In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T
Cheng, K., Song, W., Fan, J., Ma, Z., Sun, Q., Xu, F., Yan, C., Chen, N., Zhang, J., Chen, J.: CapArena: Benchmarking and analyzing detailed image captioning in the LLM era. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Findings of the Association for Computational Linguistics: ACL 2025. pp. 14077–14094. Association for Computational Lingui...
-
[4]
Chung, J., Kim, J., Kim, S., Lee, J., Kim, M.S., Yu, Y.: v1: Learning to point visual tokens for multimodal grounded reasoning (2026),https://arxiv.org/abs/2505. 18842
2026
-
[5]
Advances in neural information processing systems36, 49250–49267 (2023)
Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems36, 49250–49267 (2023)
2023
-
[6]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Favero, A., Zancato, L., Trager, M., Choudhary, S., Perera, P., Achille, A., Swami- nathan, A., Soatto, S.: Multi-modal hallucination control by visual information grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14303–14312 (2024)
2024
-
[7]
In: European Conference on Computer Vision
Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N.A., Ma, W.C., Krishna, R.: Blink: Multimodal large language models can see but not per- ceive. In: European Conference on Computer Vision. pp. 148–166. Springer (2024)
2024
-
[8]
In: Al-Onaizan, Y., Bansal, M., Chen, Y.N
Garg, R., Burns, A., Karagol Ayan, B., Bitton, Y., Montgomery, C., Onoe, Y., Bunner, A., Krishna, R., Baldridge, J.M., Soricut, R.: ImageInWords: Unlocking hyper-detailed image descriptions. In: Al-Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Lan- guage Processing. pp. 93–127. Association fo...
-
[9]
Generating an image from 1,000 words: Enhancing text-to-image with structured captions, 2025
Gutflaish, E., Kachlon, E., Zisman, H., Hacham, T., Sarid, N., Visheratin, A., Huberman, S., Davidi, G., Bukchin, G., Goldberg, K., et al.: Generating an im- age from 1,000 words: Enhancing text-to-image with structured captions. arXiv preprint arXiv:2511.06876 (2025)
-
[10]
In: Findings of the Association for Computational Linguistics: ACL 2025
He, J., Lin, H., Wang, Q., Fung, Y.R., Ji, H.: Self-correction is more than refine- ment: A learning framework for visual and language reasoning tasks. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 6405–6421 (2025)
2025
-
[11]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Huang, Q., Dong, X., Zhang, P., Wang, B., He, C., Wang, J., Lin, D., Zhang, W., Yu, N.: Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13418–13427 (2024)
2024
-
[12]
Advances in Neural Information Processing Systems37, 48955–48970 (2024) 16 Min et al
Ju, X., Gao, Y., Zhang, Z., Yuan, Z., Wang, X., Zeng, A., Xiong, Y., Xu, Q., Shan, Y.: Miradata: A large-scale video dataset with long durations and structured captions. Advances in Neural Information Processing Systems37, 48955–48970 (2024) 16 Min et al
2024
-
[13]
When can LLMs actually correct their own mistakes? A survey of self-correction
Kamoi, R., Zhang, Y., Zhang, N., Han, J., Zhang, R.: When can LLMs actually cor- rect their own mistakes? a critical survey of self-correction of LLMs. Transactions of the Association for Computational Linguistics12, 1417–1440 (2024).https:// doi.org/10.1162/tacl_a_00713,https://aclanthology.org/2024.tacl-1.78/
-
[14]
Scaling Laws for Neural Language Models
Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D.: Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[15]
Lee, K.i., Kim, M., Yoon, S., Kim, M., Lee, D., Koh, H., Jung, K.: VLind-bench: Measuring language priors in large vision-language models. In: Chiruzzo, L., Rit- ter, A., Wang, L. (eds.) Findings of the Association for Computational Linguis- tics: NAACL 2025. pp. 4129–4144. Association for Computational Linguistics, Albuquerque, New Mexico (Apr 2025).http...
-
[16]
In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=REnIf3dCsI
Lee, S., Yoon, S., Bui, T., Shi, J., Yoon, S.: Toward robust hyper-detailed image captioning: A multiagent approach and dual evaluation metrics for factuality and coverage. In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=REnIf3dCsI
2025
-
[17]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., Bing, L.: Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13872–13882 (2024)
2024
-
[18]
In: Forty-second International Conference on Machine Learning (2025),https://openreview.net/forum?id= 7BKcLeHQsm
Li, Z., Shi, H., Gao, Y., Liu, D., Wang, Z., Chen, Y., Liu, T., Zhao, L., Wang, H., Metaxas, D.N.: The hidden life of tokens: Reducing hallucination of large vision- language models via visual information steering. In: Forty-second International Conference on Machine Learning (2025),https://openreview.net/forum?id= 7BKcLeHQsm
2025
-
[19]
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., Wang, L.: Mitigating hallucina- tion in large multi-modal models via robust instruction tuning. arXiv preprint arXiv:2306.14565 (2023)
work page internal anchor Pith review arXiv 2023
-
[20]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tun- ing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024)
2024
-
[21]
Advances in neural information processing systems36, 34892–34916 (2023)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)
2023
-
[22]
Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B.P., Hermann, K., Welleck, S., Yazdanbakhsh, A., Clark, P.: Self-refine: Iterative refinement with self-feedback (2023),https://arxiv.org/abs/2303.17651
work page internal anchor Pith review arXiv 2023
-
[23]
Marsili, D., Mehta, A., Lin, R.Y., Gkioxari, G.: Same or not? enhancing visual perception in vision-language models. arXiv preprint arXiv:2512.23592 (2025)
- [24]
-
[25]
In: Chiruzzo, L., Ritter, A., Wang,L.(eds.)FindingsoftheAssociationforComputationalLinguistics:NAACL
Min, K., Kim, M., Lee, K.i., Lee, D., Jung, K.: Mitigating hallucinations in large vision-language models via summary-guided decoding. In: Chiruzzo, L., Ritter, A., Wang,L.(eds.)FindingsoftheAssociationforComputationalLinguistics:NAACL
-
[26]
pp. 4183–4198. Association for Computational Linguistics, Albuquerque, New Mexico (Apr 2025).https://doi.org/10.18653/v1/2025.findings-naacl. 235,https://aclanthology.org/2025.findings-naacl.235/ ReflectCAP: Detailed Image Captioning with Reflective Memory 17
- [27]
-
[28]
Ouyang, S., Yan, J., Hsu, I.H., Chen, Y., Jiang, K., Wang, Z., Han, R., Le, L.T., Daruki, S., Tang, X., Tirumalashetty, V., Lee, G., Rofouei, M., Lin, H., Han, J., Lee, C.Y., Pfister, T.: Reasoningbank: Scaling agent self-evolving with reasoning memory (2025),https://arxiv.org/abs/2509.25140
work page internal anchor Pith review arXiv 2025
-
[29]
Rahmanzadehgervi, P., Bolton, L., Taesiri, M.R., Nguyen, A.T.: Vision language models are blind: Failing to translate detailed visual features into words. arXiv preprint arXiv:2407.06581 (2024)
- [30]
-
[31]
Advances in neural information processing systems36, 8634–8652 (2023)
Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., Yao, S.: Reflexion: Lan- guage agents with verbal reinforcement learning. Advances in neural information processing systems36, 8634–8652 (2023)
2023
-
[32]
Sun, H.L., Sun, Z., Peng, H., Ye, H.J.: Mitigating visual forgetting via take-along visual conditioning for multi-modal long CoT reasoning. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 5158–5171. Association for Computationa...
-
[33]
Tan, Z., Yan, J., Hsu, I.H., Han, R., Wang, Z., Le, L., Song, Y., Chen, Y., Palangi, H., Lee, G., et al.: In prospect and retrospect: Reflective memory management for long-termpersonalizeddialogueagents.In:Proceedingsofthe63rdAnnualMeeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 8416–8439 (2025)
2025
- [34]
- [35]
-
[36]
In: Findings of the Asso- ciation for Computational Linguistics: ACL 2024
Wang, X., Pan, J., Ding, L., Biemann, C.: Mitigating hallucinations in large vision- language models with instruction contrastive decoding. In: Findings of the Asso- ciation for Computational Linguistics: ACL 2024. pp. 15840–15853 (2024)
2024
-
[37]
In: Chiruzzo, L., Ritter, A., Wang, L
Yanuka, M., Ben-Kish, A., Bitton, Y., Szpektor, I., Giryes, R.: Bridging the visual gap: Fine-tuning multimodal models with knowledge-adapted captions. In: Chiruzzo, L., Ritter, A., Wang, L. (eds.) Proceedings of the 2025 Confer- ence of the Nations of the Americas Chapter of the Association for Compu- tational Linguistics: Human Language Technologies (Vo...
-
[38]
In: Ku, L.W., Martins, A., Srikumar, V
Yue, Z., Zhang, L., Jin, Q.: Less is more: Mitigating multimodal hallucination from an EOS decision perspective. In: Ku, L.W., Martins, A., Srikumar, V. (eds.) Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 11766–11781. Association for Computa- tional Linguistics, Bangkok, Thailand (Aug...
2024
-
[39]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Zhang, L., Zeng, X., Li, K., Yu, G., Chen, T.: Sc-captioner: Improving image captioning with self-correction by reinforcement learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23145–23155 (2025)
2025
-
[40]
In: Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 3: system demonstrations)
Zheng, Y., Zhang, R., Zhang, J., Ye, Y., Luo, Z.: Llamafactory: Unified efficient fine-tuning of 100+ language models. In: Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 3: system demonstrations). pp. 400–410 (2024)
2024
-
[41]
Analyzing and mitigating object hallucination in large vision-language models,
Zhou, Y., Cui, C., Yoon, J., Zhang, L., Deng, Z., Finn, C., Bansal, M., Yao, H.: Analyzing and mitigating object hallucination in large vision-language models. arXiv preprint arXiv:2310.00754 (2023)
-
[42]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision- language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
work page internal anchor Pith review arXiv 2023
-
[43]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Zhu, L., Ji, D., Chen, T., Xu, P., Ye, J., Liu, J.: Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1624–1633 (2025)
2025
-
[44]
Zhu, X., Cai, Y., Liu, Z., Zheng, B., Wang, C., Ye, R., Chen, J., Wang, H., Wang, W.C., Zhang, Y., et al.: Toward ultra-long-horizon agentic science: Cognitive accu- mulation for machine learning engineering. arXiv preprint arXiv:2601.10402 (2026) ReflectCAP: Detailed Image Captioning with Reflective Memory 19 Appendix A Supervised Fine-tuning Details We ...
-
[45]
IdentifyHALLUCINATIONS: details in the generated caption that are WRONGorNOT visiblein the image
-
[46]
Error Notes
IdentifyMISSING DETAILS: important details in the reference caption that areMISSINGfrom the generated caption. For each issue, provide: (1) what the issue is, (2) why it’s problematic, (3) a simple rule to avoid/fix it. Output format: Hallucinations: - issue 1, - issue 2, ... Missing Details: - issue 1, - issue 2, ... If no issues are found in a category,...
-
[47]
Review new issues from this batch
-
[48]
Update the error notes by adding new issues, merging similar ones, summa- rizing into general rules, and keepingmaximumkitemsper category
-
[49]
Avoid speculative or inferred details about materials, styles, or dates,
Each item should besimple and compact(one line). Output format: [Hallucination - Avoid These]: - item 1, - item 2, ... (max k) [Missing Detail - Include These]: - item 1, - item 2, ... (maxk) ReflectCAP: Detailed Image Captioning with Reflective Memory 21 User: Current Error Notes: {current_notes} New Issues from Batch: {batch_issues} Update the Error Not...
-
[50]
They are wearing a black leather jacket over a gray hooded sweatshirt
Zero-shot Caption “The image shows a person standing with their arms crossed. They are wearing a black leather jacket over a gray hooded sweatshirt. The jacket has multiple zippers and buttons, giving it a rugged look. The person is also wearing black, ripped skinny jeans and black shoes.”
-
[51]
Include all key garment design features and construction details (collars, clo- sures, pockets, cuffs)
ReflectCAP Caption “The image shows a person standing with their arms crossed, dressed in a black leather biker jacket with a classic asymmetrical front zipper closure over a grey hooded sweatshirt with a visible hood and draw- strings. The jacket features silver zippers, a wide lapel collar with snap-button details on the collar points, shoulder epaulets...
-
[52]
The jacket has a classic lapel, two front but- tons (of which the top one is buttoned), and flap pockets on each side
Zero-shot Caption “The image shows a person wearing a tailored black suit jacket over a white button-up shirt. The jacket has a classic lapel, two front but- tons (of which the top one is buttoned), and flap pockets on each side. Underneath the jacket, the person is wearing an additional black gar- ment that resembles a wrap-around skirt or apron with two...
-
[53]
Include all key garment design features and construction details (collars, clo- sures, pockets)
ReflectCAP Caption “The image depicts a person wearing a formal outfit consisting of a black tailored blazer with a classic notch lapel collar and a two-button closure over a white dress shirt. The blazer features flap pockets on both sides, a welt chest pocket on the left, and long sleeves with no visible buttons or cuff details. The blazer has a smooth,...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.