Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task
Pith reviewed 2026-05-21 05:28 UTC · model grok-4.3
The pith
A two-stage pipeline that pivots through Spanish intermediate captions and then applies retrieval-augmented many-shot prompting improves on the shared-task baseline by more than 120 percent (relative) on the dev set for Indigenous-language image captioning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that an intermediate Spanish caption generated by a vision-language model, followed by retrieval-augmented many-shot translation into the target Indigenous language, produces captions that substantially outperform the shared-task baseline on automatic metrics and secure first place in the overall competition.
What carries the argument
Retrieval-augmented many-shot prompting from the Spanish pivot caption, which draws relevant in-domain examples to guide culturally appropriate generation in the low-resource target language.
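The retrieval-and-prompt step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the word-overlap retriever, the prompt wording, the function names (`overlap_score`, `retrieve`, `build_prompt`), and the placeholder corpus are all assumptions introduced here, and the real system sends the resulting prompt to Gemini 2.5 Flash, which is omitted.

```python
from collections import Counter


def overlap_score(query: str, candidate: str) -> int:
    """Count shared lowercase word tokens (a stand-in for a real retriever)."""
    q = Counter(query.lower().split())
    c = Counter(candidate.lower().split())
    return sum((q & c).values())


def retrieve(pivot: str, corpus: list[tuple[str, str]], k: int) -> list[tuple[str, str]]:
    """Return the k (Spanish, target) pairs whose Spanish side best matches the pivot."""
    ranked = sorted(corpus, key=lambda pair: overlap_score(pivot, pair[0]), reverse=True)
    return ranked[:k]


def build_prompt(pivot: str, shots: list[tuple[str, str]], target_language: str) -> str:
    """Format retrieved pairs as many-shot examples, ending with the new pivot caption."""
    lines = [f"Translate each Spanish caption into {target_language}."]
    for spanish, target in shots:
        lines.append(f"Spanish: {spanish}")
        lines.append(f"{target_language}: {target}")
    lines.append(f"Spanish: {pivot}")
    lines.append(f"{target_language}:")
    return "\n".join(lines)


# Placeholder corpus: the target side is dummy text, not real Bribri.
corpus = [
    ("una mujer teje una canasta", "<target caption 1>"),
    ("un río cruza la montaña", "<target caption 2>"),
    ("una canasta llena de maíz", "<target caption 3>"),
]
pivot = "una mujer sostiene una canasta"
shots = retrieve(pivot, corpus, k=2)
prompt = build_prompt(pivot, shots, "Bribri")
```

In this toy setup the two basket-related pairs are retrieved ahead of the unrelated river caption, which is the behaviour the "in-domain examples" argument relies on: the retrieved shots only help if the corpus contains captions close to the pivot.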
If this is right
- Retrieval augmentation improves results only when large, in-domain corpora exist for the target language.
- Synthetic data augmentation contributes roughly 28 chrF++ points to the Guaraní dev-set gains.
- The system maintains over 150 percent relative improvement on Bribri and Orizaba Nahuatl test sets.
- Automatic-metric wins do not guarantee top human-evaluation rank among finalists.
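To make the percentage figures concrete, the sketch below shows the arithmetic behind a relative improvement, assuming the reported numbers are computed as 100 × (system − baseline) / baseline. The helper names and the baseline score of 10.0 are hypothetical; the summary shown here does not give absolute baseline scores.

```python
def relative_improvement(system: float, baseline: float) -> float:
    """Percent gain of system over baseline: 100 * (system - baseline) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (system - baseline) / baseline


def score_ratio(relative_gain_percent: float) -> float:
    """How many times the baseline score a given relative gain corresponds to."""
    return 1.0 + relative_gain_percent / 100.0


# Dev-set relative gains reported in the abstract.
reported = {"Bribri": 164.1, "Guaraní": 131.7, "Orizaba Nahuatl": 122.6}
ratios = {lang: score_ratio(gain) for lang, gain in reported.items()}

# Hypothetical baseline of 10.0 chrF++ (NOT from the paper), for illustration only.
example_gain = relative_improvement(system=26.41, baseline=10.0)
```

Under this reading, a 164.1 percent gain means the system scores about 2.64 times the baseline, and any gain above 100 percent means the score more than doubled. An absolute contribution such as "28 chrF++ points" is a different kind of quantity and cannot be converted to a relative gain without the baseline score.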
Where Pith is reading between the lines
- The pivot-through-Spanish strategy may transfer to other low-resource language pairs that lack native vision models.
- The language-dependent nature of retrieval suggests future work should prioritize corpus size and domain match before applying the technique.
- If the Spanish pivot proves robust, the same two-stage structure could support image-based cultural knowledge bases in additional Indigenous languages.
Load-bearing premise
The Spanish intermediate captions produced by Qwen2.5-VL are sufficiently accurate and culturally neutral to serve as a reliable pivot for the subsequent retrieval-augmented translation step.
What would settle it
A direct comparison in which the Spanish captions are replaced by noisy or culturally biased alternatives, or in which the retrieval component is removed entirely, would show whether the reported metric gains disappear.
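One way to set up that comparison is sketched below, under stated assumptions: the condition names, the `drop_words` corruption (random word deletion as a crude proxy for a noisy pivot), and the shot count are all hypothetical choices made for illustration, and the actual captioning and scoring calls are left out.

```python
import random


def drop_words(caption: str, drop_prob: float, rng: random.Random) -> str:
    """Randomly delete words to simulate a degraded Spanish pivot caption."""
    words = caption.split()
    kept = [w for w in words if rng.random() >= drop_prob]
    # Keep at least the first word so the pivot is never empty.
    return " ".join(kept or words[:1])


def make_conditions(k_shots: int) -> list[dict]:
    """Enumerate the ablation arms: full system, no retrieval, noisy pivot."""
    return [
        {"name": "full", "noisy_pivot": False, "k": k_shots},
        {"name": "no_retrieval", "noisy_pivot": False, "k": 0},
        {"name": "noisy_pivot", "noisy_pivot": True, "k": k_shots},
    ]


def prepare_pivot(caption: str, condition: dict, rng: random.Random) -> str:
    """Apply the corruption only in arms that ask for a noisy pivot."""
    if condition["noisy_pivot"]:
        return drop_words(caption, drop_prob=0.3, rng=rng)
    return caption


rng = random.Random(0)  # fixed seed so the corruption is reproducible
conditions = make_conditions(k_shots=50)  # 50 is an arbitrary example value
pivots = {c["name"]: prepare_pivot("una mujer teje una canasta", c, rng) for c in conditions}
```

Scoring each arm with the same metric on the same dev set would show how much of the gain depends on retrieval and how much on pivot quality. Random word deletion does not model cultural bias, so a bias-specific arm would need hand-constructed captions.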
Original abstract
We present the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. Our two-stage pipeline generates a Spanish intermediate caption with Qwen2.5-VL, then produces the target-language caption using retrieval-augmented many-shot prompting with Gemini 2.5 Flash. We achieve 164.1%, 131.7%, and 122.6% improvements over the shared task baseline for Bribri, Guaraní, and Orizaba Nahuatl captioning, respectively, in our dev set evaluation and maintain >150% improvements for the Bribri and Orizaba Nahuatl languages in the test set evaluation. We find retrieval is highly language-dependent, beneficial only for large, in-domain corpora, and that synthetic data augmentation accounts for around 28 chrF++ of the dev set Guaraní performance gain. Our submission is the overall winner of the shared task, placing second out of five finalist submissions in human evaluations of target-language captions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. It describes a two-stage pipeline that first generates Spanish intermediate captions using Qwen2.5-VL and then produces target-language captions (Bribri, Guaraní, Orizaba Nahuatl) via retrieval-augmented many-shot prompting with Gemini 2.5 Flash. The authors report relative chrF++ gains of 164.1%, 131.7%, and 122.6% over the shared-task baseline on the dev set, sustained >150% gains on test for two languages, and note that synthetic data augmentation accounts for ~28 chrF++ on Guaraní; their system won the shared task overall and placed second in human evaluations.
Significance. If the results hold, the work provides concrete evidence that retrieval-augmented translation combined with synthetic data can yield large gains on low-resource Indigenous language image captioning. The language-dependent retrieval findings and the quantified synthetic-data contribution are useful for practitioners. The shared-task win and human-evaluation ranking add practical weight, though fuller isolation of the vision component would strengthen claims about cultural fidelity.
major comments (1)
- [§3] §3 (Pipeline Description): No automatic metrics, human ratings, or error analysis are reported for the Spanish intermediate captions generated by Qwen2.5-VL. Because these captions are treated as the culturally accurate pivot for the subsequent retrieval-augmented translation step, the absence of validation leaves the source of the headline relative improvements (e.g., 164.1% for Bribri on dev) unisolated and risks conflating vision-model fidelity with LLM priors or prompting effects.
minor comments (2)
- [Abstract] Abstract: The string 'Guaraní' appears with LaTeX escaping; ensure consistent Unicode rendering throughout the manuscript.
- [§4] §4 (Ablations): The retrieval-corpus construction details (size, domain filtering, and exact selection criteria) are only summarized; adding a short table or paragraph would improve reproducibility of the language-dependent findings.
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive feedback on our submission to the AmericasNLP 2026 shared task. We address the major comment point by point below and have incorporated revisions to strengthen the manuscript.
- Referee: [§3] §3 (Pipeline Description): No automatic metrics, human ratings, or error analysis are reported for the Spanish intermediate captions generated by Qwen2.5-VL. Because these captions are treated as the culturally accurate pivot for the subsequent retrieval-augmented translation step, the absence of validation leaves the source of the headline relative improvements (e.g., 164.1% for Bribri on dev) unisolated and risks conflating vision-model fidelity with LLM priors or prompting effects.
  Authors: We agree that validating the quality of the Spanish intermediate captions is important for isolating the contributions of each stage in our pipeline. In the revised manuscript, we have added a new subsection in §3 that reports automatic metrics (chrF++ and BLEU) for the Qwen2.5-VL generated Spanish captions against available reference Spanish captions from the dataset. Additionally, we include a brief error analysis highlighting common issues such as cultural nuances missed in the vision-to-text step. This revision helps clarify that the large gains in target languages stem from both the accurate Spanish pivot and the retrieval-augmented translation. We note, however, that the primary focus of the shared task and our evaluation remains on the Indigenous target languages, where human evaluations further support the overall pipeline effectiveness.
  Revision: yes
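For readers unfamiliar with the metric named in the rebuttal, the sketch below shows a simplified sentence-level character n-gram F-score in the style of chrF. It is an illustration only: it averages character n-gram precision and recall for n = 1..6 and combines them with β = 2, but it omits the word n-gram component that distinguishes chrF++, so its numbers will not match an established implementation such as sacreBLEU. The function names are introduced here.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Multiset of character n-grams, with whitespace removed."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-beta over averaged character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        matches = sum((hyp & ref).values())
        precisions.append(matches / sum(hyp.values()))
        recalls.append(matches / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100.0 * (1 + beta**2) * p * r / (beta**2 * p + r)


identical = simple_chrf("una mujer teje", "una mujer teje")
partial = simple_chrf("una mujer teje", "una mujer canta")
disjoint = simple_chrf("abc", "xyz")
```

Because the score works on characters rather than whole words, a caption that gets most morphemes right still earns partial credit, which is one reason character-level metrics are commonly used for morphologically rich, low-resource languages.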
Circularity Check
No circularity: empirical pipeline measured against external shared-task baseline
The paper presents a practical two-stage system for the AmericasNLP 2026 shared task: Qwen2.5-VL generates Spanish image captions, followed by retrieval-augmented many-shot translation into target Indigenous languages using Gemini. Reported gains (e.g., 164.1% relative chrF++ on Bribri dev) are direct comparisons to the external shared-task baseline on held-out dev and test sets. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation chain. The method description and ablation notes (e.g., synthetic augmentation contributing ~28 chrF++ on Guaraní) remain independent of the final performance numbers, which are externally benchmarked. This is a standard empirical submission paper with no internal reductions of outputs to inputs.
Axiom & Free-Parameter Ledger
free parameters (4)
- Selection of Qwen2.5-VL for Spanish captioning
- Selection of Gemini 2.5 Flash for target-language generation
- Retrieval corpus construction and size
- Number of retrieved shots and prompt formatting
axioms (2)
- domain assumption: Vision-language models produce usable Spanish descriptions of culturally relevant images
- domain assumption: Large language models can translate or generate target-language captions when given retrieved in-domain examples
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Our two-stage pipeline generates a Spanish intermediate caption with Qwen2.5-VL, then produces the target-language caption using retrieval-augmented many-shot prompting with Gemini 2.5 Flash."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We achieve 164.1%, 131.7%, and 122.6% improvements over the shared task baseline"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.