#PraCegoVer: A Large Dataset for Image Captioning in Portuguese
Pith reviewed 2026-05-24 13:41 UTC · model grok-4.3
The pith
The #PraCegoVer dataset supplies the first large collection of Portuguese captions for image captioning drawn from Instagram posts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images. The captions introduce additional challenges: only one reference sentence per image and both higher mean length and higher variance than the MS COCO Captions dataset.
What carries the argument
The #PraCegoVer dataset, a collection of Instagram images each paired with one user-written Portuguese caption collected via the accessibility hashtag.
If this is right
- Image captioning models can now be trained and tested directly in Portuguese.
- The single-reference format requires evaluation methods that do not assume multiple ground-truth sentences.
- The longer and more variable captions test a model's ability to produce detailed rather than terse descriptions.
- The public release allows immediate use by researchers working on accessibility tools for Portuguese speakers.
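The single-reference point can be made concrete with a minimal sketch of clipped n-gram precision, the building block of BLEU-style metrics. The function name and the Portuguese sentences below are invented for illustration; they are not taken from the dataset or from the paper's evaluation code.

```python
from collections import Counter


def clipped_ngram_precision(candidate, references, n=1):
    """Fraction of candidate n-grams found in the references.

    Each candidate n-gram count is clipped by the largest count of that
    n-gram in any single reference, as in BLEU's modified precision.
    """
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate.lower().split())
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split()).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())


# Invented example sentences (assumption, not dataset content).
candidate = "um cachorro corre na praia"
ref_a = "um cão corre na areia"
ref_b = "cachorro marrom na praia"

single = clipped_ngram_precision(candidate, [ref_a])
multi = clipped_ngram_precision(candidate, [ref_a, ref_b])
```

With only `ref_a` available, the words "cachorro" and "praia" count as errors even though they are valid; adding `ref_b` credits them. A single-reference dataset offers no such second chance, which is why overlap metrics may understate quality here.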
Where Pith is reading between the lines
- Training on these captions may improve robustness when models later encounter real social-media photographs rather than curated studio images.
- The same hashtag-driven collection method could be repeated for other languages that lack large caption datasets.
- Because each image has only one caption, new loss functions or ranking metrics may be needed to handle the resulting label noise.
Load-bearing premise
Captions written by Instagram users for the #PraCegoVer tag accurately and usefully describe the visual content of the attached images.
What would settle it
Manual inspection or automated analysis revealing that a large fraction of the collected captions fail to mention the main objects or actions visible in their images.
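One way such a check could be run is sketched below: draw a reproducible random sample of image-caption pairs for manual review, then put a Wilson score interval around the observed failure rate. The collection size, sample size, failure count, and function names are all hypothetical; the paper does not describe this procedure.

```python
import math
import random


def draw_audit_sample(num_items, sample_size, seed=0):
    """Pick distinct item indices for manual review, reproducibly."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_items), sample_size))


def wilson_interval(failures, sample_size, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = failures / sample_size
    denom = 1 + z ** 2 / sample_size
    centre = (p + z ** 2 / (2 * sample_size)) / denom
    half = z * math.sqrt(
        p * (1 - p) / sample_size + z ** 2 / (4 * sample_size ** 2)
    ) / denom
    return centre - half, centre + half


# Hypothetical numbers: 500 pairs sampled from an assumed 600,000,
# of which reviewers flag 50 captions as not describing the image.
indices = draw_audit_sample(num_items=600_000, sample_size=500)
low, high = wilson_interval(failures=50, sample_size=500)
```

If the upper bound of the interval stayed small, the load-bearing premise would hold up; a high lower bound would count against it.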
Original abstract
Automatically describing images using natural sentences is an important task to support visually impaired people's inclusion onto the Internet. It is still a big challenge that requires understanding the relation of the objects present in the image and their attributes and actions they are involved in. Then, visual interpretation methods are needed, but linguistic models are also necessary to verbally describe the semantic relations. This problem is known as Image Captioning. Although many datasets were proposed in the literature, the majority contains only English captions, whereas datasets with captions described in other languages are scarce. Recently, a movement called PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Thus, inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images. Further, the captions in our dataset bring additional challenges to the problem: first, in contrast to popular datasets such as MS COCO Captions, #PraCegoVer has only one reference to each image; also, both mean and variance of our reference sentence length are significantly greater than those in the MS COCO Captions. These two characteristics contribute to making our dataset interesting due to the linguistic aspect and the challenges that it introduces to the image captioning problem. We publicly-share the dataset at https://github.com/gabrielsantosrv/PraCegoVer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces #PraCegoVer, a multimodal dataset of Instagram images paired with Portuguese captions scraped from posts tagged #PraCegoVer. It claims to be the first large-scale resource for image captioning in Portuguese, emphasizing that each image has a single reference caption whose mean length and variance exceed those in MS COCO Captions, thereby introducing new linguistic challenges.
Significance. If the user-generated captions can be shown to be accurate and relevant image descriptions, the dataset would address a clear scarcity of non-English resources and support development of captioning models for Portuguese, with potential benefits for accessibility applications.
major comments (2)
- [Data collection and dataset construction] The data collection process (described in the abstract and implied in the methods): the manuscript provides no automated filtering, manual review, inter-annotator agreement metrics, or sample-based quality assessment of the scraped captions. Because the central claim is that #PraCegoVer constitutes a usable training resource, the absence of evidence that captions are not vague, off-topic, or mismatched directly undermines the assertion that the dataset is suitable for image captioning models.
- [Abstract] Abstract: the statement that the captions 'bring additional challenges' due to single references and greater length/variance is presented without accompanying statistics (e.g., exact mean and standard deviation values, or a table comparing distributions to MS COCO), making it impossible to evaluate the claimed linguistic differences that are positioned as a contribution.
minor comments (1)
- [Abstract] The abstract contains a minor grammatical issue ('we have proposed the #PraCegoVer, a multi-modal dataset') that should be revised for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to the manuscript.
Point-by-point responses
- Referee: [Data collection and dataset construction] The data collection process (described in the abstract and implied in the methods): the manuscript provides no automated filtering, manual review, inter-annotator agreement metrics, or sample-based quality assessment of the scraped captions. Because the central claim is that #PraCegoVer constitutes a usable training resource, the absence of evidence that captions are not vague, off-topic, or mismatched directly undermines the assertion that the dataset is suitable for image captioning models.
  Authors: The captions originate from real user posts in the #PraCegoVer accessibility movement and are therefore inherently variable and potentially noisy, which is a deliberate characteristic of the resource rather than a flaw. Comprehensive manual review or automated filtering was not performed owing to the dataset scale (over 600k images). Inter-annotator agreement is inapplicable because each image has exactly one caption by construction. We will add a dedicated subsection with randomly sampled captions, basic automated statistics (e.g., presence of common visual terms), and an explicit limitations paragraph discussing caption quality. This will allow readers to assess suitability for their use cases.
  revision: partial
- Referee: [Abstract] Abstract: the statement that the captions 'bring additional challenges' due to single references and greater length/variance is presented without accompanying statistics (e.g., exact mean and standard deviation values, or a table comparing distributions to MS COCO), making it impossible to evaluate the claimed linguistic differences that are positioned as a contribution.
  Authors: We agree that the abstract would benefit from explicit numerical support. The full manuscript already contains length-distribution analysis and a comparison to MS COCO; we will move the key mean and standard-deviation figures into the abstract itself and ensure a compact comparison table appears in the main text or supplementary material.
  revision: yes
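The statistics discussed in these two responses are cheap to compute. The sketch below shows one possible form: the mean and sample standard deviation of caption length in words, and the share of captions containing at least one term from a visual vocabulary. The term list, the captions, and the function names are assumptions made for illustration; the real figures would have to come from the released dataset.

```python
import re
import statistics

# Hypothetical visual vocabulary; a real audit would need a curated list.
VISUAL_TERMS = {"foto", "imagem", "fundo", "mulher", "homem", "azul", "branco"}


def tokenize(caption):
    """Lowercase a caption and split it into word tokens."""
    return re.findall(r"\w+", caption.lower())


def length_stats(captions):
    """Return (mean, sample standard deviation) of caption length in words."""
    lengths = [len(tokenize(c)) for c in captions]
    return statistics.mean(lengths), statistics.stdev(lengths)


def visual_term_coverage(captions, terms=VISUAL_TERMS):
    """Share of captions that mention at least one term from `terms`."""
    if not captions:
        return 0.0
    hits = sum(1 for c in captions if terms.intersection(tokenize(c)))
    return hits / len(captions)


# Invented captions (assumption, not dataset content).
captions = [
    "Foto de uma mulher sorrindo em fundo azul.",
    "Bom dia a todos!",
    "Imagem: gato branco no sofá.",
]
mean_len, std_len = length_stats(captions)
coverage = visual_term_coverage(captions)
```

Running the same two functions over MS COCO Captions would give the side-by-side comparison the referee asks for, provided both corpora are tokenized the same way.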
Circularity Check
No circularity: dataset release paper with no derivations or fitted quantities
The paper describes collection and release of the #PraCegoVer Instagram-derived Portuguese caption dataset. No equations, predictions, parameters, or first-principles derivations are present; the central claim is simply that the scraped posts constitute a usable resource. No load-bearing step reduces to a self-definition, fitted input, or self-citation chain. This is the expected non-finding for a pure data paper.