
arxiv: 2103.11474 · v2 · submitted 2021-03-21 · 💻 cs.CV · cs.CL

#PraCegoVer: A Large Dataset for Image Captioning in Portuguese

Pith reviewed 2026-05-24 13:41 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords image captioning · Portuguese · Instagram dataset · multi-modal data · accessibility · user-generated captions · visually impaired

The pith

The #PraCegoVer dataset supplies the first large collection of Portuguese captions for image captioning drawn from Instagram posts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper collects user-posted images and their attached Portuguese descriptions from the #PraCegoVer social media movement to create a training resource for automatic image description. This resource addresses the scarcity of non-English data and adds realism by using single captions per image whose lengths vary more than those in English collections such as MS COCO. A reader would care because the resulting models could help visually impaired Portuguese speakers access image content on the web.

Core claim

We have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images. The captions introduce additional challenges: only one reference sentence per image and both higher mean length and higher variance than the MS COCO Captions dataset.
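The length comparison in this claim can be checked with a few lines of code. The following is a minimal sketch, assuming captions are available as plain strings and that length is counted in whitespace-separated words; the function name `caption_length_stats` and the toy captions are hypothetical and are not taken from the paper or from either dataset.

```python
from statistics import mean, pvariance


def caption_length_stats(captions):
    """Return (mean, population variance) of caption length in words.

    Length is counted by splitting on whitespace, which is an assumption;
    the paper may tokenize differently.
    """
    lengths = [len(caption.split()) for caption in captions]
    return mean(lengths), pvariance(lengths)


# Toy captions invented for illustration only.
short_style = ["a dog on a beach", "two people riding bikes"]
long_style = [
    "foto de um cachorro marrom correndo na areia da praia ao entardecer",
    "ilustração colorida",
]

print(caption_length_stats(short_style))
print(caption_length_stats(long_style))
```

Running the same function over the full caption lists of both datasets would give the two numbers the claim compares, provided the same word-counting rule is applied to each.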

What carries the argument

The #PraCegoVer dataset, a collection of Instagram images each paired with one user-written Portuguese caption collected via the accessibility hashtag.

If this is right

  • Image captioning models can now be trained and tested directly in Portuguese.
  • The single-reference format requires evaluation methods that do not assume multiple ground-truth sentences.
  • The longer and more variable captions test a model's ability to produce detailed rather than terse descriptions.
  • The public release allows immediate use by researchers working on accessibility tools for Portuguese speakers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training on these captions may improve robustness when models later encounter real social-media photographs rather than curated studio images.
  • The same hashtag-driven collection method could be repeated for other languages that lack large caption datasets.
  • Because each image has only one caption, new loss functions or ranking metrics may be needed to handle the resulting label noise.

Load-bearing premise

Captions written by Instagram users for the #PraCegoVer tag accurately and usefully describe the visual content of the attached images.

What would settle it

Manual inspection or automated analysis revealing that a large fraction of the collected captions fail to mention the main objects or actions visible in their images.
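One way such an automated analysis might look is sketched below. It assumes each image already comes with a list of expected object or action words (for example from a manual annotation pass or an object detector, neither of which is described in the source), and it uses a deliberately naive exact-word match. The name `fraction_missing_mentions` and the sample data are hypothetical.

```python
import re


def fraction_missing_mentions(samples):
    """Fraction of captions that mention none of their expected labels.

    samples: iterable of (caption, labels) pairs, where labels is a list of
    words expected to appear in a faithful caption (an assumed input).
    """
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one sample")
    missing = 0
    for caption, labels in samples:
        # Unicode-aware word split, so accented Portuguese words stay whole.
        words = set(re.findall(r"\w+", caption.lower()))
        if not any(label.lower() in words for label in labels):
            missing += 1
    return missing / len(samples)


# Toy data invented for illustration only.
samples = [
    ("Foto de um cachorro na praia.", ["cachorro", "praia"]),
    ("Promoção imperdível, confira o link!", ["perfume"]),
]
print(fraction_missing_mentions(samples))
```

Exact matching ignores synonyms and inflected forms, so a real check would likely need lemmatization or embedding similarity; the sketch only shows the shape of the measurement.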

Figures

Figures reproduced from arXiv: 2103.11474 by Esther Luna Colombini, Gabriel Oliveira dos Santos, Sandra Avila.

Figure 1: Diagram illustrating the pipeline of data collection. We start filtering the posts by hashtags and save the …
Figure 2: Two similar images posted on Instagram by two different profiles: User 1 and User 2. It can be seen that …
Figure 3: Distance matrix constructed from the pair-wise cosine distance based on (a) image features, and (b) text …
Figure 4: Similarity graphs considering only image features and considering both visual and textual features. It can be …
Figure 5: Overview of the whole pipeline from the data collection to the dataset split. First, we collect the data, clean …
Figure 6: Example of a real caption in which is tagged the hashtag …
Figure 7: The graph shows the cumulative explained variance by the number of components computed using the …
Figure 8: The total number of posts tagging #PraCegoVer (dashed line) and …
Figure 9: Histogram showing the number of clusters of images whose size is within each band. There is only one cluster …
Figure 10: A sample of images from a cluster whose majority of the images are related to Perfumes. We highlighted the …
Figure 11: Sample from a cluster of airplanes. It is worth noting the variety of positions of the airplanes, some images …
Figure 12: Samples from a cluster of birds. There is a diversity of species of birds as well as a variety of number of …
Figure 13: Samples from a cluster of cartoons. It worth noting that most of the cartoons are made by the same author, …
Figure 14: Samples from a cluster of cartoons. Note that despite the images present in this cluster being cartoons, they …
Figure 15: Samples from a cluster of informative texts. This cluster illustrates draft laws presented in Brazilian …
Figure 16: Word clouds showing the most frequent words in each topic found in the dataset. The topics were modeled …
Figure 17: Histogram of the distribution of captions by length in terms of number of words. We plot the caption length …
Figure 18: Histogram of word frequency of #PraCegoVer (blue) and MS COCO (red) datasets. We plot the number of words for each considering frequency range.
    Body text captured with this caption: … the Cross-Entropy Loss and then directly maximizing CIDEr-D score [29] using Self-Critical Sequence Training (SCST) [24]. We evaluated the models considering the same metrics used on MS COCO competition: BLEU [21], ROUGE [17], METEOR [2] and CIDEr-D [29]. 7.1 Resu…
Figure 19: Examples of images followed by their reference captions and the descriptions generated by the model trained …
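
The Figure 3 caption refers to a distance matrix built from pair-wise cosine distances between feature vectors. A minimal sketch of that construction follows, assuming each image or caption has already been reduced to a numeric feature vector; the feature extraction itself is not shown, and the names `cosine_distance` and `distance_matrix` and the toy vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(features):
    """Square matrix whose entry [i][j] is the distance between items i and j."""
    return [[cosine_distance(u, v) for v in features] for u in features]


# Toy feature vectors invented for illustration only.
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
matrix = distance_matrix(features)
for row in matrix:
    print(row)
```

In this toy input the first and third vectors point in the same direction, so their distance is zero even though their magnitudes differ, which is the property that makes cosine distance useful for spotting near-duplicate posts.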

Original abstract

Automatically describing images using natural sentences is an important task to support visually impaired people's inclusion onto the Internet. It is still a big challenge that requires understanding the relation of the objects present in the image and their attributes and actions they are involved in. Then, visual interpretation methods are needed, but linguistic models are also necessary to verbally describe the semantic relations. This problem is known as Image Captioning. Although many datasets were proposed in the literature, the majority contains only English captions, whereas datasets with captions described in other languages are scarce. Recently, a movement called PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Thus, inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images. Further, the captions in our dataset bring additional challenges to the problem: first, in contrast to popular datasets such as MS COCO Captions, #PraCegoVer has only one reference to each image; also, both mean and variance of our reference sentence length are significantly greater than those in the MS COCO Captions. These two characteristics contribute to making our dataset interesting due to the linguistic aspect and the challenges that it introduces to the image captioning problem. We publicly-share the dataset at https://github.com/gabrielsantosrv/PraCegoVer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces #PraCegoVer, a multimodal dataset of Instagram images paired with Portuguese captions scraped from posts tagged #PraCegoVer. It claims to be the first large-scale resource for image captioning in Portuguese, emphasizing that each image has a single reference caption whose mean length and variance exceed those in MS COCO Captions, thereby introducing new linguistic challenges.

Significance. If the user-generated captions can be shown to be accurate and relevant image descriptions, the dataset would address a clear scarcity of non-English resources and support development of captioning models for Portuguese, with potential benefits for accessibility applications.

major comments (2)
  1. [Data collection and dataset construction] The manuscript reports no automated filtering, manual review, inter-annotator agreement metrics, or sample-based quality assessment of the scraped captions. Because the central claim is that #PraCegoVer constitutes a usable training resource, the absence of evidence that captions are not vague, off-topic, or mismatched directly undermines the assertion that the dataset is suitable for training image captioning models.
  2. [Abstract] The statement that the captions 'bring additional challenges' due to single references and greater length and variance is presented without accompanying statistics (e.g., exact mean and standard deviation values, or a table comparing distributions to MS COCO), making it impossible to evaluate the claimed linguistic differences that are positioned as a contribution.
minor comments (1)
  1. [Abstract] The abstract contains a minor grammatical issue ('we have proposed the #PraCegoVer, a multi-modal dataset') that should be revised for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to the manuscript.

read point-by-point responses
  1. Referee: [Data collection and dataset construction] The manuscript reports no automated filtering, manual review, inter-annotator agreement metrics, or sample-based quality assessment of the scraped captions. Because the central claim is that #PraCegoVer constitutes a usable training resource, the absence of evidence that captions are not vague, off-topic, or mismatched directly undermines the assertion that the dataset is suitable for training image captioning models.

    Authors: The captions originate from real user posts in the #PraCegoVer accessibility movement and are therefore inherently variable and potentially noisy, which is a deliberate characteristic of the resource rather than a flaw. Comprehensive manual review or automated filtering was not performed owing to the dataset scale (over 600k images). Inter-annotator agreement is inapplicable because each image has exactly one caption by construction. We will add a dedicated subsection with randomly sampled captions, basic automated statistics (e.g., presence of common visual terms), and an explicit limitations paragraph discussing caption quality. This will allow readers to assess suitability for their use cases. revision: partial

  2. Referee: [Abstract] The statement that the captions 'bring additional challenges' due to single references and greater length and variance is presented without accompanying statistics (e.g., exact mean and standard deviation values, or a table comparing distributions to MS COCO), making it impossible to evaluate the claimed linguistic differences that are positioned as a contribution.

    Authors: We agree that the abstract would benefit from explicit numerical support. The full manuscript already contains length-distribution analysis and a comparison to MS COCO; we will move the key mean and standard-deviation figures into the abstract itself and ensure a compact comparison table appears in the main text or supplementary material. revision: yes
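
The first response proposes publishing randomly sampled captions so readers can judge quality themselves. A minimal sketch of a reproducible draw is shown below, assuming the captions fit in an in-memory list; the function name `sample_for_review`, the seed value, and the toy captions are hypothetical and do not describe the authors' actual procedure.

```python
import random


def sample_for_review(captions, k, seed=0):
    """Draw up to k captions without replacement, reproducibly.

    A fixed seed lets others regenerate exactly the same review batch.
    """
    rng = random.Random(seed)
    pool = list(captions)
    return rng.sample(pool, min(k, len(pool)))


# Toy captions invented for illustration only.
captions = [f"legenda de exemplo {i}" for i in range(10)]
review_batch = sample_for_review(captions, 3, seed=42)
print(review_batch)
```

Publishing the seed alongside the sample would let a reader confirm that the displayed captions were not hand-picked.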

Circularity Check

0 steps flagged

No circularity: dataset release paper with no derivations or fitted quantities

full rationale

The paper describes collection and release of the #PraCegoVer Instagram-derived Portuguese caption dataset. No equations, predictions, parameters, or first-principles derivations are present; the central claim is simply that the scraped posts constitute a usable resource. No load-bearing step reduces to a self-definition, fitted input, or self-citation chain. This is the expected non-finding for a pure data paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, mathematical axioms, or invented entities are required; the work is a data curation effort whose central claim rests on the assumption that the collected Instagram captions are usable as-is.



Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 2 internal anchors

  1. [1] Agrawal, H., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., Lee, S., and Anderson, P. nocaps: novel object captioning at scale. In IEEE/CVF International Conference on Computer Vision (2019), pp. 8948–8957.
  2. [2] Banerjee, S., and Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In The ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (2005), Association for Computational Linguistics, pp. 65–72.
  3. [3] Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.
  4. [4] Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., and Zitnick, C. L. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015).
  5. [5] Chunseong Park, C., Kim, B., and Kim, G. Attend to you: Personalized image captioning with context sequence memory networks. In IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 895–
  6. [6] Elliott, D., and Keller, F. Image description using visual dependency representations. In Conference on Empirical Methods in Natural Language Processing (2013), pp. 1292–1302.
  7. [7] Farhadi, A., Hejrati, M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth, D. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision (2010), Springer, pp. 15–29.
  8. [8] Gan, C., Gan, Z., He, X., Gao, J., and Deng, L. Stylenet: Generating attractive visual captions with styles. In IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 3137–3146.
  9. [9] Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. Datasheets for datasets. arXiv preprint arXiv:1803.09010 (2020).
  10. [10] Gurari, D., Zhao, Y., Zhang, M., and Bhattacharya, N. Captioning images taken by people who are blind. In European Conference on Computer Vision (2020), Springer, pp. 417–434.
  11. [11] Harwath, D., and Glass, J. Deep multimodal semantic embeddings for speech and images. In IEEE Workshop on Automatic Speech Recognition and Understanding (2015), pp. 237–244.
  12. [12] Hodosh, M., Young, P., and Hockenmaier, J. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research 47, 1 (May 2013), 853–899.
  13. [13] Huang, L., Wang, W., Chen, J., and Wei, X.-Y. Attention on attention for image captioning. In IEEE International Conference on Computer Vision (October 2019).
  14. [14] Kong, C., Lin, D., Bansal, M., Urtasun, R., and Fidler, S. What are you talking about? text-to-image coreference. In IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 3558–3565.
  15. [15] Krasin, I., Duerig, T., Alldrin, N., Ferrari, V., Abu-El-Haija, S., Kuznetsova, A., Rom, H., Uijlings, J., Popov, S., Veit, A., et al. Openimages: A public dataset for large-scale multi-label and multi-class image classification. 18.
  16. [16] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123, 1 (2017), 32–73.
  17. [17] Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (2004), pp. 74–81.
  18. [18] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (2014), Springer, pp. 740–755.
  19. [19] McInnes, L., Healy, J., and Astels, S. HDBSCAN: Hierarchical density based clustering. Journal of Open Source Software 2, 11 (2017), 205.
  20. [20] McInnes, L., Healy, J., and Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018).
  21. [21] Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. In Annual Meeting on Association for Computational Linguistics (2002), pp. 311–318.
  22. [22] Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., and Lazebnik, S. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. International Journal of Computer Vision 123, 1 (May 2017), 74–93.
  23. [23] Rashtchian, C., Young, P., Hodosh, M., and Hockenmaier, J. Collecting image annotations using Amazon’s Mechanical Turk. In Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk (2010), pp. 139–147.
  24. [24] Rennie, S. J., Marcheret, E., Mroueh, Y., Ross, J., and Goel, V. Self-critical sequence training for image captioning. In IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 7008–7024.
  25. [25] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetv2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 4510–4520.
  26. [26] Sharma, P., Ding, N., Goodman, S., and Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2018), pp. 2556–2565.
  27. [27] Sidorov, O., Hu, R., Rohrbach, M., and Singh, A. Textcaps: a dataset for image captioning with reading comprehension. In European Conference on Computer Vision (2020), Springer, pp. 742–758.
  28. [28] Tipping, M. E., and Bishop, C. M. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61, 3 (1999), 611–622.
  29. [29] Vedantam, R., Zitnick, C. L., and Parikh, D. CIDEr: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 4566–4575.
  30. [30] Web para Todos. Criadora do projeto #PraCegoVer incentiva a descrição de imagens na web. http://mwpt.com.br/criadora-do-projeto-pracegover-incentiva-descricao-de-imagens-na-web, 2018.
  31. [31] Zitnick, C. L., Parikh, D., and Vanderwende, L. Learning the visual interpretation of sentences. In IEEE International Conference on Computer Vision (2013), pp. 1681–1688.