#PraCegoVer: A Large Dataset for Image Captioning in Portuguese

Esther Luna Colombini; Gabriel Oliveira dos Santos; Sandra Avila

arxiv: 2103.11474 · v2 · submitted 2021-03-21 · 💻 cs.CV · cs.CL

Pith reviewed 2026-05-24 13:41 UTC · model grok-4.3

keywords: image captioning · Portuguese · Instagram dataset · multi-modal data · accessibility · user-generated captions · visually impaired

The pith

The #PraCegoVer dataset supplies the first large collection of Portuguese captions for image captioning drawn from Instagram posts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper collects user-posted images and their attached Portuguese descriptions from the #PraCegoVer social media movement to create a training resource for automatic image description. This resource addresses the scarcity of non-English data and adds realism by using single captions per image whose lengths vary more than those in English collections such as MS COCO. A reader would care because the resulting models could help visually impaired Portuguese speakers access image content on the web.

Core claim

We have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images. The captions introduce additional challenges: only one reference sentence per image and both higher mean length and higher variance than the MS COCO Captions dataset.
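The length claim can be checked with very little code. The following is a minimal sketch, assuming captions are available as plain strings and that length is counted in whitespace-separated words; the function name `caption_length_stats` and the toy captions are invented for illustration and are not taken from either dataset.

```python
from statistics import mean, pvariance


def caption_length_stats(captions):
    """Return (mean, population variance) of caption lengths in words."""
    lengths = [len(caption.split()) for caption in captions]
    return mean(lengths), pvariance(lengths)


# Toy captions, invented for illustration only (not real dataset entries).
short_style = ["a dog on grass", "two men play ball", "a red bus parked"]
long_style = [
    "foto de uma mulher sorrindo em frente ao mar",
    "ilustração com fundo azul",
    "imagem de um grupo de pessoas reunidas em volta de uma mesa grande de madeira",
]

short_mean, short_var = caption_length_stats(short_style)
long_mean, long_var = caption_length_stats(long_style)
print(f"short style: mean={short_mean:.2f}, variance={short_var:.2f}")
print(f"long style:  mean={long_mean:.2f}, variance={long_var:.2f}")
```

Running the same function over the real captions of both datasets would give the numbers needed for a side-by-side comparison; a different tokenizer would shift the absolute values.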

What carries the argument

The #PraCegoVer dataset, a collection of Instagram images each paired with one user-written Portuguese caption collected via the accessibility hashtag.

If this is right

Image captioning models can now be trained and tested directly in Portuguese.
The single-reference format requires evaluation methods that do not assume multiple ground-truth sentences.
The longer and more variable captions test a model's ability to produce detailed rather than terse descriptions.
The public release allows immediate use by researchers working on accessibility tools for Portuguese speakers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training on these captions may improve robustness when models later encounter real social-media photographs rather than curated studio images.
The same hashtag-driven collection method could be repeated for other languages that lack large caption datasets.
Because each image has only one caption, new loss functions or ranking metrics may be needed to handle the resulting label noise.

Load-bearing premise

Captions written by Instagram users for the #PraCegoVer tag accurately and usefully describe the visual content of the attached images.

What would settle it

Manual inspection or automated analysis revealing that a large fraction of the collected captions fail to mention the main objects or actions visible in their images.
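One way such an automated analysis could look is sketched below. It assumes each image already comes with a list of object labels from some external source (for example an object detector or a human annotator); the label lists, the sample captions, and the helper names `mentions_any` and `mismatch_rate` are hypothetical and are not part of the paper.

```python
import re


def mentions_any(caption, labels):
    """True if the caption contains at least one label as a whole word."""
    words = set(re.findall(r"\w+", caption.lower()))
    return any(label.lower() in words for label in labels)


def mismatch_rate(samples):
    """Fraction of (caption, labels) pairs whose caption mentions no label."""
    misses = sum(1 for caption, labels in samples if not mentions_any(caption, labels))
    return misses / len(samples)


# Hypothetical samples: (caption, labels assumed to be visible in the image).
samples = [
    ("Foto de um cachorro correndo na praia", ["cachorro", "praia"]),
    ("Promoção imperdível, confira o link na bio", ["perfume"]),
    ("Ilustração de um avião sobre a cidade", ["avião"]),
    ("Bom dia a todos", ["pássaro", "árvore"]),
]

rate = mismatch_rate(samples)
print(f"captions mentioning none of the labels: {rate:.0%}")
```

Exact word matching is deliberately crude: it misses plurals and synonyms, so a real audit would likely need lemmatization or embedding similarity, and the resulting rate should be read as an upper bound on mismatches.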

Figures

Figures reproduced from arXiv: 2103.11474 by Esther Luna Colombini, Gabriel Oliveira dos Santos, Sandra Avila.

**Figure 1.** Diagram illustrating the pipeline of data collection. We start filtering the posts by hashtags and save the …

**Figure 2.** Two similar images posted on Instagram by two different profiles: User 1 and User 2. It can be seen that …

**Figure 3.** Distance matrix constructed from the pair-wise cosine distance based on (a) image features, and (b) text …

**Figure 4.** Similarity graphs considering only image features and considering both visual and textual features. It can be …

**Figure 5.** Overview of the whole pipeline from the data collection to the dataset split. First, we collect the data, clean …

**Figure 6.** Example of a real caption in which is tagged the hashtag …

**Figure 7.** The graph shows the cumulative explained variance by the number of components computed using the …

**Figure 8.** The total number of posts tagging #PraCegoVer (dashed line) and …

**Figure 9.** Histogram showing the number of clusters of images whose size is within each band. There is only one cluster …

**Figure 10.** A sample of images from a cluster whose majority of the images are related to Perfumes. We highlighted the …

**Figure 11.** Sample from a cluster of airplanes. It is worth noting the variety of positions of the airplanes, some images …

**Figure 12.** Samples from a cluster of birds. There is a diversity of species of birds as well as a variety of number of …

**Figure 13.** Samples from a cluster of cartoons. It is worth noting that most of the cartoons are made by the same author, …

**Figure 14.** Samples from a cluster of cartoons. Note that despite the images present in this cluster being cartoons, they …

**Figure 15.** Samples from a cluster of informative texts. This cluster illustrates draft laws presented in Brazilian …

**Figure 16.** Word clouds showing the most frequent words in each topic found in the dataset. The topics were modeled …

**Figure 17.** Histogram of the distribution of captions by length in terms of number of words. We plot the caption length …

**Figure 18.** Histogram of word frequency of #PraCegoVer (blue) and MS COCO (red) datasets. We plot the number of words for each considering frequency range.

**Figure 19.** Examples of images followed by their reference captions and the descriptions generated by the model trained …
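The caption of Figure 3 refers to a distance matrix built from pair-wise cosine distances. As a rough illustration of that idea, the sketch below computes such a matrix in plain Python. The two-dimensional feature vectors are invented for the example, and how the paper actually extracts image and text features is not shown on this page, so nothing here should be read as the authors' implementation.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) for two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(features):
    """Symmetric matrix of pair-wise cosine distances between feature vectors."""
    n = len(features)
    return [
        [cosine_distance(features[i], features[j]) for j in range(n)]
        for i in range(n)
    ]


# Invented feature vectors: the first two point in the same direction,
# the third is orthogonal to both.
features = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
matrix = distance_matrix(features)
for row in matrix:
    print(["%.2f" % value for value in row])
```

Entries near 0 indicate vectors pointing in nearly the same direction, which is the signal one would look for when searching for near-duplicate posts; entries near 1 indicate unrelated content.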

Original abstract

Automatically describing images using natural sentences is an important task to support visually impaired people's inclusion onto the Internet. It is still a big challenge that requires understanding the relation of the objects present in the image and their attributes and actions they are involved in. Then, visual interpretation methods are needed, but linguistic models are also necessary to verbally describe the semantic relations. This problem is known as Image Captioning. Although many datasets were proposed in the literature, the majority contains only English captions, whereas datasets with captions described in other languages are scarce. Recently, a movement called PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Thus, inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images. Further, the captions in our dataset bring additional challenges to the problem: first, in contrast to popular datasets such as MS COCO Captions, #PraCegoVer has only one reference to each image; also, both mean and variance of our reference sentence length are significantly greater than those in the MS COCO Captions. These two characteristics contribute to making our dataset interesting due to the linguistic aspect and the challenges that it introduces to the image captioning problem. We publicly-share the dataset at https://github.com/gabrielsantosrv/PraCegoVer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper releases the first large Portuguese image captioning dataset scraped from Instagram but reports no checks on whether the user captions actually describe the images.

The main takeaway is a new dataset, #PraCegoVer, built from Instagram posts tagged with that hashtag. It is presented as the first sizable Portuguese resource for image captioning, with single references per image and longer sentences on average than MS COCO. The authors release it publicly on GitHub and note these differences as sources of additional challenge for models. That fills a clear gap for a language spoken by hundreds of millions where captioning data has been scarce, and the release itself is a concrete step that others can use for accessibility work or multilingual experiments. The paper does not claim new models or performance numbers, so the contribution stays at the data level.

The soft spot is the lack of any described validation or filtering. The captions are taken directly from user posts without reported automated cleaning, manual review, or inter-annotator checks on fidelity. If a noticeable share of them are vague, off-topic, or mismatched to the images, the dataset cannot reliably serve as training targets. The abstract focuses on collection and scale rather than quality metrics, which leaves the usability claim resting on an untested assumption.

This work is for groups that need non-English caption data or want to test models on single-reference, longer-sentence regimes. Readers working on vision-language tasks in Portuguese or other under-resourced languages will get direct value from the release. It shows straightforward engagement with the literature on dataset gaps. I would bring it to a reading group focused on multilingual or accessibility topics. It is worth citing if you need a Portuguese starting point. The paper deserves peer review because the data release addresses a documented shortage, even if reviewers will need to press on the caption quality steps.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces #PraCegoVer, a multimodal dataset of Instagram images paired with Portuguese captions scraped from posts tagged #PraCegoVer. It claims to be the first large-scale resource for image captioning in Portuguese, emphasizing that each image has a single reference caption whose mean length and variance exceed those in MS COCO Captions, thereby introducing new linguistic challenges.

Significance. If the user-generated captions can be shown to be accurate and relevant image descriptions, the dataset would address a clear scarcity of non-English resources and support development of captioning models for Portuguese, with potential benefits for accessibility applications.

major comments (2)

[Data collection and dataset construction] The data collection process (described in the abstract and implied in the methods): the manuscript provides no automated filtering, manual review, inter-annotator agreement metrics, or sample-based quality assessment of the scraped captions. Because the central claim is that #PraCegoVer constitutes a usable training resource, the absence of evidence that captions are not vague, off-topic, or mismatched directly undermines the assertion that the dataset is suitable for image captioning models.
[Abstract] Abstract: the statement that the captions 'bring additional challenges' due to single references and greater length/variance is presented without accompanying statistics (e.g., exact mean and standard deviation values, or a table comparing distributions to MS COCO), making it impossible to evaluate the claimed linguistic differences that are positioned as a contribution.

minor comments (1)

[Abstract] The abstract contains a minor grammatical issue ('we have proposed the #PraCegoVer, a multi-modal dataset') that should be revised for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to the manuscript.

Referee: [Data collection and dataset construction] The data collection process (described in the abstract and implied in the methods): the manuscript provides no automated filtering, manual review, inter-annotator agreement metrics, or sample-based quality assessment of the scraped captions. Because the central claim is that #PraCegoVer constitutes a usable training resource, the absence of evidence that captions are not vague, off-topic, or mismatched directly undermines the assertion that the dataset is suitable for image captioning models.

Authors: The captions originate from real user posts in the #PraCegoVer accessibility movement and are therefore inherently variable and potentially noisy, which is a deliberate characteristic of the resource rather than a flaw. Comprehensive manual review or automated filtering was not performed owing to the dataset scale (over 600k images). Inter-annotator agreement is inapplicable because each image has exactly one caption by construction. We will add a dedicated subsection with randomly sampled captions, basic automated statistics (e.g., presence of common visual terms), and an explicit limitations paragraph discussing caption quality. This will allow readers to assess suitability for their use cases.
revision: partial
Referee: [Abstract] Abstract: the statement that the captions 'bring additional challenges' due to single references and greater length/variance is presented without accompanying statistics (e.g., exact mean and standard deviation values, or a table comparing distributions to MS COCO), making it impossible to evaluate the claimed linguistic differences that are positioned as a contribution.

Authors: We agree that the abstract would benefit from explicit numerical support. The full manuscript already contains length-distribution analysis and a comparison to MS COCO; we will move the key mean and standard-deviation figures into the abstract itself and ensure a compact comparison table appears in the main text or supplementary material.
revision: yes

Circularity Check

0 steps flagged

No circularity: dataset release paper with no derivations or fitted quantities

The paper describes collection and release of the #PraCegoVer Instagram-derived Portuguese caption dataset. No equations, predictions, parameters, or first-principles derivations are present; the central claim is simply that the scraped posts constitute a usable resource. No load-bearing step reduces to a self-definition, fitted input, or self-citation chain. This is the expected non-finding for a pure data paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, mathematical axioms, or invented entities are required; the work is a data curation effort whose central claim rests on the assumption that the collected Instagram captions are usable as-is.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 2 internal anchors

[1] AGRAWAL, H., DESAI, K., WANG, Y., CHEN, X., JAIN, R., JOHNSON, M., BATRA, D., PARIKH, D., LEE, S., AND ANDERSON, P. nocaps: novel object captioning at scale. In IEEE/CVF International Conference on Computer Vision (2019), pp. 8948–8957.

[2] BANERJEE, S., AND LAVIE, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In The ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (2005), Association for Computational Linguistics, pp. 65–72.

[3] BLEI, D. M., NG, A. Y., AND JORDAN, M. I. Latent dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.

[4] CHEN, X., FANG, H., LIN, T.-Y., VEDANTAM, R., GUPTA, S., DOLLÁR, P., AND ZITNICK, C. L. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015).

[5] CHUNSEONG PARK, C., KIM, B., AND KIM, G. Attend to you: Personalized image captioning with context sequence memory networks. In IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 895–

[6] ELLIOTT, D., AND KELLER, F. Image description using visual dependency representations. In Conference on Empirical Methods in Natural Language Processing (2013), pp. 1292–1302.

[7] FARHADI, A., HEJRATI, M., SADEGHI, M. A., YOUNG, P., RASHTCHIAN, C., HOCKENMAIER, J., AND FORSYTH, D. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision (2010), Springer, pp. 15–29.

[8] GAN, C., GAN, Z., HE, X., GAO, J., AND DENG, L. Stylenet: Generating attractive visual captions with styles. In IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 3137–3146.

[9] GEBRU, T., MORGENSTERN, J., VECCHIONE, B., VAUGHAN, J. W., WALLACH, H., DAUMÉ III, H., AND CRAWFORD, K. Datasheets for datasets. arXiv preprint arXiv:1803.09010 (2020).

[10] GURARI, D., ZHAO, Y., ZHANG, M., AND BHATTACHARYA, N. Captioning images taken by people who are blind. In European Conference on Computer Vision (2020), Springer, pp. 417–434.

[11] HARWATH, D., AND GLASS, J. Deep multimodal semantic embeddings for speech and images. In IEEE Workshop on Automatic Speech Recognition and Understanding (2015), pp. 237–244.

[12] HODOSH, M., YOUNG, P., AND HOCKENMAIER, J. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research 47, 1 (May 2013), 853–899.

[13] HUANG, L., WANG, W., CHEN, J., AND WEI, X.-Y. Attention on attention for image captioning. In IEEE International Conference on Computer Vision (October 2019).

[14] KONG, C., LIN, D., BANSAL, M., URTASUN, R., AND FIDLER, S. What are you talking about? text-to-image coreference. In IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 3558–3565.

[15] KRASIN, I., DUERIG, T., ALLDRIN, N., FERRARI, V., ABU-EL-HAIJA, S., KUZNETSOVA, A., ROM, H., UIJLINGS, J., POPOV, S., VEIT, A., ET AL. Openimages: A public dataset for large-scale multi-label and multi-class image classification. 18.

[16] KRISHNA, R., ZHU, Y., GROTH, O., JOHNSON, J., HATA, K., KRAVITZ, J., CHEN, S., KALANTIDIS, Y., LI, L.-J., SHAMMA, D. A., ET AL. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123, 1 (2017), 32–73.

[17] LIN, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (2004), pp. 74–81.

[18] LIN, T.-Y., MAIRE, M., BELONGIE, S., HAYS, J., PERONA, P., RAMANAN, D., DOLLÁR, P., AND ZITNICK, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (2014), Springer, pp. 740–755.

[19] MCINNES, L., HEALY, J., AND ASTELS, S. HDBSCAN: Hierarchical density based clustering. Journal of Open Source Software 2, 11 (2017), 205.

[20] MCINNES, L., HEALY, J., AND MELVILLE, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018).

[21] PAPINENI, K., ROUKOS, S., WARD, T., AND ZHU, W.-J. BLEU: A method for automatic evaluation of machine translation. In Annual Meeting on Association for Computational Linguistics (2002), pp. 311–318.

[22] PLUMMER, B. A., WANG, L., CERVANTES, C. M., CAICEDO, J. C., HOCKENMAIER, J., AND LAZEBNIK, S. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. International Journal of Computer Vision 123, 1 (May 2017), 74–93.

[23] RASHTCHIAN, C., YOUNG, P., HODOSH, M., AND HOCKENMAIER, J. Collecting image annotations using amazon’s mechanical turk. In Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk (2010), pp. 139–147.

[24] RENNIE, S. J., MARCHERET, E., MROUEH, Y., ROSS, J., AND GOEL, V. Self-critical sequence training for image captioning. In IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 7008–7024.

[25] SANDLER, M., HOWARD, A., ZHU, M., ZHMOGINOV, A., AND CHEN, L.-C. MobileNetv2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 4510–4520.

[26] SHARMA, P., DING, N., GOODMAN, S., AND SORICUT, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2018), pp. 2556–2565.

[27] SIDOROV, O., HU, R., ROHRBACH, M., AND SINGH, A. Textcaps: a dataset for image captioning with reading comprehension. In European Conference on Computer Vision (2020), Springer, pp. 742–758.

[28] TIPPING, M. E., AND BISHOP, C. M. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61, 3 (1999), 611–622.

[29] VEDANTAM, R., ZITNICK, C. L., AND PARIKH, D. CIDEr: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 4566–4575.

[30] WEB PARA TODOS. Criadora do projeto #PraCegoVer incentiva a descrição de imagens na web. http://mwpt.com.br/criadora-do-projeto-pracegover-incentiva-descricao-de-imagens-na-web, 2018.

[31] ZITNICK, C. L., PARIKH, D., AND VANDERWENDE, L. Learning the visual interpretation of sentences. In IEEE International Conference on Computer Vision (2013), pp. 1681–1688.
