Who Shapes Brazil's Vaccine Debate? Semi-Supervised Modeling of Stance and Polarization in YouTube's Media Ecosystem
Pith reviewed 2026-05-15 16:11 UTC · model grok-4.3
The pith
Semi-supervised modeling of 1.4 million YouTube comments shows science communicators and digital-native channels host the main pro- and anti-vaccine engagement in Brazil.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Integrating stance labels from the semi-supervised framework with temporal patterns, engagement metrics, and channel types shows that polarization spikes during epidemiological crises but becomes fragmented across vaccines and interaction patterns in the post-pandemic period, with science communication and digital-native channels serving as the primary loci of both supportive and oppositional engagement.
What carries the argument
A semi-supervised stance detection framework that combines self-labeling and self-training to classify comments as pro- or anti-vaccine. The resulting stance labels are then integrated with the channel taxonomy and temporal engagement data.
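The self-training half of such a framework can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy Naive Bayes classifier, the function names (`train_nb`, `predict_proba`, `self_train`), the 0.8 confidence threshold, and the example comments are all assumptions, and the self-labeling step that produces the seed labels is not shown.

```python
import math
from collections import Counter


def train_nb(texts, labels):
    """Fit a tiny multinomial Naive Bayes model on whitespace tokens."""
    word_counts = {}               # label -> Counter of token counts
    class_counts = Counter(labels)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return {"words": word_counts, "classes": class_counts, "vocab": vocab}


def predict_proba(model, text):
    """Return {label: probability} from Laplace-smoothed log-likelihoods."""
    total_docs = sum(model["classes"].values())
    vocab_size = len(model["vocab"])
    log_scores = {}
    for label, n_docs in model["classes"].items():
        counts = model["words"][label]
        denom = sum(counts.values()) + vocab_size
        score = math.log(n_docs / total_docs)
        for token in text.lower().split():
            if token in model["vocab"]:   # tokens never seen in training are skipped
                score += math.log((counts[token] + 1) / denom)
        log_scores[label] = score
    peak = max(log_scores.values())       # subtract the max for numerical stability
    exp_scores = {k: math.exp(v - peak) for k, v in log_scores.items()}
    norm = sum(exp_scores.values())
    return {k: v / norm for k, v in exp_scores.items()}


def self_train(seed_texts, seed_labels, unlabeled, threshold=0.8, max_rounds=5):
    """Iteratively move confidently predicted comments into the training set."""
    texts, labels = list(seed_texts), list(seed_labels)
    pool = list(unlabeled)
    model = train_nb(texts, labels)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for text in pool:
            proba = predict_proba(model, text)
            label, confidence = max(proba.items(), key=lambda kv: kv[1])
            if confidence >= threshold:
                texts.append(text)        # accept the pseudo-label
                labels.append(label)
                added += 1
            else:
                remaining.append(text)    # leave it unlabeled for a later round
        pool = remaining
        if added == 0:                    # nothing new was learned, so stop
            break
        model = train_nb(texts, labels)
    return model, pool


# Hypothetical seed labels and unlabeled comments, for illustration only.
seed_texts = ["vacina salva vidas", "vacina segura eficaz",
              "vacina mata perigo", "vacina veneno perigo"]
seed_labels = ["pro", "pro", "anti", "anti"]
unlabeled = ["vacina salva", "veneno perigo", "bom dia"]
model, leftover = self_train(seed_texts, seed_labels, unlabeled)
```

Because each round trains on the model's own confident predictions, any early systematic error can be reinforced rather than corrected, which is why validation against labels held out from the seed set matters.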
If this is right
- Public health agencies gain a way to monitor attitude shifts across the entire immunization schedule rather than isolated vaccines.
- Polarization patterns can be tracked in real time during future health crises to guide communication timing.
- Science communication and digital-native channels become priority targets for both supportive messaging and countering opposition.
- Fragmented post-pandemic polarization implies that uniform national strategies may be less effective than vaccine-specific approaches.
Where Pith is reading between the lines
- The same semi-supervised method could be tested on other languages or platforms to check whether science and digital-native channels play similar roles elsewhere.
- Engagement metrics combined with stance could serve as early signals for rising misinformation around new vaccines.
- Channel taxonomy suggests traditional legacy media play a secondary role, pointing to a structural shift in where health debates now occur.
Load-bearing premise
The semi-supervised stance detection framework classifies comments accurately, without substantial bias introduced by the labeling process, and YouTube comments are reasonably representative of broader Brazilian public attitudes.
What would settle it
Manual annotation of a random sample of several thousand comments would confirm or refute the accuracy of the automated stance labels. A direct comparison against independent national surveys of vaccine attitudes would test whether the comment-level stances reflect broader public opinion.
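A check of this kind could look roughly like the sketch below: draw a reproducible random sample of comment IDs for manual annotation, then score the automated labels against the human ones per class. The helper names (`draw_sample`, `per_class_metrics`), the `pro`/`anti` label strings, and the toy labels are assumptions made for illustration.

```python
import random


def draw_sample(comment_ids, k, seed=0):
    """Draw k distinct comment IDs uniformly at random, reproducibly."""
    return random.Random(seed).sample(list(comment_ids), k)


def per_class_metrics(gold, predicted, labels=("pro", "anti")):
    """Compute precision, recall, and F1 for each stance label."""
    pairs = list(zip(gold, predicted))
    metrics = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Toy example: human annotations versus automated labels.
gold = ["pro", "pro", "anti", "anti", "pro"]
predicted = ["pro", "anti", "anti", "anti", "pro"]
scores = per_class_metrics(gold, predicted)
```

Reporting the metrics per class, rather than a single accuracy figure, makes it visible when one stance is classified much less reliably than the other.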
Original abstract
Vaccination remains a cornerstone of global public health, yet the COVID-19 pandemic exposed how online misinformation, political polarization, and declining institutional trust can undermine immunization efforts. Most of the prior computational studies that analyzed vaccine discourse on social platforms focus on English-language data, specific vaccines, or short time windows, impairing our understanding of long-term dynamics in high-impact, non-English contexts like Brazil, home to one of the world's most comprehensive immunization systems. We here present the largest longitudinal study of Brazil's vaccine discourse on YouTube, leveraging a semi-supervised stance detection framework that combines self-labeling and self-training to classify nearly 1.4 million comments. By integrating stance with temporal patterns, engagement metrics, and channel taxonomy (legacy media, science communicators, digital-native outlets), we map how pro- and anti-vaccine narratives evolve and circulate within a hybrid media ecosystem. Our results show that semi-supervised learning substantially improves stance classification robustness, enabling fine-grained tracking of public attitudes across Brazil's full immunization schedule. Polarization spikes during epidemiological crises, especially COVID-19, but becomes fragmented across vaccines and interaction patterns in the post-pandemic period. Notably, science communication and digital-native channels emerge as the primary loci of both supportive and oppositional engagement, revealing structural vulnerabilities in contemporary health communication. Thus, our work advances computational methods for large-scale stance modeling while offering actionable evidence for public health agencies, platform governance, and online information ecosystems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the largest longitudinal study of Brazil's vaccine discourse on YouTube, classifying nearly 1.4 million comments via a semi-supervised stance detection framework that combines self-labeling and self-training. Integrating stance labels with temporal patterns, engagement metrics, and a channel taxonomy (legacy media, science communicators, digital-native outlets), it claims that semi-supervised learning substantially improves classification robustness, that polarization spikes during epidemiological crises (especially COVID-19) but fragments post-pandemic, and that science communication and digital-native channels are the primary loci of both supportive and oppositional engagement.
Significance. If the stance classifications prove accurate and low-bias, the work would constitute a significant contribution by providing the first large-scale, long-term mapping of vaccine attitudes in a non-English, high-impact public-health context, advancing semi-supervised methods for stance modeling while generating actionable evidence on media-ecosystem vulnerabilities for public-health agencies and platform governance.
major comments (2)
- [Methods] Methods section: The central claim that semi-supervised learning (self-labeling + self-training) substantially improves stance classification robustness is unsupported by any reported held-out validation metrics. No precision, recall, or F1 scores on a manually annotated test set separate from the seed labels are provided, nor are ablation results isolating the self-training gain or inter-annotator agreement for the initial seeds. This is load-bearing for all downstream polarization and channel-taxonomy findings.
- [Results] Results section: Without an error analysis or bias audit of the iterative labeling process, it is impossible to rule out systematic misclassification (e.g., differential performance on anti-vaccine comments), which would propagate into the reported spikes during COVID-19 and the post-pandemic fragmentation across vaccines and interaction patterns.
minor comments (1)
- [Abstract] Abstract and §1: The exact number of comments after filtering, the precise definition of the channel taxonomy, and the temporal window boundaries should be stated explicitly for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. The comments highlight important gaps in the validation of our semi-supervised stance detection pipeline and the need for greater transparency regarding potential classification biases. We address each point below and will incorporate the suggested analyses into a revised manuscript.
- Referee: [Methods] Methods section: The central claim that semi-supervised learning (self-labeling + self-training) substantially improves stance classification robustness is unsupported by any reported held-out validation metrics. No precision, recall, or F1 scores on a manually annotated test set separate from the seed labels are provided, nor are ablation results isolating the self-training gain or inter-annotator agreement for the initial seeds. This is load-bearing for all downstream polarization and channel-taxonomy findings.
  Authors: We acknowledge that the manuscript as submitted does not include held-out validation metrics, ablation studies, or inter-annotator agreement statistics for the seed labels. This omission weakens the support for our claim of improved robustness. In the revision we will add a new subsection to the Methods that describes the creation of a manually annotated held-out test set (distinct from the seed labels), reports inter-annotator agreement, and presents precision, recall, and F1 scores for both a supervised baseline and the full semi-supervised model. We will also include ablation experiments that isolate the contribution of the self-training stage. These additions will directly substantiate the methodological claims before the downstream polarization analyses.
  Revision: yes
- Referee: [Results] Results section: Without an error analysis or bias audit of the iterative labeling process, it is impossible to rule out systematic misclassification (e.g., differential performance on anti-vaccine comments), which would propagate into the reported spikes during COVID-19 and the post-pandemic fragmentation across vaccines and interaction patterns.
  Authors: We agree that the absence of an error analysis leaves open the possibility of systematic misclassification, particularly for anti-vaccine content. In the revised manuscript we will insert a dedicated error-analysis subsection in the Results. This will include (1) a manual review of a stratified sample of comments labeled pro- and anti-vaccine by the final model, (2) quantitative assessment of differential error rates across stance classes and time periods, and (3) discussion of how any observed biases could affect the reported temporal spikes and post-pandemic fragmentation patterns. We will also add a limitations paragraph that explicitly addresses the implications for the channel-taxonomy findings.
  Revision: yes
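Two of the analyses promised in the rebuttal can be sketched in a few lines. The example below assumes two annotators and uses Cohen's kappa for their agreement (the rebuttal does not say which agreement statistic will be reported), then computes error rates per stratum, where a stratum is a hypothetical (stance, period) pair. Function names and toy data are illustrative assumptions.

```python
from collections import Counter, defaultdict


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labeled independently at their own rates.
    expected = sum((freq_a[label] / n) * (freq_b[label] / n)
                   for label in set(freq_a) | set(freq_b))
    if expected == 1.0:        # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


def error_rates_by_group(records):
    """records: iterable of (group_key, gold_label, predicted_label) tuples."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, gold, predicted in records:
        totals[group] += 1
        if gold != predicted:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Toy annotations from two hypothetical annotators.
annotator_a = ["pro", "pro", "anti", "anti"]
annotator_b = ["pro", "anti", "anti", "anti"]
kappa = cohen_kappa(annotator_a, annotator_b)

# Toy audit records keyed by (gold stance, period).
records = [
    (("anti", "2021"), "anti", "pro"),
    (("anti", "2021"), "anti", "anti"),
    (("pro", "2021"), "pro", "pro"),
]
rates = error_rates_by_group(records)
```

A markedly higher error rate in one stratum, for example anti-vaccine comments in a particular period, would indicate the kind of differential misclassification the referee is concerned about.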
Circularity Check
No significant circularity in semi-supervised stance modeling
The paper's core pipeline ingests raw YouTube comments, applies self-labeling plus self-training to produce stance labels, then derives temporal polarization, channel taxonomy, and engagement statistics from those labels. No step equates an output quantity to its own input by definition, renames a fitted parameter as a prediction, or relies on a self-citation chain to establish uniqueness. The semi-supervised process operates on new data without presupposing the polarization or fragmentation results it later reports, rendering the derivation self-contained.
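The downstream step of that pipeline, turning per-comment stance labels into per-period, per-channel-type statistics, might be sketched as below. The tuple layout, the channel-type strings, and the use of binary entropy as a balance score are assumptions for illustration; the source does not specify how the paper quantifies polarization.

```python
import math
from collections import defaultdict


def stance_balance(comments):
    """Summarize stance per (period, channel_type).

    comments: iterable of (period, channel_type, stance) tuples,
    where stance is "pro" or "anti" (any other value raises KeyError).
    """
    counts = defaultdict(lambda: {"pro": 0, "anti": 0})
    for period, channel_type, stance in comments:
        counts[(period, channel_type)][stance] += 1

    summary = {}
    for key, c in counts.items():
        total = c["pro"] + c["anti"]
        anti_share = c["anti"] / total
        # Binary entropy in bits: 0 when one stance dominates, 1 at a 50/50 split.
        entropy = 0.0
        for share in (anti_share, 1 - anti_share):
            if share > 0:
                entropy -= share * math.log2(share)
        summary[key] = {"n": total, "anti_share": anti_share, "entropy": entropy}
    return summary


# Toy labeled comments, for illustration only.
comments = [
    ("2021-01", "science", "pro"),
    ("2021-01", "science", "anti"),
    ("2021-01", "legacy", "pro"),
]
summary = stance_balance(comments)
```

Since these aggregates are computed only after labeling, they cannot feed back into the classifier, which is consistent with the circularity assessment above.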
Venue (from the paper's running header): WebSci ’26, May 26–29, 2026, Braunschweig, Germany.