RELIANCE: Curating and Evaluating Reproductive Health Information on Social Media

Alex Peahl; Alice Chi; Elizabeth Bondi-Kelly; Laura Peyton Ellis; Vaibhav Balloli; Vishala Mishra

arxiv: 2606.18285 · v1 · pith:JWXDCQZAnew · submitted 2026-06-10 · 💻 cs.SI · cs.CY

RELIANCE: Curating and Evaluating Reproductive Health Information on Social Media

Vaibhav Balloli , Laura Peyton Ellis , Vishala Mishra , Alice Chi , Alex Peahl , Elizabeth Bondi-Kelly This is my paper

Pith reviewed 2026-06-27 07:54 UTC · model grok-4.3

classification 💻 cs.SI cs.CY

keywords reproductive healthTikToksocial mediaLLM evaluationfact-checkingdatasetpregnancypostpartum

0 comments

The pith

A clinician-annotated dataset of 336 TikTok videos finds nearly 60 percent of the reproductive health information accurate and exposes a 15 percent gap in how LLMs judge full videos versus individual claims.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RELIANCE as an expert-labeled collection of sentences drawn from TikTok videos answering pregnancy and postpartum questions. Three clinicians reviewed the material and judged nearly 60 percent of the health information accurate. The same annotations were used to test LLMs, revealing that performance drops when models must assess an entire video rather than isolated claims. The authors position the dataset as a resource both for measuring the current information landscape and for improving automated fact-checking tools in a high-stakes domain.

Core claim

RELIANCE supplies 409 annotated sentences from 336 videos that answer 56 clinician-reviewed queries on pregnancy and postpartum topics; three experts in Obstetrics, Gynecology, and Internal Medicine labeled each sentence for accuracy, producing the finding that nearly 60 percent of the sampled health information is accurate and that LLM evaluators exhibit a 15 percent performance gap between judging specific claims and judging complete video content.

What carries the argument

The RELIANCE dataset, built through clinician review of TikTok video transcripts and full content for accuracy on reproductive health queries.

If this is right

Platforms and health organizations can treat the 60 percent accuracy rate as a baseline for the amount of reliable versus unreliable reproductive health content currently circulating.
LLM-based fact-checking systems must be tested on full video context rather than isolated claims if they are to be deployed in reproductive health.
The annotation protocol and dataset release provide a reusable template for creating similar ground-truth collections on other social platforms or in other health domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the same clinician-review process to short-form video on additional platforms would allow direct comparison of accuracy rates across services.
The observed gap between claim-level and video-level evaluation suggests that training data for LLMs should include more examples of holistic content assessment.
Health professionals could use the dataset to identify the most common categories of inaccurate claims and target public-education efforts accordingly.

Load-bearing premise

The 56 queries and 336 videos chosen for annotation are representative of the wider reproductive health content on TikTok, and the three clinicians' labels supply a reliable ground truth.

What would settle it

A larger-scale annotation effort using thousands of additional videos drawn from a wider query pool that yields a materially different accuracy rate or larger inter-clinician disagreement would undermine the reported 60 percent figure and the 15 percent LLM gap.

Figures

Figures reproduced from arXiv: 2606.18285 by Alex Peahl, Alice Chi, Elizabeth Bondi-Kelly, Laura Peyton Ellis, Vaibhav Balloli, Vishala Mishra.

**Figure 1.** Figure 1: Overview of RELIANCE. We collect a dataset of health information from TikTok to support real-world queries faced during pregnancy. We analyze this data and evaluate LLMs’ capabilities in detecting inaccurate and harmful information. Abstract Social media platforms like TikTok have become a key source of health information, with studies reporting inaccuracies in posts. As Large Language Model (LLM) provider… view at source ↗

**Figure 2.** Figure 2: Illustration of how inaccurate information can be [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Annotation process and example illustration of a row in our [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Dataset analysis of distribution of annotations by (a) Stages, (b) Categories, and (c) correlation of labels with engagement [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Performance metrics (accuracy, precision, recall) and calibration metrics (ECE and Brier Score) on detecting inaccurate [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 7.** Figure 7: Comparing all metrics against the scale of models [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Screenshot of the application to collect data from [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 9.** Figure 9: Combined figure showing Image 1 on the left, and [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗

**Figure 10.** Figure 10: Figure 5a as bar plots Annotation-Level Video-Level 0 20 40 60 80 100 % Accuracy Annotation-Level Video-Level 0 20 40 60 80 100 120 % Precision Annotation-Level Video-Level 0 20 40 60 80 100 % Recall Annotation-Level Video-Level 0 10 20 30 40 50 60 70 % ECE Annotation-Level Video-Level 0 10 20 30 40 50 60 70 % Brier Model GPT-4o-mini Gemini-2.5-Flash Qwen3-14B MedGemma [PITH_FULL_IMAGE:figures/full_fig_p… view at source ↗

**Figure 11.** Figure 11: Figure 5b as bar plots domain is weighed ([0,1)) with the relevance score of the document. Some of the top occurring categories of websites are health information blogs (pnmag.com,healthline.com), Wikipedia, news organizations, and businesses selling products. Surprisingly, exact matches from reputed sources like the Mayo Clinic and ACOG have little presence. C.2 Qualitative Analysis [PITH_FULL_IMAGE:fi… view at source ↗

**Figure 12.** Figure 12: Word cloud of bigrams extracted from the reasons. [PITH_FULL_IMAGE:figures/full_fig_p014_12.png] view at source ↗

**Figure 13.** Figure 13: Digital platforms promoting various kinds of information seeking behaviors [PITH_FULL_IMAGE:figures/full_fig_p015_13.png] view at source ↗

read the original abstract

Social media platforms like TikTok have become a key source of health information, with studies reporting inaccuracies in posts. As Large Language Model (LLM) providers increasingly integrate LLMs into digital platforms to fact-check content (e.g., Grok and Perplexity on X and WhatsApp, respectively) and are being used by people to fact-check information, deploying these systems in critical areas such as reproductive health without rigorous evaluation can cause serious harm. We introduce RELIANCE, an expert-annotated dataset of health information on TikTok surrounding pregnancy and postpartum queries, serving as both an analysis of the reproductive health information landscape and an evaluation of LLMs' capabilities in fact-checking this content. Our dataset comprises 409 annotated sentences from 336 videos across 56 clinician-reviewed queries, annotated by three expert clinicians in Obstetrics, Gynecology, and Internal Medicine. Our findings reveal that nearly 60\% of the health information in the videos we sampled is accurate. Furthermore, LLM evaluations reveal a gap between evaluating specific claims and evaluating the entire content (15\%). We believe that our methodology, dataset, and tool will support the machine learning community in improving LLMs for important domains with real-world data, extending to other platforms and languages, and helping the health community further understand the information landscape on social media. Our dataset and code are made available at https://realize-lab.github.io/RELIANCE/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RELIANCE releases a usable annotated dataset on TikTok health content, but sampling details are missing so the accuracy number is hard to generalize.

read the letter

The main takeaway here is a new public dataset of annotated TikTok content on reproductive health, with some LLM testing on top.

They collected 56 clinician-reviewed queries, sampled 336 videos, and had three clinicians label 409 sentences. The accuracy comes in at nearly 60 percent for the sampled videos. They also report that LLMs do better or worse depending on whether they judge isolated claims or the full video, with a 15 percent gap.

Releasing the data and code is the strongest part. It gives others a concrete set to work with in a domain where real examples matter.

The sampling process is the soft spot. The queries are described only as clinician-reviewed, with no account of the starting pool or selection rules. How the videos were chosen for each query is also left out. That makes it hard to treat the 60 percent as more than a snapshot of their particular draw. The paper frames this as an analysis of the landscape, but the numbers stay conditional on the sample.

No numbers on agreement between the annotators show up in the abstract either.

This is worth a look for anyone building tools to check health claims on social media or testing LLMs in sensitive areas. The dataset could see use even if the broader claims need more support.

I would send it through peer review. The data contribution is real, and the reviewers can ask for the missing details on how the videos and queries were picked.

Referee Report

3 major / 2 minor

Summary. The paper introduces RELIANCE, an expert-annotated dataset of 409 sentences from 336 TikTok videos across 56 clinician-reviewed pregnancy and postpartum queries, annotated by three clinicians in Obstetrics, Gynecology, and Internal Medicine. It reports that nearly 60% of the sampled health information is accurate and identifies a 15% performance gap for LLMs between evaluating specific claims versus entire video content. The work positions the dataset as both an analysis of the reproductive health information landscape on social media and a benchmark for LLM fact-checking, with code and data released.

Significance. If the sampling supports generalization, the dataset offers a concrete, expert-annotated resource for evaluating LLMs on reproductive health content from TikTok, addressing a timely need as LLMs are integrated into fact-checking tools. The open release of the dataset and code is a clear strength that enables reproducibility and extension to other platforms or languages.

major comments (3)

[Methods (query curation)] Methods section on query curation: the 56 clinician-reviewed queries are described only as 'clinician-reviewed' with no account of the initial query pool, selection criteria, stratification by search volume or popularity, or how they represent typical user searches; this directly undermines the claim that the 60% accuracy figure characterizes the broader reproductive health information landscape on TikTok.
[Methods (video sampling)] Methods section on video sampling: the procedure for selecting the 336 videos per query (top-N results, random sample, date range, or other) is unspecified, so the accuracy statistic and the LLM evaluation's relevance to real-world use rest on an uncharacterized convenience sample.
[Results (LLM evaluation)] Results (LLM evaluation) and annotation protocol: the 15% gap between claim-level and content-level evaluation is reported without details on how 'entire content' evaluation was operationalized by clinicians, inter-annotator agreement statistics, or exclusion criteria, making it impossible to verify that the gap is reliably measured and load-bearing for the LLM assessment claim.

minor comments (2)

[Abstract] Abstract and §1: the sentence count (409) and video count (336) are stated without an explicit mapping or table showing how many sentences per video were annotated on average.
[Discussion] The paper should add a limitations subsection clarifying the scope of the TikTok-only, English-language, pregnancy/postpartum focus.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which highlight important areas for improving the clarity and rigor of our methods and results sections. We address each major comment below and will make corresponding revisions to the manuscript.

read point-by-point responses

Referee: Methods section on query curation: the 56 clinician-reviewed queries are described only as 'clinician-reviewed' with no account of the initial query pool, selection criteria, stratification by search volume or popularity, or how they represent typical user searches; this directly undermines the claim that the 60% accuracy figure characterizes the broader reproductive health information landscape on TikTok.

Authors: We agree that the current description is insufficient and that the 60% accuracy statistic should not be presented as characterizing the broader landscape without qualification. In the revised manuscript we will expand the Methods section to describe the initial query pool (common pregnancy/postpartum search terms drawn from public health resources), the clinician review process used to finalize the 56 queries, and any stratification or popularity considerations applied. We will also revise the abstract and results to explicitly limit the scope of the accuracy claim to the sampled queries rather than implying broad generalization. revision: yes
Referee: Methods section on video sampling: the procedure for selecting the 336 videos per query (top-N results, random sample, date range, or other) is unspecified, so the accuracy statistic and the LLM evaluation's relevance to real-world use rest on an uncharacterized convenience sample.

Authors: We acknowledge that the video sampling procedure must be fully specified. In the revision we will add a detailed description of how videos were retrieved for each query (including search parameters, ranking method, date range, and number of videos collected per query) so that readers can assess the nature of the sample. We will also note any limitations regarding representativeness. revision: yes
Referee: Results (LLM evaluation) and annotation protocol: the 15% gap between claim-level and content-level evaluation is reported without details on how 'entire content' evaluation was operationalized by clinicians, inter-annotator agreement statistics, or exclusion criteria, making it impossible to verify that the gap is reliably measured and load-bearing for the LLM assessment claim.

Authors: We agree that the annotation protocol for entire-content evaluation requires additional detail. In the revised manuscript we will expand the relevant section to describe how clinicians operationalized full-video assessment, report inter-annotator agreement statistics (or explain why they were not computed), and list any exclusion criteria applied. This will allow readers to evaluate the reliability of the reported 15% gap. revision: yes

Circularity Check

0 steps flagged

Empirical dataset curation and evaluation with no derivations or self-referential loops

full rationale

The paper performs data collection from TikTok, clinician annotation of 409 sentences across 336 videos, direct percentage calculation of accuracy, and separate LLM evaluation experiments. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text. The 60% accuracy figure is a straightforward aggregate of the three-clinician annotations on the sampled content, not a reduction of any prior result or definition. Representativeness concerns affect external validity but do not constitute circularity under the specified patterns, as there are no self-definitional steps, fitted-input predictions, or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the paper is an empirical dataset curation and evaluation study.

pith-pipeline@v0.9.1-grok · 5798 in / 1232 out tokens · 20746 ms · 2026-06-27T07:54:57.182767+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

61 extracted references · 24 canonical work pages · 9 internal anchors

[1]

B. L. Aaron, K. E. Neff, J. Wu, F. Cai, J. J. Swartz, and L. P. Burns. Labor induction in the age of tiktok: what are influencers teaching patients about oxytocin infusion? American journal of obstetrics & gynecology MFM, 5(11), 2023

2023
[2]

M. L. Antheunis, K. Tates, and T. E. Nieboer. Patients’ and health professionals’ use of social media in health care: motives, barriers and expectations.Patient education and counseling, 92(3):426–431, 2013

2013
[3]

Antoniak, A

M. Antoniak, A. Naik, C. S. Alvarado, L. L. Wang, and I. Y. Chen. Nlp for maternal healthcare: Perspectives and guiding principles in the age of llms. InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 1446–1463, 2024

2024
[4]

R. K. Arora, J. Wei, R. S. Hicks, P. Bowman, J. Quiñonero-Candela, F. Tsim- pourlas, M. Sharman, M. Shah, A. Vallone, A. Beutel, J. Heidecke, and K. Sing- hal. Healthbench: Evaluating large language models towards improved human health. https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/ healthbench_paper.pdf#page=1.35, 2025. URL https://cdn....

2025
[5]

Auxier and M

B. Auxier and M. Anderson. Social media use in 2021. https://www.pewresearch. org/internet/2021/04/07/social-media-use-in-2021/, 2021. URL https://www. pewresearch.org/internet/2021/04/07/social-media-use-in-2021/

2021
[6]

Balloli, J

V. Balloli, J. Erickson, X. Li, E. MacMurray van Liemt, A. Friedman Peahl, and E. Bondi-Kelly. ”Where is this coming from?” Uncovering Trustworthiness Ideals in AI-powered Peripartum Information Seeking. InProceedings of the 2026 ACM Conference on Fairness, Accountability, and Transparency, 2026. doi: 10.1145/3805689.3812277

work page doi:10.1145/3805689.3812277 2026
[7]

A. M. Buck and D. F. Ralston. I didn’t sign up for your research study: The ethics of using “public” data.Computers and Composition, 61:102655, 2021

2021
[8]

Budak, B

C. Budak, B. Nyhan, D. M. Rothschild, E. Thorson, and D. J. Watts. Misunder- standing the harms of online misinformation.Nature, 630(8015):45–53, 2024

2024
[9]

Chen and K

C. Chen and K. Shu. Can llm-generated misinformation be detected?arXiv preprint arXiv:2309.13788, 2023

work page arXiv 2023
[10]

J. Chen, Y. Wang, et al. Social media use for health purposes: systematic review. Journal of medical Internet research, 23(5):e17917, 2021

2021
[11]

De Choudhury, M

M. De Choudhury, M. R. Morris, and R. W. White. Seeking and sharing health information online: comparing search engines and social media. InProceedings of the SIGCHI conference on human factors in computing systems, pages 1365–1376, 2014

2014
[12]

R. Dorn, L. Kezar, F. Morstatter, and K. Lerman. Harmful speech detection by language models exhibits gender-queer dialect bias. InProceedings of the 4th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, pages 1–12, 2024

2024
[13]

A. G. Dunn, K. D. Mandl, and E. Coiera. Social media interventions for precision public health: promises and risks.NPJ digital medicine, 1(1):47, 2018

2018
[14]

K. Eddy. https://www.pewresearch.org/short-reads/2024/12/20/8-facts-about- americans-and-tiktok/, 2024. URL https://www.pewresearch.org/short-reads/ 2024/12/20/8-facts-about-americans-and-tiktok/

2024
[15]

Eysenbach, J

G. Eysenbach, J. Powell, M. Englesakis, C. Rizo, and A. Stern. Health related virtual communities and electronic support groups: systematic review of the effects of online peer to peer interactions.Bmj, 328(7449):1166, 2004

2004
[16]

Farahbakhsh, A

R. Farahbakhsh, A. Cuevas, and N. Crespi. Characterization of cross-posting activity for professional users across major osns. InProceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015, pages 645–650, 2015

2015
[17]

for Disease Control and P

C. for Disease Control and P. (CDC). Hear her campaign. https://www.cdc.gov/ hearher/index.html, 2024

2024
[18]

Gottfried and E

J. Gottfried and E. Shearer. News use across social media platforms 2016.Pew Research Center, 2016

2016
[19]

Measuring Massive Multitask Language Understanding

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Stein- hardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[20]

D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

2021
[21]

G. J. Johnson and P. J. Ambrose. Neo-tribes: The power and potential of online communities in health care.Communications of the ACM, 49(1):107–113, 2006

2006
[22]

Language Models (Mostly) Know What They Know

S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, et al. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

Racial disparities in maternal and infant health: Current status and key issues

Kaiser Family Foundation. Racial disparities in maternal and infant health: Current status and key issues. https://www.kff.org/racial-equity-and-health- policy/racial-disparities-in-maternal-and-infant-health-current-status-and- key-issues/, 2025

2025
[24]

Kanchan and A

S. Kanchan and A. Gaidhane. Social media role and its impact on public health: a narrative review.Cureus, 15(1), 2023

2023
[25]

S. S. Khan, N. A. Cameron, and K. J. Lindley. Pregnancy as an early cardiovascular moment: peripartum cardiovascular health.Circulation research, 132(12):1584– 1606, 2023

2023
[26]

M. H. Kwon, S. W. Kwon, R. K. Das, and B. C. Drolet. Obstetric and gynecologic care in tiktok: Top influencers and posts.Reproductive Sciences, 30(10):2889–2892, 2023

2023
[27]

Holistic Evaluation of Language Models

P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[28]

J. Liu, S. Min, L. Zettlemoyer, Y. Choi, and H. Hajishirzi. Infini-gram: Scal- ing unbounded n-gram language models to a trillion tokens.arXiv preprint arXiv:2401.17377, 2024

work page arXiv 2024
[29]

J. Liu, T. Blanton, Y. Elazar, S. Min, Y. Chen, A. Chheda-Kothary, H. Tran, B. Bischoff, E. Marsh, M. Schmitz, et al. Olmotrace: Tracing language model outputs back to trillions of training tokens.arXiv preprint arXiv:2504.07096, 2025

work page arXiv 2025
[30]

A. . M. Llama Team. The llama 3 herd of models, 2024. URL https://arxiv.org/ abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Mannhardt, E

N. Mannhardt, E. Bondi-Kelly, B. Lam, H. Mozannar, C. O’Connell, M. Asiedu, A. Buendia, T. Urman, I. B. Riaz, C. E. Ricciardi, et al. Impact of large language model assistance on patients reading clinical notes: a mixed-methods study.arXiv preprint arXiv:2401.09637, 2024

work page arXiv 2024
[32]

Markov, C

T. Markov, C. Zhang, S. Agarwal, F. E. Nekoul, T. Lee, S. Adler, A. Jiang, and L. Weng. A holistic approach to undesired content detection in the real world. InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15009–15018, 2023

2023
[33]

arXiv preprint arXiv:2305.14552 , year=

N. McKenna, T. Li, L. Cheng, M. J. Hosseini, M. Johnson, and M. Steedman. Sources of hallucination by large language models on inference tasks.arXiv preprint arXiv:2305.14552, 2023

work page arXiv 2023
[34]

More speech and fewer mistakes

Meta. More speech and fewer mistakes. https://about.fb.com/news/2025/01/meta- more-speech-fewer-mistakes/, Jan. 2025

2025
[35]

S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. W. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation.arXiv preprint arXiv:2305.14251, 2023

work page arXiv 2023
[36]

Montero, J

A. Montero, J. Montalvo III, A. Kearney, I. Valdes, A. Kirzinger, and L. Hamel. Kff tracking poll on health information and trust: Use of ai for health information and advice.KFF Poll, March, pages 435–444, 2026

2026
[37]

National Academies of Sciences, Engineering, and Medicine and others. Overview of research gaps for selected conditions in women’s health research at the national institutes of health: Proceedings of a workshop—in brief.National Academies of Sciences, Engineering, and Medicine; Health and Medicine Division, 2024

2024
[38]

Q. C. Nguyen, E. M. Aparicio, M. Jasczynski, A. C. Doig, X. Yue, H. Mane, N. Srikanth, F. X. M. Gutierrez, N. Delcid, X. He, et al. Rosie, a health edu- cation question-and-answer chatbot for new mothers: randomized pilot study. JMIR Formative Research, 8(1):e51361, 2024

2024
[39]

A. C. of Obstetricians and G. (ACOG). Prenatal care. https://www.acog.org/ programs/redesigning-prenatal-care-initiative, 2020

2020
[40]

URL https://platform.openai.com/docs/models/omni-moderation- latest

OpenAI, 2025. URL https://platform.openai.com/docs/models/omni-moderation- latest

2025
[41]

J. C. Penny-Dimri, M. Bachmann, W. R. Cooke, S. Mathewlynn, S. Dockree, J. Tolladay, J. Kossen, L. Li, Y. Gal, and G. D. Jones. Reducing large language model safety risks in women’s health using semantic entropy.arXiv preprint arXiv:2503.00269, 2025

work page arXiv 2025
[42]

Automatic Detection of Fake News

V. Pérez-Rosas, B. Kleinberg, A. Lefevre, and R. Mihalcea. Automatic detection of fake news.arXiv preprint arXiv:1708.07104, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[43]

Ramjee, M

P. Ramjee, M. Chhokar, B. Sachdeva, M. Meena, H. Abdullah, A. Vashistha, R. Na- gar, and M. Jain. Ashabot: An llm-powered chatbot to support the informational needs of community health workers.arXiv preprint arXiv:2409.10913, 2024

work page arXiv 2024
[44]

M. Rao. Medfluencing: A new role for physicians in the age of tiktok.Georgetown Medical Review, 6(1), 2022

2022
[45]

Renault, M

T. Renault, M. Mosleh, and D. G. Rand. @grok is this true? llm-powered fact- checking on social media, Jan 2026. URL osf.io/preprints/psyarxiv/85quw_v2

2026
[46]

K. Saab, T. Tu, W.-H. Weng, R. Tanno, D. Stutz, E. Wulczyn, F. Zhang, T. Strother, C. Park, E. Vedadi, et al. Capabilities of gemini models in medicine.arXiv preprint arXiv:2404.18416, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

A. M. Smith, A. S. Mucedola, and A. Ausness-Ayres. My Video, My Choice: A Quantitative Content Analysis of U.S. OBGYN TikTok Videos in the Post-RoeEra. Women’s Reproductive Health, pages 1–25, Aug. 2024. ISSN 2329-3691, 2329-3713. doi: 10.1080/23293691.2024.2377969. URL https://www.tandfonline.com/doi/full/ 10.1080/23293691.2024.2377969

work page doi:10.1080/23293691.2024.2377969 2024
[48]

Srikanth, R

N. Srikanth, R. Sarkar, H. Mane, E. M. Aparicio, Q. C. Nguyen, R. Rudinger, and J. Boyd-Graber. Pregnant questions: The importance of pragmatic awareness in maternal health question answering.arXiv preprint arXiv:2311.09542, 2023

work page arXiv 2023
[49]

Stein, Y

K. Stein, Y. Yao, and T. Aitamurto. Examining Communicative Forms in #TikTok- Docs’ Sexual Health Videos.International Journal of Communication, 16(0):23, Feb
[50]

URL https://ijoc.org/index.php/ijoc/article/view/18175

ISSN 1932-8036. URL https://ijoc.org/index.php/ijoc/article/view/18175. Number: 0. KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea Vaibhav Balloli et al

1932
[51]

Stephenson, C

S. Stephenson, C. N. Page, M. Wei, A. Kapadia, and F. Roesner. Sharenting on tiktok: Exploring parental sharing behaviors and the discourse around children’s online privacy. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–17, 2024

2024
[52]

Suarez-Lledo and J

V. Suarez-Lledo and J. Alvarez-Galvez. Prevalence of health misinformation on social media: systematic review.Journal of medical Internet research, 23(1):e17187, 2021

2021
[53]

Y. Sun, J. He, S. Lei, L. Cui, and C.-T. Lu. Med-mmhl: A multi-modal dataset for detecting human-and llm-generated misinformation in the medical domain. arXiv preprint arXiv:2306.08871, 2023

work page arXiv 2023
[54]

J.-J. Tian, H. Yu, Y. Orlovskiy, T. Vergho, M. Rivera, M. Goel, Z. Yang, J.-F. Godbout, R. Rabbany, and K. Pelrine. Web retrieval agents for evidence-based misinforma- tion detection.arXiv preprint arXiv:2409.00009, 2024

work page arXiv 2024
[55]

Völske, P

M. Völske, P. Braslavski, M. Hagen, G. Lezina, and B. Stein. What Users Ask a Search Engine: Analyzing One Billion Russian Question Queries. InProceed- ings of the 24th ACM International on Conference on Information and Knowledge Management, pages 1571–1580, Melbourne Australia, Oct. 2015. ACM. ISBN 978-1-4503-3794-6. doi: 10.1145/2806416.2806457. URL htt...

work page doi:10.1145/2806416.2806457 2015
[56]

L. Wu, F. Morstatter, K. M. Carley, and H. Liu. Misinformation in social media: definition, manipulation, and detection.ACM SIGKDD explorations newsletter, 21 (2):80–90, 2019

2019
[57]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[58]

On Verbalized Confidence Scores for LLMs

D. Yang, Y.-H. H. Tsai, and M. Yamada. On verbalized confidence scores for llms. arXiv preprint arXiv:2412.14737, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

W. Zeng, Y. Liu, R. Mullins, L. Peran, J. Fernandez, H. Harkous, K. Narasimhan, D. Proud, P. Kumar, B. Radharapu, et al. Shieldgemma: Generative ai content moderation based on gemma.arXiv preprint arXiv:2407.21772, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[60]

but the data is already public

M. Zimmer. “but the data is already public”: on the ethics of research in facebook. InThe ethics of information technologies, pages 229–241. Routledge, 2020

2020
[61]

Oligohydramnios.. diagnosed with < 5 cm

G. Zuccon, B. Koopman, and J. Palotti. Diagnose this if you can: On the effective- ness of search engines in finding medical self-diagnosis information. InAdvances in Information Retrieval: 37th European Conference on IR Research, ECIR 2015, Vienna, Austria, March 29-April 2, 2015. Proceedings 37, pages 562–567. Springer, 2015. A Design choices and artifa...

work page arXiv 2015

[1] [1]

B. L. Aaron, K. E. Neff, J. Wu, F. Cai, J. J. Swartz, and L. P. Burns. Labor induction in the age of tiktok: what are influencers teaching patients about oxytocin infusion? American journal of obstetrics & gynecology MFM, 5(11), 2023

2023

[2] [2]

M. L. Antheunis, K. Tates, and T. E. Nieboer. Patients’ and health professionals’ use of social media in health care: motives, barriers and expectations.Patient education and counseling, 92(3):426–431, 2013

2013

[3] [3]

Antoniak, A

M. Antoniak, A. Naik, C. S. Alvarado, L. L. Wang, and I. Y. Chen. Nlp for maternal healthcare: Perspectives and guiding principles in the age of llms. InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 1446–1463, 2024

2024

[4] [4]

R. K. Arora, J. Wei, R. S. Hicks, P. Bowman, J. Quiñonero-Candela, F. Tsim- pourlas, M. Sharman, M. Shah, A. Vallone, A. Beutel, J. Heidecke, and K. Sing- hal. Healthbench: Evaluating large language models towards improved human health. https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/ healthbench_paper.pdf#page=1.35, 2025. URL https://cdn....

2025

[5] [5]

Auxier and M

B. Auxier and M. Anderson. Social media use in 2021. https://www.pewresearch. org/internet/2021/04/07/social-media-use-in-2021/, 2021. URL https://www. pewresearch.org/internet/2021/04/07/social-media-use-in-2021/

2021

[6] [6]

Balloli, J

V. Balloli, J. Erickson, X. Li, E. MacMurray van Liemt, A. Friedman Peahl, and E. Bondi-Kelly. ”Where is this coming from?” Uncovering Trustworthiness Ideals in AI-powered Peripartum Information Seeking. InProceedings of the 2026 ACM Conference on Fairness, Accountability, and Transparency, 2026. doi: 10.1145/3805689.3812277

work page doi:10.1145/3805689.3812277 2026

[7] [7]

A. M. Buck and D. F. Ralston. I didn’t sign up for your research study: The ethics of using “public” data.Computers and Composition, 61:102655, 2021

2021

[8] [8]

Budak, B

C. Budak, B. Nyhan, D. M. Rothschild, E. Thorson, and D. J. Watts. Misunder- standing the harms of online misinformation.Nature, 630(8015):45–53, 2024

2024

[9] [9]

Chen and K

C. Chen and K. Shu. Can llm-generated misinformation be detected?arXiv preprint arXiv:2309.13788, 2023

work page arXiv 2023

[10] [10]

J. Chen, Y. Wang, et al. Social media use for health purposes: systematic review. Journal of medical Internet research, 23(5):e17917, 2021

2021

[11] [11]

De Choudhury, M

M. De Choudhury, M. R. Morris, and R. W. White. Seeking and sharing health information online: comparing search engines and social media. InProceedings of the SIGCHI conference on human factors in computing systems, pages 1365–1376, 2014

2014

[12] [12]

R. Dorn, L. Kezar, F. Morstatter, and K. Lerman. Harmful speech detection by language models exhibits gender-queer dialect bias. InProceedings of the 4th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, pages 1–12, 2024

2024

[13] [13]

A. G. Dunn, K. D. Mandl, and E. Coiera. Social media interventions for precision public health: promises and risks.NPJ digital medicine, 1(1):47, 2018

2018

[14] [14]

K. Eddy. https://www.pewresearch.org/short-reads/2024/12/20/8-facts-about- americans-and-tiktok/, 2024. URL https://www.pewresearch.org/short-reads/ 2024/12/20/8-facts-about-americans-and-tiktok/

2024

[15] [15]

Eysenbach, J

G. Eysenbach, J. Powell, M. Englesakis, C. Rizo, and A. Stern. Health related virtual communities and electronic support groups: systematic review of the effects of online peer to peer interactions.Bmj, 328(7449):1166, 2004

2004

[16] [16]

Farahbakhsh, A

R. Farahbakhsh, A. Cuevas, and N. Crespi. Characterization of cross-posting activity for professional users across major osns. InProceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015, pages 645–650, 2015

2015

[17] [17]

for Disease Control and P

C. for Disease Control and P. (CDC). Hear her campaign. https://www.cdc.gov/ hearher/index.html, 2024

2024

[18] [18]

Gottfried and E

J. Gottfried and E. Shearer. News use across social media platforms 2016.Pew Research Center, 2016

2016

[19] [19]

Measuring Massive Multitask Language Understanding

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Stein- hardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009

[20] [20]

D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

2021

[21] [21]

G. J. Johnson and P. J. Ambrose. Neo-tribes: The power and potential of online communities in health care.Communications of the ACM, 49(1):107–113, 2006

2006

[22] [22]

Language Models (Mostly) Know What They Know

S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, et al. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[23] [23]

Racial disparities in maternal and infant health: Current status and key issues

Kaiser Family Foundation. Racial disparities in maternal and infant health: Current status and key issues. https://www.kff.org/racial-equity-and-health- policy/racial-disparities-in-maternal-and-infant-health-current-status-and- key-issues/, 2025

2025

[24] [24]

Kanchan and A

S. Kanchan and A. Gaidhane. Social media role and its impact on public health: a narrative review.Cureus, 15(1), 2023

2023

[25] [25]

S. S. Khan, N. A. Cameron, and K. J. Lindley. Pregnancy as an early cardiovascular moment: peripartum cardiovascular health.Circulation research, 132(12):1584– 1606, 2023

2023

[26] [26]

M. H. Kwon, S. W. Kwon, R. K. Das, and B. C. Drolet. Obstetric and gynecologic care in tiktok: Top influencers and posts.Reproductive Sciences, 30(10):2889–2892, 2023

2023

[27] [27]

Holistic Evaluation of Language Models

P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[28] [28]

J. Liu, S. Min, L. Zettlemoyer, Y. Choi, and H. Hajishirzi. Infini-gram: Scal- ing unbounded n-gram language models to a trillion tokens.arXiv preprint arXiv:2401.17377, 2024

work page arXiv 2024

[29] [29]

J. Liu, T. Blanton, Y. Elazar, S. Min, Y. Chen, A. Chheda-Kothary, H. Tran, B. Bischoff, E. Marsh, M. Schmitz, et al. Olmotrace: Tracing language model outputs back to trillions of training tokens.arXiv preprint arXiv:2504.07096, 2025

work page arXiv 2025

[30] [30]

A. . M. Llama Team. The llama 3 herd of models, 2024. URL https://arxiv.org/ abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [31]

Mannhardt, E

N. Mannhardt, E. Bondi-Kelly, B. Lam, H. Mozannar, C. O’Connell, M. Asiedu, A. Buendia, T. Urman, I. B. Riaz, C. E. Ricciardi, et al. Impact of large language model assistance on patients reading clinical notes: a mixed-methods study.arXiv preprint arXiv:2401.09637, 2024

work page arXiv 2024

[32] [32]

Markov, C

T. Markov, C. Zhang, S. Agarwal, F. E. Nekoul, T. Lee, S. Adler, A. Jiang, and L. Weng. A holistic approach to undesired content detection in the real world. InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15009–15018, 2023

2023

[33] [33]

arXiv preprint arXiv:2305.14552 , year=

N. McKenna, T. Li, L. Cheng, M. J. Hosseini, M. Johnson, and M. Steedman. Sources of hallucination by large language models on inference tasks.arXiv preprint arXiv:2305.14552, 2023

work page arXiv 2023

[34] [34]

More speech and fewer mistakes

Meta. More speech and fewer mistakes. https://about.fb.com/news/2025/01/meta- more-speech-fewer-mistakes/, Jan. 2025

2025

[35] [35]

S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. W. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation.arXiv preprint arXiv:2305.14251, 2023

work page arXiv 2023

[36] [36]

Montero, J

A. Montero, J. Montalvo III, A. Kearney, I. Valdes, A. Kirzinger, and L. Hamel. Kff tracking poll on health information and trust: Use of ai for health information and advice.KFF Poll, March, pages 435–444, 2026

2026

[37] [37]

National Academies of Sciences, Engineering, and Medicine and others. Overview of research gaps for selected conditions in women’s health research at the national institutes of health: Proceedings of a workshop—in brief.National Academies of Sciences, Engineering, and Medicine; Health and Medicine Division, 2024

2024

[38] [38]

Q. C. Nguyen, E. M. Aparicio, M. Jasczynski, A. C. Doig, X. Yue, H. Mane, N. Srikanth, F. X. M. Gutierrez, N. Delcid, X. He, et al. Rosie, a health edu- cation question-and-answer chatbot for new mothers: randomized pilot study. JMIR Formative Research, 8(1):e51361, 2024

2024

[39] [39]

A. C. of Obstetricians and G. (ACOG). Prenatal care. https://www.acog.org/ programs/redesigning-prenatal-care-initiative, 2020

2020

[40] [40]

URL https://platform.openai.com/docs/models/omni-moderation- latest

OpenAI, 2025. URL https://platform.openai.com/docs/models/omni-moderation- latest

2025

[41] [41]

J. C. Penny-Dimri, M. Bachmann, W. R. Cooke, S. Mathewlynn, S. Dockree, J. Tolladay, J. Kossen, L. Li, Y. Gal, and G. D. Jones. Reducing large language model safety risks in women’s health using semantic entropy.arXiv preprint arXiv:2503.00269, 2025

work page arXiv 2025

[42] [42]

Automatic Detection of Fake News

V. Pérez-Rosas, B. Kleinberg, A. Lefevre, and R. Mihalcea. Automatic detection of fake news.arXiv preprint arXiv:1708.07104, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[43] [43]

Ramjee, M

P. Ramjee, M. Chhokar, B. Sachdeva, M. Meena, H. Abdullah, A. Vashistha, R. Na- gar, and M. Jain. Ashabot: An llm-powered chatbot to support the informational needs of community health workers.arXiv preprint arXiv:2409.10913, 2024

work page arXiv 2024

[44] [44]

M. Rao. Medfluencing: A new role for physicians in the age of tiktok.Georgetown Medical Review, 6(1), 2022

2022

[45] [45]

Renault, M

T. Renault, M. Mosleh, and D. G. Rand. @grok is this true? llm-powered fact- checking on social media, Jan 2026. URL osf.io/preprints/psyarxiv/85quw_v2

2026

[46] [46]

K. Saab, T. Tu, W.-H. Weng, R. Tanno, D. Stutz, E. Wulczyn, F. Zhang, T. Strother, C. Park, E. Vedadi, et al. Capabilities of gemini models in medicine.arXiv preprint arXiv:2404.18416, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[47] [47]

A. M. Smith, A. S. Mucedola, and A. Ausness-Ayres. My Video, My Choice: A Quantitative Content Analysis of U.S. OBGYN TikTok Videos in the Post-RoeEra. Women’s Reproductive Health, pages 1–25, Aug. 2024. ISSN 2329-3691, 2329-3713. doi: 10.1080/23293691.2024.2377969. URL https://www.tandfonline.com/doi/full/ 10.1080/23293691.2024.2377969

work page doi:10.1080/23293691.2024.2377969 2024

[48] [48]

Srikanth, R

N. Srikanth, R. Sarkar, H. Mane, E. M. Aparicio, Q. C. Nguyen, R. Rudinger, and J. Boyd-Graber. Pregnant questions: The importance of pragmatic awareness in maternal health question answering.arXiv preprint arXiv:2311.09542, 2023

work page arXiv 2023

[49] [49]

Stein, Y

K. Stein, Y. Yao, and T. Aitamurto. Examining Communicative Forms in #TikTok- Docs’ Sexual Health Videos.International Journal of Communication, 16(0):23, Feb

[50] [50]

URL https://ijoc.org/index.php/ijoc/article/view/18175

ISSN 1932-8036. URL https://ijoc.org/index.php/ijoc/article/view/18175. Number: 0. KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea Vaibhav Balloli et al

1932

[51] [51]

Stephenson, C

S. Stephenson, C. N. Page, M. Wei, A. Kapadia, and F. Roesner. Sharenting on tiktok: Exploring parental sharing behaviors and the discourse around children’s online privacy. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–17, 2024

2024

[52] [52]

Suarez-Lledo and J

V. Suarez-Lledo and J. Alvarez-Galvez. Prevalence of health misinformation on social media: systematic review.Journal of medical Internet research, 23(1):e17187, 2021

2021

[53] [53]

Y. Sun, J. He, S. Lei, L. Cui, and C.-T. Lu. Med-mmhl: A multi-modal dataset for detecting human-and llm-generated misinformation in the medical domain. arXiv preprint arXiv:2306.08871, 2023

work page arXiv 2023

[54] [54]

J.-J. Tian, H. Yu, Y. Orlovskiy, T. Vergho, M. Rivera, M. Goel, Z. Yang, J.-F. Godbout, R. Rabbany, and K. Pelrine. Web retrieval agents for evidence-based misinforma- tion detection.arXiv preprint arXiv:2409.00009, 2024

work page arXiv 2024

[55] [55]

Völske, P

M. Völske, P. Braslavski, M. Hagen, G. Lezina, and B. Stein. What Users Ask a Search Engine: Analyzing One Billion Russian Question Queries. InProceed- ings of the 24th ACM International on Conference on Information and Knowledge Management, pages 1571–1580, Melbourne Australia, Oct. 2015. ACM. ISBN 978-1-4503-3794-6. doi: 10.1145/2806416.2806457. URL htt...

work page doi:10.1145/2806416.2806457 2015

[56] [56]

L. Wu, F. Morstatter, K. M. Carley, and H. Liu. Misinformation in social media: definition, manipulation, and detection.ACM SIGKDD explorations newsletter, 21 (2):80–90, 2019

2019

[57] [57]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[58] [58]

On Verbalized Confidence Scores for LLMs

D. Yang, Y.-H. H. Tsai, and M. Yamada. On verbalized confidence scores for llms. arXiv preprint arXiv:2412.14737, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[59] [59]

W. Zeng, Y. Liu, R. Mullins, L. Peran, J. Fernandez, H. Harkous, K. Narasimhan, D. Proud, P. Kumar, B. Radharapu, et al. Shieldgemma: Generative ai content moderation based on gemma.arXiv preprint arXiv:2407.21772, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[60] [60]

but the data is already public

M. Zimmer. “but the data is already public”: on the ethics of research in facebook. InThe ethics of information technologies, pages 229–241. Routledge, 2020

2020

[61] [61]

Oligohydramnios.. diagnosed with < 5 cm

G. Zuccon, B. Koopman, and J. Palotti. Diagnose this if you can: On the effective- ness of search engines in finding medical self-diagnosis information. InAdvances in Information Retrieval: 37th European Conference on IR Research, ECIR 2015, Vienna, Austria, March 29-April 2, 2015. Proceedings 37, pages 562–567. Springer, 2015. A Design choices and artifa...

work page arXiv 2015