pith. machine review for the scientific record.

arxiv: 2604.22002 · v1 · submitted 2026-04-23 · 💻 cs.CL

Recognition: unknown

When Cow Urine Cures Constipation on YouTube: Limits of LLMs in Detecting Culture-specific Health Misinformation

Authors on Pith · no claims yet

Pith reviewed 2026-05-09 21:17 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM limitations · health misinformation · cultural discourse analysis · YouTube transcripts · gomutra · prompt engineering · Global South · rhetorical register

The pith

LLMs fail to detect health misinformation that blends sacred Indian traditions with pseudo-scientific claims on YouTube.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines YouTube videos promoting cow urine as a cure for ailments in India. It shows that promotional content mixes religious language with scientific-sounding assertions, and that even sophisticated debunking adopts similar styles. Analysis of 30 multilingual transcripts across three LLMs with varied prompts reveals that this blending creates a rhetorical register that LLMs trained on Western data cannot reliably parse. The work argues that prompt engineering alone cannot supply the cultural competency needed for such discourse analysis.

Core claim

Using gomutra discourse on YouTube in India as a case study, a post-facto analysis of 30 multilingual transcripts shows that promotional content blends sacred traditional language with pseudo-scientific claims in ways that sophisticated debunking content itself mirrors, creating a rhetorical register that LLMs trained predominantly on Western corpora are systematically ill-equipped to analyse. Varying prompt tone across GPT-4o, Gemini 2.5 Pro, and DeepSeek-V3.1, the authors find that culturally embedded health misinformation does not resemble ordinary misinformation, and that this obfuscation extends to gendered rhetoric and prompt design.
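The experimental design the claim rests on, crossing each transcript with every model and prompt tone, can be sketched as below. The tone templates and task wording are illustrative assumptions, not the paper's actual prompts, and the transcript text is a placeholder.

```python
# Sketch of the prompt-tone variation design: every transcript is paired
# with every (model, tone) combination. Templates are invented for
# illustration; the paper's real prompts differ.

TONE_TEMPLATES = {
    "neutral": "Review the transcript below and list any intensifiers "
               "used to promote health claims about gomutra.\n\n{transcript}",
    "friendly": "Hi! You're given a YouTube transcript discussing gomutra. "
                "Could you list the intensifiers used to promote health "
                "claims?\n\n{transcript}",
    "formal": "Task: identify all intensifiers in the following transcript "
              "that amplify health claims about gomutra.\n\n{transcript}",
}

MODELS = ["gpt-4o", "gemini-2.5-pro", "deepseek-v3.1"]

def build_runs(transcripts):
    """Cross every transcript with every (model, tone) pair."""
    runs = []
    for tid, text in transcripts.items():
        for model in MODELS:
            for tone, template in TONE_TEMPLATES.items():
                runs.append({
                    "transcript_id": tid,
                    "model": model,
                    "tone": tone,
                    "prompt": template.format(transcript=text),
                })
    return runs

runs = build_runs({"vid_01": "Gomutra is absolutely the best remedy..."})
# 1 transcript x 3 models x 3 tones = 9 prompt-model pairs
print(len(runs))  # → 9
```

With the paper's 30 transcripts this grid yields 270 runs per tone set, which is the comparison surface on which the "repeated failure across models" observation is made.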

What carries the argument

The rhetorical register created when sacred traditional language blends with pseudo-scientific claims in both promotional and debunking YouTube content.

If this is right

  • Culturally embedded health misinformation on Global South platforms requires detection methods beyond those effective for standard false claims.
  • Prompt engineering cannot retrofit cultural competency into LLM-assisted discourse analysis.
  • Gendered rhetoric within the blended content further increases analytical unreliability.
  • The pattern applies to multilingual transcripts where traditional and scientific registers intersect.
  • Social media health information in regions like India will continue to evade current LLM tools without deeper changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar rhetorical blending may appear in traditional remedy discussions from other cultures, suggesting the limitation is not unique to Indian contexts.
  • Platform moderation systems could benefit from training data drawn directly from local languages and belief systems rather than translated Western examples.
  • The findings point toward developing hybrid human-AI workflows for health content review in linguistically diverse settings.
  • Future work could measure whether larger models close the gap or whether the cultural mismatch persists across scales.

Load-bearing premise

That the observed blending and LLM failures in the 30 transcripts arise primarily from Western-centric training data rather than model scale, prompt phrasing, or sample selection.

What would settle it

Testing the same three LLMs on a larger sample of culturally blended health claims from multiple non-Western regions and measuring accuracy against human annotators familiar with each culture.
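The proposed resolution hinges on measuring model output against culturally informed human annotators. A minimal sketch of that comparison, using Cohen's kappa as the chance-corrected agreement measure (the labels below are invented placeholders, not data from the paper):

```python
# Chance-corrected agreement between a model's labels and a human
# annotator's labels. Example label sequences are invented.

from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

human = ["misinfo", "misinfo", "debunk", "misinfo", "debunk", "debunk"]
model = ["misinfo", "debunk",  "debunk", "misinfo", "misinfo", "debunk"]
print(round(cohens_kappa(human, model), 3))  # → 0.333
```

A kappa near zero on culturally blended claims, against high kappa on standard misinformation, would quantify the gap the paper currently argues only qualitatively.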

Figures

Figures reproduced from arXiv: 2604.22002 by Anamta Khan, Deepti, Joyojeet Pal, Ratna Kandala, Sheza Munir.

Figure 1. An infographic visualizing the 'Digital Proliferation of Traditional Health Beliefs' through cow urine (…)
Figure 2. Comparison of traditional metaphor and scientific …
Figure 3. Comparison of the number of unique intensifiers generated across different models (Gemini, GPT4o-mini, and …)
read the original abstract

Social media platforms have become primary channels for health information in the Global South. Using gomutra (cow urine) discourse on YouTube in India as a case study, we present a post-facto Large Language Model (LLM)-assisted discourse analysis of 30 multilingual transcripts showing that promotional content blends sacred traditional language with pseudo-scientific claims in ways that sophisticated debunking content itself mirrors, creating a rhetorical register that LLMs, trained predominantly on Western corpora, are systematically ill-equipped to analyse. Varying prompt tone across three LLMs (GPT-4o, Gemini 2.5 Pro, DeepSeek-V3.1), we find that culturally embedded health misinformation does not look like ordinary misinformation, and this cultural obfuscation extends to gendered rhetoric and prompt design, compounding analytical unreliability. Our findings argue that cultural competency in LLM-assisted discourse analysis cannot be retrofitted through prompt engineering alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper conducts a post-facto LLM-assisted qualitative discourse analysis of 30 multilingual YouTube transcripts on gomutra (cow urine) as a health remedy in India. It claims that promotional content blends sacred traditional language with pseudo-scientific claims in a rhetorical register that is mirrored even in sophisticated debunking content, rendering LLMs (trained predominantly on Western corpora) systematically unable to detect the misinformation; experiments varying prompt tone across GPT-4o, Gemini 2.5 Pro, and DeepSeek-V3.1 are used to argue that cultural competency cannot be retrofitted via prompt engineering alone.

Significance. If the central observations hold, the work usefully draws attention to how non-Western rhetorical patterns in health discourse can evade standard LLM-based misinformation detection, with direct relevance to content moderation and public-health information on platforms serving the Global South. The use of multiple models and prompt variations is a modest strength, but the absence of quantitative grounding limits the strength of the causal claims.

major comments (3)
  1. [Methods] The manuscript provides no details on transcript selection criteria, search terms, or sampling frame for the 30 videos (Methods section). Without this, it is impossible to evaluate selection bias or whether the observed sacred/pseudo-scientific blending is representative rather than an artifact of the chosen sample.
  2. [Results] No quantitative metrics (accuracy, precision, recall, or confusion matrices) are reported for the LLMs' classifications of promotional versus debunking content, nor is inter-annotator agreement provided for the human identification of 'mirroring' rhetoric (Results and Analysis sections). The central claim that LLMs are 'systematically ill-equipped' therefore rests entirely on unquantified qualitative interpretation.
  3. [Discussion] The attribution of LLM failures primarily to Western-centric training data (Discussion) is not supported by any ablation: there are no comparisons against base versus instruction-tuned models, no culturally aligned comparator models, and no systematic variation of model scale or training corpus proxies. This leaves open the possibility that failures arise from task ambiguity, prompt phrasing, or model limitations instead.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly separate the descriptive finding (rhetorical mirroring) from the causal claim (Western training data as the root cause) to avoid conflating observation with explanation.
  2. [Results] A figure or table summarizing the prompt variations and model responses across the 30 transcripts would improve readability and allow readers to assess the consistency of the reported patterns.
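As a concrete sketch of the quantitative grounding the referee asks for in major comment 2, the promotional-vs-debunking comparison could be reported as a confusion matrix with precision and recall. The label sequences below are invented placeholders, not the paper's data.

```python
# Confusion-matrix counts plus precision/recall for a binary
# promotional-vs-debunking labeling task. Gold and predicted labels
# are illustrative only.

def confusion(gold, pred, positive="promotional"):
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    tn = len(gold) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}

gold = ["promotional", "debunking", "promotional", "debunking"]
pred = ["promotional", "promotional", "debunking", "debunking"]
print(confusion(gold, pred))
```

Reporting these numbers per model and per prompt tone would let readers see whether the claimed failures are uniform or concentrated in particular conditions.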

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key areas where the manuscript can be clarified and strengthened. We address each major comment point by point below, indicating the revisions we will make.

read point-by-point responses
  1. Referee: [Methods] The manuscript provides no details on transcript selection criteria, search terms, or sampling frame for the 30 videos (Methods section). Without this, it is impossible to evaluate selection bias or whether the observed sacred/pseudo-scientific blending is representative rather than an artifact of the chosen sample.

    Authors: We agree that the current Methods section lacks sufficient transparency on sampling. In the revised manuscript, we will add a new subsection specifying the search terms (including English, Hindi, and regional language variants such as 'gomutra health', 'cow urine remedy', and 'gaumutra benefits'), the sampling frame (YouTube videos uploaded between 2018 and 2024 with minimum view thresholds), and inclusion criteria (focus on health claims, availability of auto-generated or manual transcripts, and exclusion of purely devotional or non-health content). This addition will allow readers to assess representativeness and potential biases. revision: yes

  2. Referee: [Results] No quantitative metrics (accuracy, precision, recall, or confusion matrices) are reported for the LLMs' classifications of promotional versus debunking content, nor is inter-annotator agreement provided for the human identification of 'mirroring' rhetoric (Results and Analysis sections). The central claim that LLMs are 'systematically ill-equipped' therefore rests entirely on unquantified qualitative interpretation.

    Authors: The study employs a qualitative discourse-analytic approach rather than a quantitative classification benchmark; therefore, standard metrics such as accuracy or confusion matrices are not directly applicable, as they would require imposing artificial binary ground-truth labels on culturally nuanced rhetorical patterns. The identification of mirroring rhetoric emerged from iterative human review of LLM outputs across prompt variations. In revision, we will expand the Results and Analysis sections with a more explicit account of the coding process and any consistency checks performed. We will also clarify that the claim of systematic limitations is grounded in the repeated failure patterns observed across three distinct models rather than in statistical performance measures. revision: partial

  3. Referee: [Discussion] The attribution of LLM failures primarily to Western-centric training data (Discussion) is not supported by any ablation: there are no comparisons against base versus instruction-tuned models, no culturally aligned comparator models, and no systematic variation of model scale or training corpus proxies. This leaves open the possibility that failures arise from task ambiguity, prompt phrasing, or model limitations instead.

    Authors: We acknowledge that stronger causal evidence would require ablations comparing base and instruction-tuned models or culturally specific comparators. Such experiments lie beyond the scope and computational resources of the present exploratory study. In the revised Discussion, we will explicitly acknowledge this limitation, discuss alternative explanations including task ambiguity and prompt sensitivity, and moderate the strength of the training-data attribution while retaining the observation that consistent failures across GPT-4o, Gemini 2.5 Pro, and DeepSeek-V3.1 point to deeper representational gaps. We will also incorporate additional citations on documented cultural biases in LLMs to support the interpretive framing. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical discourse analysis stands on direct transcript-model comparisons

full rationale

The paper performs a qualitative post-facto analysis of 30 specific multilingual YouTube transcripts, applying varied prompt tones to three LLMs (GPT-4o, Gemini 2.5 Pro, DeepSeek-V3.1) and reporting observed misclassifications. No equations, fitted parameters, or self-referential definitions appear in the provided text. The central claim—that cultural obfuscation in sacred-plus-pseudo-scientific rhetoric defeats prompt engineering—is presented as an interpretive conclusion drawn from the concrete model outputs on the sampled content rather than presupposed by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to force the result. The derivation chain therefore remains self-contained against the external benchmark of the transcripts themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim depends on the premise that observed LLM shortcomings stem from training data origins and that the sampled discourse exemplifies a broader class of culturally embedded misinformation; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption LLMs trained predominantly on Western corpora are systematically ill-equipped to analyze culturally embedded health misinformation that blends sacred traditional language with pseudo-scientific claims.
    This premise is used to explain why prompt variations fail to improve detection and is invoked to argue that prompt engineering cannot retrofit cultural competency.

pith-pipeline@v0.9.0 · 5471 in / 1470 out tokens · 54468 ms · 2026-05-09T21:17:07.109877+00:00 · methodology

discussion (0)

