Recognition: unknown
When Cow Urine Cures Constipation on YouTube: Limits of LLMs in Detecting Culture-specific Health Misinformation
Pith reviewed 2026-05-09 21:17 UTC · model grok-4.3
The pith
LLMs fail to detect health misinformation that blends sacred Indian traditions with pseudo-scientific claims on YouTube.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using gomutra (cow urine) discourse on YouTube in India as a case study, a post-facto analysis of 30 multilingual transcripts shows that promotional content blends sacred traditional language with pseudo-scientific claims in ways that sophisticated debunking content itself mirrors, creating a rhetorical register that LLMs trained predominantly on Western corpora are systematically ill-equipped to analyse. Varying prompt tone across GPT-4o, Gemini 2.5 Pro, and DeepSeek-V3.1, the authors find that culturally embedded health misinformation does not resemble ordinary misinformation, and that this cultural obfuscation extends to gendered rhetoric and prompt design, compounding analytical unreliability.
What carries the argument
The rhetorical register created when sacred traditional language blends with pseudo-scientific claims in both promotional and debunking YouTube content.
If this is right
- Culturally embedded health misinformation on Global South platforms requires detection methods beyond those effective for standard false claims.
- Prompt engineering alone cannot retrofit cultural competency into LLM-assisted discourse analysis.
- Gendered rhetoric within the blended content further increases analytical unreliability.
- The pattern applies to multilingual transcripts where traditional and scientific registers intersect.
- Health misinformation on social media in regions like India will continue to evade current LLM-based tools without deeper changes.
Where Pith is reading between the lines
- Similar rhetorical blending may appear in traditional remedy discussions from other cultures, suggesting the limitation is not unique to Indian contexts.
- Platform moderation systems could benefit from training data drawn directly from local languages and belief systems rather than translated Western examples.
- The findings point toward developing hybrid human-AI workflows for health content review in linguistically diverse settings.
- Future work could measure whether larger models close the gap or whether the cultural mismatch persists across scales.
Load-bearing premise
That the observed blending and LLM failures in the 30 transcripts arise primarily from Western-centric training data rather than model scale, prompt phrasing, or sample selection.
What would settle it
Testing the same three LLMs on a larger sample of culturally blended health claims from multiple non-Western regions and measuring accuracy against human annotators familiar with each culture.
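One way to make that test concrete is sketched below: score each model's judgments against the consensus labels of culture-familiar annotators, broken out by region. This is a minimal sketch assuming a binary labelling task; the `query_model` helper, label names, and data layout are hypothetical, not drawn from the paper.

```python
# Minimal sketch (hypothetical): per-region accuracy of each model's
# misinformation judgments against culture-familiar human consensus labels.
from collections import defaultdict

def query_model(model: str, transcript: str) -> str:
    """Placeholder for an API call returning 'misinfo' or 'not_misinfo'."""
    raise NotImplementedError  # wire up to the actual model endpoints

def per_region_accuracy(samples: list[dict], models: list[str]) -> dict:
    # samples: dicts with 'region', 'transcript', and 'human_label', where
    # 'human_label' is the consensus of annotators familiar with that culture.
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        for m in models:
            key = (m, s["region"])
            totals[key] += 1
            hits[key] += int(query_model(m, s["transcript"]) == s["human_label"])
    return {k: hits[k] / totals[k] for k in totals}

models = ["gpt-4o", "gemini-2.5-pro", "deepseek-v3.1"]
# accuracies = per_region_accuracy(samples, models)  # samples to be collected
```

A per-region breakdown would show whether accuracy drops specifically where traditional and scientific registers blend, rather than uniformly across all non-Western content.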
read the original abstract
Social media platforms have become primary channels for health information in the Global South. Using gomutra (cow urine) discourse on YouTube in India as a case study, we present a post-facto Large Language Model (LLM)-assisted discourse analysis of 30 multilingual transcripts showing that promotional content blends sacred traditional language with pseudo-scientific claims in ways that sophisticated debunking content itself mirrors, creating a rhetorical register that LLMs, trained predominantly on Western corpora, are systematically ill-equipped to analyse. Varying prompt tone across three LLMs (GPT-4o, Gemini 2.5 Pro, DeepSeek-V3.1), we find that culturally embedded health misinformation does not look like ordinary misinformation, and this cultural obfuscation extends to gendered rhetoric and prompt design, compounding analytical unreliability. Our findings argue that cultural competency in LLM-assisted discourse analysis cannot be retrofitted through prompt engineering alone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a post-facto LLM-assisted qualitative discourse analysis of 30 multilingual YouTube transcripts on gomutra (cow urine) as a health remedy in India. It claims that promotional content blends sacred traditional language with pseudo-scientific claims in a rhetorical register that is mirrored even in sophisticated debunking content, rendering LLMs (trained predominantly on Western corpora) systematically unable to detect the misinformation; experiments varying prompt tone across GPT-4o, Gemini 2.5 Pro, and DeepSeek-V3.1 are used to argue that cultural competency cannot be retrofitted via prompt engineering alone.
Significance. If the central observations hold, the work usefully draws attention to how non-Western rhetorical patterns in health discourse can evade standard LLM-based misinformation detection, with direct relevance to content moderation and public-health information on platforms serving the Global South. The use of multiple models and prompt variations is a modest strength, but the absence of quantitative grounding limits the force of the causal claims.
major comments (3)
- [Methods] The manuscript provides no details on transcript selection criteria, search terms, or sampling frame for the 30 videos (Methods section). Without this, it is impossible to evaluate selection bias or whether the observed sacred/pseudo-scientific blending is representative rather than an artifact of the chosen sample.
- [Results] No quantitative metrics (accuracy, precision, recall, or confusion matrices) are reported for the LLMs' classifications of promotional versus debunking content, nor is inter-annotator agreement provided for the human identification of 'mirroring' rhetoric (Results and Analysis sections). The central claim that LLMs are 'systematically ill-equipped' therefore rests entirely on unquantified qualitative interpretation; a minimal sketch of the requested metrics appears after this list.
- [Discussion] The attribution of LLM failures primarily to Western-centric training data (Discussion) is not supported by any ablation: there are no comparisons against base versus instruction-tuned models, no culturally aligned comparator models, and no systematic variation of model scale or training corpus proxies. This leaves open the possibility that failures arise from task ambiguity, prompt phrasing, or model limitations instead.
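For concreteness, a minimal sketch of the reporting the second comment asks for, using scikit-learn; the labels and predictions below are illustrative stand-ins, not data from the paper:

```python
# Illustrative only: inter-annotator agreement plus standard classification
# metrics for model labels against an adjudicated gold standard.
from sklearn.metrics import classification_report, cohen_kappa_score, confusion_matrix

annotator_a = ["misinfo", "misinfo", "debunk", "misinfo", "debunk"]
annotator_b = ["misinfo", "debunk", "debunk", "misinfo", "debunk"]
print("inter-annotator kappa:", cohen_kappa_score(annotator_a, annotator_b))

gold = annotator_a  # stand-in for adjudicated consensus labels
model_preds = ["misinfo", "debunk", "debunk", "debunk", "debunk"]
print(confusion_matrix(gold, model_preds, labels=["misinfo", "debunk"]))
print(classification_report(gold, model_preds, labels=["misinfo", "debunk"]))
```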
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly separate the descriptive finding (rhetorical mirroring) from the causal claim (Western training data as the root cause) to avoid conflating observation with explanation.
- [Results] A figure or table summarizing the prompt variations and model responses across the 30 transcripts would improve readability and allow readers to assess the consistency of the reported patterns.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which identifies key areas where the manuscript can be clarified and strengthened. We address each major comment point by point below, indicating the revisions we will make.
read point-by-point responses
- Referee: [Methods] The manuscript provides no details on transcript selection criteria, search terms, or sampling frame for the 30 videos (Methods section). Without this, it is impossible to evaluate selection bias or whether the observed sacred/pseudo-scientific blending is representative rather than an artifact of the chosen sample.
  Authors: We agree that the current Methods section lacks sufficient transparency on sampling. In the revised manuscript, we will add a new subsection specifying the search terms (including English, Hindi, and regional-language variants such as 'gomutra health', 'cow urine remedy', and 'gaumutra benefits'), the sampling frame (YouTube videos uploaded between 2018 and 2024 with minimum view thresholds), and inclusion criteria (focus on health claims, availability of auto-generated or manual transcripts, and exclusion of purely devotional or non-health content). This addition will allow readers to assess representativeness and potential biases.
  Revision: yes
- Referee: [Results] No quantitative metrics (accuracy, precision, recall, or confusion matrices) are reported for the LLMs' classifications of promotional versus debunking content, nor is inter-annotator agreement provided for the human identification of 'mirroring' rhetoric (Results and Analysis sections). The central claim that LLMs are 'systematically ill-equipped' therefore rests entirely on unquantified qualitative interpretation.
  Authors: The study employs a qualitative discourse-analytic approach rather than a quantitative classification benchmark; standard metrics such as accuracy or confusion matrices are therefore not directly applicable, as they would require imposing artificial binary ground-truth labels on culturally nuanced rhetorical patterns. The identification of mirroring rhetoric emerged from iterative human review of LLM outputs across prompt variations. In revision, we will expand the Results and Analysis sections with a more explicit account of the coding process and any consistency checks performed (a sketch of one such check follows these responses). We will also clarify that the claim of systematic limitations is grounded in the repeated failure patterns observed across three distinct models rather than in statistical performance measures.
  Revision: partial
- Referee: [Discussion] The attribution of LLM failures primarily to Western-centric training data (Discussion) is not supported by any ablation: there are no comparisons against base versus instruction-tuned models, no culturally aligned comparator models, and no systematic variation of model scale or training corpus proxies. This leaves open the possibility that failures arise from task ambiguity, prompt phrasing, or model limitations instead.
  Authors: We acknowledge that stronger causal evidence would require ablations comparing base and instruction-tuned models or culturally specific comparators. Such experiments lie beyond the scope and computational resources of the present exploratory study. In the revised Discussion, we will explicitly acknowledge this limitation, discuss alternative explanations including task ambiguity and prompt sensitivity, and moderate the strength of the training-data attribution while retaining the observation that consistent failures across GPT-4o, Gemini 2.5 Pro, and DeepSeek-V3.1 point to deeper representational gaps. We will also incorporate additional citations on documented cultural biases in LLMs to support the interpretive framing.
  Revision: partial
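As flagged in the response to the second comment, a lightweight consistency check is sketchable even without binary ground truth: measure how often a model's label for the same transcript survives a change of prompt tone. The tone set and `classify` helper below are illustrative assumptions, not the paper's protocol.

```python
# Hypothetical sketch: pairwise agreement of a model's labels across prompt
# tones, holding the transcript fixed. High disagreement would quantify the
# prompt sensitivity the referee and authors both discuss.
from itertools import combinations

TONES = ["friendly", "neutral", "adversarial"]

def classify(model: str, tone: str, transcript: str) -> str:
    """Placeholder: the model's label for a transcript under one prompt tone."""
    raise NotImplementedError

def pairwise_tone_agreement(model: str, transcripts: list[str]) -> dict:
    agree = {}
    for t1, t2 in combinations(TONES, 2):
        same = sum(classify(model, t1, x) == classify(model, t2, x)
                   for x in transcripts)
        agree[(t1, t2)] = same / len(transcripts)
    return agree
```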
Circularity Check
No significant circularity; empirical discourse analysis stands on direct transcript-model comparisons
full rationale
The paper performs a qualitative post-facto analysis of 30 specific multilingual YouTube transcripts, applying varied prompt tones to three LLMs (GPT-4o, Gemini 2.5 Pro, DeepSeek-V3.1) and reporting observed misclassifications. No equations, fitted parameters, or self-referential definitions appear in the provided text. The central claim—that cultural obfuscation in sacred-plus-pseudo-scientific rhetoric defeats prompt engineering—is presented as an interpretive conclusion drawn from the concrete model outputs on the sampled content rather than presupposed by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to force the result. The derivation chain therefore remains self-contained against the external benchmark of the transcripts themselves.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs trained predominantly on Western corpora are systematically ill-equipped to analyze culturally embedded health misinformation that blends sacred traditional language with pseudo-scientific claims.