Splits! Flexible Sociocultural Linguistic Investigation at Scale
Pith reviewed 2026-05-22 21:01 UTC · model grok-4.3
The pith
A demographically and topically split Reddit dataset creates a reusable sandbox for scalable sociocultural linguistic research.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors construct Splits!, a demographically and topically split Reddit dataset, and validate it through users' self-identification and by replicating known sociocultural linguistic phenomena (SLPs). They show its utility through a scalable two-stage process that filters large collections of potential SLPs (PSLPs) to surface promising candidates for deeper qualitative investigation.
What carries the argument
The Splits! dataset of demographically and topically split Reddit posts, which supports flexible exploration of how sociocultural backgrounds shape language use.
If this is right
- Researchers can explore hypotheses about language shaped by background without specialized data collection for each project.
- Known SLPs from the literature can be replicated systematically across the dataset.
- Large numbers of potential SLPs can be narrowed efficiently to those worth deeper qualitative work.
- The sandbox supports scalable prototyping of analyses that link context and background to language patterns.
Where Pith is reading between the lines
- Similar splitting methods could extend to other online text collections to study additional cultural contexts.
- The splits might reveal when language models trained on mixed data overlook demographic-specific patterns.
- Replicating additional known phenomena would probe the dataset's broader reliability.
Load-bearing premise
Demographic and topical splits on Reddit data accurately capture sociocultural linguistic phenomena in a way that permits replication of known patterns.
What would settle it
If replication of multiple known SLPs from the literature on the Splits! dataset produces results inconsistent with prior findings, the validation would fail.
Original abstract
Variation in language use, shaped by speakers' sociocultural background and specific context of use, offers a rich lens into cultural perspectives, values, and opinions. For example, Chinese students discuss "healthy eating" with words like "timing," "regularity," and "digestion," whereas Americans use vocabulary like "balancing food groups" and "avoiding fat and sugar," reflecting distinct cultural models of nutrition. The computational study of these Sociocultural Linguistic Phenomena (SLP) has traditionally been done in NLP via tailored analyses of specific groups or topics, requiring specialized data collection and experimental operationalization--a process not well-suited to quick hypothesis exploration and prototyping. To address this, we propose constructing a "sandbox" designed for systematic and flexible sociolinguistic research. Using our method, we construct a demographically/topically split Reddit dataset, Splits!, validated by self-identification and by replicating several known SLPs from existing literature. We showcase the sandbox's utility with a scalable, two-stage process that filters large collections of "potential" SLPs (PSLPs) to surface the most promising candidates for deeper, qualitative investigation.
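A contrast like the "healthy eating" example can be quantified in many ways. The sketch below uses a smoothed log-odds ratio of how often a term appears in posts from two groups. It is an illustrative assumption, not the scoring the paper uses: the function name `log_odds`, the smoothing constant `alpha`, and the toy posts are all hypothetical.

```python
import math

def log_odds(term, posts_a, posts_b, alpha=0.5):
    """Smoothed log-odds that a post from group A (vs. group B) contains `term`.

    `term` is matched as a lowercase substring; `alpha` is additive smoothing
    so that a term absent from one group does not produce a division by zero.
    """
    hits_a = sum(term in post.lower() for post in posts_a)
    hits_b = sum(term in post.lower() for post in posts_b)
    odds_a = (hits_a + alpha) / (len(posts_a) - hits_a + alpha)
    odds_b = (hits_b + alpha) / (len(posts_b) - hits_b + alpha)
    return math.log(odds_a / odds_b)

# Toy posts invented for illustration only (not drawn from Splits!).
group_a = [
    "Meal timing and regularity matter most.",
    "Warm food helps digestion.",
    "I eat at the same time daily.",
]
group_b = [
    "I focus on balancing food groups.",
    "Avoiding fat and sugar is key.",
    "Cut sugar, add protein.",
]

for term in ["digestion", "sugar"]:
    score = log_odds(term, group_a, group_b)
    direction = "group A" if score > 0 else "group B"
    print(f"{term}: {score:+.3f} (leans {direction})")
```

A positive score means the term is relatively more common in the first group. On real data the score would be computed per demographic/topic split, and a significance test would be needed before drawing conclusions.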
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a 'sandbox' for scalable sociocultural linguistic research by constructing Splits!, a Reddit dataset with demographic and topical splits derived from self-identification, validates it by replicating known SLPs from the literature, and demonstrates utility via a two-stage filtering process that narrows large sets of potential SLPs (PSLPs) to candidates suitable for qualitative follow-up.
Significance. If the demographic/topical splits reliably preserve fine-grained sociocultural signals with low label noise, the sandbox could enable rapid, systematic hypothesis exploration in sociolinguistics, reducing reliance on bespoke data collection for each new group or topic.
major comments (2)
- [Validation section] Validation section (and abstract): the claim that self-identification plus replication of known SLPs validates the splits for discovering new PSLPs lacks any reported quantitative assessment of label noise, precision/recall of the demographic partitions, or error analysis; voluntary Reddit disclosures are context-dependent and potentially performative, yet no evidence is given that the resulting partitions preserve signals for subtler, undocumented phenomena.
- [§5] §5 (utility demonstration): the two-stage PSLP filter is presented as scalable, but the manuscript provides no ablation or comparison showing that the demographic/topical splits are necessary for surfacing candidates that would be missed by topic-only or un-split baselines; replication of already-known SLPs is a low bar and does not test the core utility claim.
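The comparison requested in the second major comment could be reported, at its simplest, as a set difference between the candidates surfaced with and without splits. The sketch below assumes each run returns a list of candidate identifiers. The function `compare_candidates` and the identifiers are hypothetical placeholders, not outputs from the paper.

```python
def compare_candidates(with_splits, baseline):
    """Summarize how two candidate sets overlap.

    Returns the candidates unique to each run plus their Jaccard similarity
    (1.0 when both sets are empty).
    """
    with_splits, baseline = set(with_splits), set(baseline)
    union = with_splits | baseline
    return {
        "only_with_splits": sorted(with_splits - baseline),
        "only_in_baseline": sorted(baseline - with_splits),
        "jaccard": len(with_splits & baseline) / len(union) if union else 1.0,
    }

# Hypothetical candidate IDs, for illustration only.
split_run = ["pslp_03", "pslp_07", "pslp_11", "pslp_12"]
topic_only_run = ["pslp_07", "pslp_12", "pslp_20"]

report = compare_candidates(split_run, topic_only_run)
print(report)
```

Candidates listed under `only_with_splits` are the ones a topic-only baseline would have missed. Whether they are worth investigating still requires the qualitative follow-up the paper describes.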
minor comments (2)
- [Abstract] Abstract and introduction: the example contrasting Chinese and American 'healthy eating' discourse is presented without citation to the source study or dataset.
- [Introduction] Notation: 'PSLPs' and 'SLPs' are introduced without an explicit definition or distinction at first use.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each major point below and agree that revisions are needed to strengthen the validation and utility sections.
- Referee: [Validation section] Validation section (and abstract): the claim that self-identification plus replication of known SLPs validates the splits for discovering new PSLPs lacks any reported quantitative assessment of label noise, precision/recall of the demographic partitions, or error analysis; voluntary Reddit disclosures are context-dependent and potentially performative, yet no evidence is given that the resulting partitions preserve signals for subtler, undocumented phenomena.
Authors: We agree that the validation relies on self-identification and replication of known SLPs without quantitative metrics such as label noise estimates, precision/recall, or error analysis. Voluntary disclosures can indeed be context-dependent. In revision, we will add a dedicated limitations subsection in the validation section, include any available error analysis from the data construction, and moderate the abstract and validation claims to position the replication as supporting rather than conclusive evidence for new PSLPs. revision: yes
- Referee: [§5] §5 (utility demonstration): the two-stage PSLP filter is presented as scalable, but the manuscript provides no ablation or comparison showing that the demographic/topical splits are necessary for surfacing candidates that would be missed by topic-only or un-split baselines; replication of already-known SLPs is a low bar and does not test the core utility claim.
Authors: We acknowledge that no ablation or baseline comparison is provided to isolate the contribution of the demographic/topical splits versus topic-only or unsplit data. The replication of known SLPs validates the dataset but does not fully test discovery of new phenomena. In the revised manuscript, we will add an ablation study comparing the two-stage filter with and without splits to demonstrate their necessity for surfacing unique candidates. revision: yes
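The label-noise estimate discussed in the first response could be obtained by hand-auditing a random sample of self-identified users and reporting the precision of the labels with a confidence interval. The sketch below uses the standard Wilson score interval. The audit counts are hypothetical, and nothing here reflects an analysis the authors have reported.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# Hypothetical audit: 200 sampled users, 180 judged to be correctly labeled.
low, high = wilson_interval(correct=180, n=200)
print(f"estimated label precision: 0.90, interval: [{low:.3f}, {high:.3f}]")
```

The Wilson interval is preferred over the simple normal approximation here because it behaves sensibly when precision is close to 1 or the audited sample is small.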
Circularity Check
No circularity: dataset construction and external validation are independent
The paper's central contribution is the construction of the Splits! Reddit dataset via demographic and topical partitions derived from self-identification, followed by a two-stage filtering process for potential SLPs. Validation occurs by checking replication of known SLPs drawn from prior external literature and by the self-identification labels themselves, but this does not reduce any claimed result to the inputs by construction; replication of established phenomena serves as an independent check rather than a tautology. No equations, parameter fitting, uniqueness theorems, or self-citation chains appear in the abstract or described method that would force outputs to equal inputs. The work is a resource-building and tooling paper whose claims remain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Demographic and topical splits on Reddit posts can capture and replicate sociocultural linguistic phenomena