arxiv: 2504.04640 · v3 · submitted 2025-04-06 · 💻 cs.CL · cs.AI

Splits! Flexible Sociocultural Linguistic Investigation at Scale

Pith reviewed 2026-05-22 21:01 UTC · model grok-4.3

keywords sociocultural linguistics · Reddit dataset · SLP · language variation · demographic splits · computational sociolinguistics · sandbox · potential SLPs

The pith

A demographically and topically split Reddit dataset creates a reusable sandbox for scalable sociocultural linguistic research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to move beyond the need for tailored data collection for each new study of how language reflects sociocultural backgrounds. It does so by constructing Splits!, a Reddit dataset divided along demographic and topical lines. The dataset receives validation through user self-identification and successful replication of several known sociocultural linguistic phenomena reported in prior work. A two-stage filtering process then narrows large collections of potential SLPs down to the most promising candidates for qualitative follow-up. This approach supports quicker hypothesis exploration and prototyping without starting from specialized data gathering each time.

Core claim

The authors construct Splits!, a demographically and topically split Reddit dataset, and validate it by self-identification and replication of known SLPs. They show its utility through a scalable two-stage process that filters collections of potential SLPs to surface promising candidates for deeper investigation.
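
One way to picture the two-stage filter is as a pair of thresholds over two scores named in the paper's figures: lift (how strongly a lexicon retrieves the target group) and triviality (how obviously the lexicon is tied to that group, on a [0, 1] scale). The sketch below is illustrative only. The `PSLP` record, the `filter_pslps` function, the order of the stages, the 0.5 triviality cutoff, and the toy scores are all assumptions made for this example, not the authors' implementation; only the "lift > 1, low triviality" criterion comes from the paper's Figure 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSLP:
    """Hypothetical record for one potential SLP (field names are illustrative)."""
    lexicon: tuple
    target: str
    contrast: str
    topic: str
    lift: float        # > 1: lexicon retrieves the target group above its base rate
    triviality: float  # in [0, 1]; higher = more obviously tied to the target group


def filter_pslps(candidates, min_lift=1.0, max_triviality=0.5):
    """Keep candidates with lift > min_lift and triviality <= max_triviality.

    Both thresholds are assumed values chosen for illustration.
    """
    # Stage 1 (assumed): keep lexicons that actually separate the two groups.
    separating = [c for c in candidates if c.lift > min_lift]
    # Stage 2 (assumed): drop lexicons that are too obvious for the target group.
    surprising = [c for c in separating if c.triviality <= max_triviality]
    # Rank survivors: highest lift first, lowest triviality as the tie-breaker.
    return sorted(surprising, key=lambda c: (-c.lift, c.triviality))


# Toy candidates; the lift and triviality scores are invented for the example.
candidates = [
    PSLP(("rural jobs", "rural distress"), "Hindu/Jain/Sikh", "Jewish", "elections", 1.4, 0.3),
    PSLP(("temple", "diwali"), "Hindu/Jain/Sikh", "Jewish", "elections", 1.8, 0.9),
    PSLP(("ballot access",), "Jewish", "Catholic", "elections", 0.9, 0.2),
]
shortlist = filter_pslps(candidates)
print([c.lexicon for c in shortlist])
```

Keeping the two thresholds as separate parameters makes it easy to see how many candidates each stage removes, which is the kind of breakdown a qualitative follow-up would want before committing annotation effort.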

What carries the argument

The Splits! dataset of demographically and topically split Reddit posts, which supports flexible exploration of how sociocultural backgrounds shape language use.

If this is right

  • Researchers can explore hypotheses about language shaped by background without specialized data collection for each project.
  • Known SLPs from the literature can be replicated systematically across the dataset.
  • Large numbers of potential SLPs can be narrowed efficiently to those worth deeper qualitative work.
  • The sandbox supports scalable prototyping of analyses that link context and background to language patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar splitting methods could extend to other online text collections to study additional cultural contexts.
  • The splits might reveal when language models trained on mixed data overlook demographic-specific patterns.
  • Further replication tests on additional known phenomena would test the dataset's broader reliability.

Load-bearing premise

Demographic and topical splits on Reddit data accurately capture sociocultural linguistic phenomena in a way that permits replication of known patterns.

What would settle it

If replication of multiple known SLPs from the literature on the Splits! dataset produces results inconsistent with prior findings, the validation would fail.
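
To make this test concrete, a lift score can be computed from a ranked list of posts. The sketch below assumes one plausible definition, which this page does not spell out: lift at a cutoff is the target group's share of the top-ranked fraction of posts divided by its share of the whole pool, so values above 1 mean the lexicon pulls the target group toward the top. The function name `lift_at`, its parameters, and the definition itself are assumptions for illustration; the paper's exact formula may differ.

```python
import math


def lift_at(ranked_groups, target, fraction=0.005):
    """Assumed lift definition: target share in the top `fraction` / overall target share.

    ranked_groups: group label of each post, ordered from best to worst lexicon match.
    """
    if not ranked_groups:
        raise ValueError("ranked_groups must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Always inspect at least one post, even for very small pools.
    k = max(1, math.ceil(len(ranked_groups) * fraction))
    base_rate = sum(g == target for g in ranked_groups) / len(ranked_groups)
    if base_rate == 0:
        return 0.0  # target group absent from the pool; lift is undefined, report 0
    top_rate = sum(g == target for g in ranked_groups[:k]) / k
    return top_rate / base_rate


# Toy ranking invented for the example: target-group posts crowd the top half.
ranking = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
print(lift_at(ranking, "A", fraction=0.5))
```

Under this reading, a known SLP replicates when the lift for its lexicon stays above 1 for the group the prior literature associates with it, and fails to replicate when the lift hovers at or below 1.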

Figures

Figures reproduced from arXiv: 2504.04640 by Dan Goldwasser, Eylon Caplan, Tania Chakraborty.

Figure 1: Visualization of the seed subreddit discovery process. Each bubble is a seed subreddit, sized by post volume and positioned by user overlap with other seeds, and clustering validates our iterative expansion method. Crucially, this plot shows raw user overlap between subreddits, not final demographic user groups, which are filtered to be nearly disjoint. We use Reddit data from (Chang et al., 2020), which…
Figure 2: Normalized self-identification rate vs. group-ness of the Catholic demographic.
Figure 3: SPLITS! dataset and our evaluation framework: 2 demographics discussing the same topic are combined, indexed, and reranked using the input PSLP Lexicon. Triviality is computed to encourage unexpected PSLPs. … and phrases more than demographic B when discussing topic t”. We represent this as $\mathrm{PSLP}_{L,A,B,t}$.
Figure 4: All PSLPs including case studies (subsampled). Most promising PSLPs in upper left: lift > 1, low triviality. … within [0, 1]. We precisely define triviality as $\mathrm{triv}(\mathrm{PSLP}_{L,A,B,t}) := R_{\mathrm{subspace}}(L, \ell_A)$. As such, the more words in the lexicon $L$ that are semantically similar to the target demographic $A$ as a whole, the more trivial it becomes. We note that the ‘triviality’ metric can be easily modified to fit spec…
Figure 5: … to see if it had been studied before. The lexicon implies that Hindus/Jains/Sikhs discuss rural economic issues (rural jobs, rural distress, rural development) more than Jewish people when discussing elections. We found some work indicating that the South Asian community (predominantly Hindu, Sikh, Jain) is indeed more concerned with agricultural economic issues, whereas the Jewish community focuses on…
Figure 6: Normalized self-identification rate vs. group-ness of the remaining demographics.
Category: Topics
  • Sports: basketball, soccer, football…
  • Entertainment: superheroes, sci-fi, fantasy…
  • Tech/Gaming: pc builds, coding, AI…
  • Careers: jobs, resumes, freelance…
  • Hobbies: gardening, cooking, crafts…
  • Finance: budgets, stocks, retiring…
  • Education: college, study tips, exams…
  • News: global, politics, e…
Figure 7: User intersectionality in the Splits! dataset. B Case Studies of Known SLPs. Jewish English: Benor and Cohen studied the vocabulary of American Jews, noting a difference in the usage of certain Yiddish and Hebrew words. Further, Benor (2012); McWhorter (2013) study…
Figure 9: Heatmap of combined indices by demographic.
Figure 10: Distributions of Triviality by prompt type. … higher use of Yiddish/Hebrew in Judaism than in Professional topics. This means that not only do Jewish people use Yiddish/Hebrew features more than non-Jewish users, but they use these features far more in certain contexts. These two results together show that the dataset captures the known SLP of (1) Jewish Yiddish/Hebrew use and (2) Jewish code-switching […]
Figure 11: Lift at 0.5 of the Black demographic when talking about Hip-Hop/Rap using AAVE lexicon, as contrasted with 4 other demographics. Jewish: (avg. 1.124@0.5%), but are about average in triviality (0.746 Triviality)
Figure 12: Lift at 0.5 of the Jewish demographic when talking about Judaism using Yiddish/Hebrew, as contrasted with 3 other demographics.
Figure 13: Lift at 0.5 of the Hindu Jain Sikh demographic when talking about Personal Cultural Identity using "dance", as contrasted with 4 other demographics. … annotators were presented with a Target Demographic (e.g., "Jewish"), a Contrast Demographic (e.g., "Catholic"), a Topic (e.g., "Elections"), and a Lexicon (e.g., "ballot access", "voter registration", "gerrymandering"). They were then instructed to rate t…
Figure 15: Prompt when 2 demographics are given.
Figure 17: Prompt when target demographic and topic is given.
Figure 18: Prompt when both demographics and topic is given. ### Task Details: You are given a topic and two demographics; a target demographic and contrast demographic. Your task is to come up with 15 cultural, sociological, or linguistic theories about how the target group talks about the topic, especially as opposed to the contrast demographic. Then for each theory, come up with keywords and phrases to help retri…
Original abstract

Variation in language use, shaped by speakers' sociocultural background and specific context of use, offers a rich lens into cultural perspectives, values, and opinions. For example, Chinese students discuss "healthy eating" with words like "timing," "regularity," and "digestion," whereas Americans use vocabulary like "balancing food groups" and "avoiding fat and sugar," reflecting distinct cultural models of nutrition. The computational study of these Sociocultural Linguistic Phenomena (SLP) has traditionally been done in NLP via tailored analyses of specific groups or topics, requiring specialized data collection and experimental operationalization--a process not well-suited to quick hypothesis exploration and prototyping. To address this, we propose constructing a "sandbox" designed for systematic and flexible sociolinguistic research. Using our method, we construct a demographically/topically split Reddit dataset, Splits!, validated by self-identification and by replicating several known SLPs from existing literature. We showcase the sandbox's utility with a scalable, two-stage process that filters large collections of "potential" SLPs (PSLPs) to surface the most promising candidates for deeper, qualitative investigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a 'sandbox' for scalable sociocultural linguistic research by constructing Splits!, a Reddit dataset with demographic and topical splits derived from self-identification, validates it by replicating known SLPs from the literature, and demonstrates utility via a two-stage filtering process that narrows large sets of potential SLPs (PSLPs) to candidates suitable for qualitative follow-up.

Significance. If the demographic/topical splits reliably preserve fine-grained sociocultural signals with low label noise, the sandbox could enable rapid, systematic hypothesis exploration in sociolinguistics, reducing reliance on bespoke data collection for each new group or topic.

major comments (2)
  1. [Validation section, abstract] The claim that self-identification plus replication of known SLPs validates the splits for discovering new PSLPs lacks any reported quantitative assessment of label noise, precision/recall of the demographic partitions, or error analysis. Voluntary Reddit disclosures are context-dependent and potentially performative, yet no evidence is given that the resulting partitions preserve signals for subtler, undocumented phenomena.
  2. [§5, utility demonstration] The two-stage PSLP filter is presented as scalable, but the manuscript provides no ablation or comparison showing that the demographic/topical splits are necessary for surfacing candidates that would be missed by topic-only or un-split baselines. Replication of already-known SLPs is a low bar and does not test the core utility claim.
minor comments (2)
  1. [Abstract, Introduction] The example contrasting Chinese and American 'healthy eating' discourse is presented without a citation to the source study or dataset.
  2. [Introduction] Notation: 'PSLPs' and 'SLPs' are introduced without an explicit definition or distinction at first use.
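
The label-noise check requested in major comment 1 could be sketched as follows, assuming a hand-audited sample of users with trusted group labels is available (nothing on this page indicates that such a sample exists). The function `partition_precision_recall`, its dictionary inputs, and the toy user IDs and labels are all hypothetical and only illustrate the computation.

```python
def partition_precision_recall(assigned, audited, group):
    """Precision and recall of one demographic partition against audited labels.

    assigned: dict mapping user ID -> group assigned by the dataset construction.
    audited:  dict mapping user ID -> trusted group label from a manual audit.
    Only users present in both dicts are scored.
    """
    users = assigned.keys() & audited.keys()
    tp = sum(1 for u in users if assigned[u] == group and audited[u] == group)
    fp = sum(1 for u in users if assigned[u] == group and audited[u] != group)
    fn = sum(1 for u in users if assigned[u] != group and audited[u] == group)
    # Report 0.0 when a ratio is undefined (no predicted or no true members).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels invented for the example.
assigned = {"u1": "Catholic", "u2": "Catholic", "u3": "Jewish", "u4": "Jewish"}
audited = {"u1": "Catholic", "u2": "Jewish", "u3": "Catholic", "u4": "Jewish"}
print(partition_precision_recall(assigned, audited, "Catholic"))
```

Reporting these two numbers per demographic would show directly how much label noise each partition carries, which is the gap the referee identifies.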

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major point below and agree that revisions are needed to strengthen the validation and utility sections.

Point-by-point responses
  1. Referee: [Validation section, abstract] The claim that self-identification plus replication of known SLPs validates the splits for discovering new PSLPs lacks any reported quantitative assessment of label noise, precision/recall of the demographic partitions, or error analysis. Voluntary Reddit disclosures are context-dependent and potentially performative, yet no evidence is given that the resulting partitions preserve signals for subtler, undocumented phenomena.

    Authors: We agree that the validation relies on self-identification and replication of known SLPs without quantitative metrics such as label noise estimates, precision/recall, or error analysis. Voluntary disclosures can indeed be context-dependent. In revision, we will add a dedicated limitations subsection in the validation section, include any available error analysis from the data construction, and moderate the abstract and validation claims to position the replication as supporting rather than conclusive evidence for new PSLPs. revision: yes

  2. Referee: [§5, utility demonstration] The two-stage PSLP filter is presented as scalable, but the manuscript provides no ablation or comparison showing that the demographic/topical splits are necessary for surfacing candidates that would be missed by topic-only or un-split baselines. Replication of already-known SLPs is a low bar and does not test the core utility claim.

    Authors: We acknowledge that no ablation or baseline comparison is provided to isolate the contribution of the demographic/topical splits versus topic-only or unsplit data. The replication of known SLPs validates the dataset but does not fully test discovery of new phenomena. In the revised manuscript, we will add an ablation study comparing the two-stage filter with and without splits to demonstrate their necessity for surfacing unique candidates. revision: yes

Circularity Check

0 steps flagged

No circularity: dataset construction and external validation are independent

full rationale

The paper's central contribution is the construction of the Splits! Reddit dataset via demographic and topical partitions derived from self-identification, followed by a two-stage filtering process for potential SLPs. Validation occurs by checking replication of known SLPs drawn from prior external literature and by the self-identification labels themselves, but this does not reduce any claimed result to the inputs by construction; replication of established phenomena serves as an independent check rather than a tautology. No equations, parameter fitting, uniqueness theorems, or self-citation chains appear in the abstract or described method that would force outputs to equal inputs. The work is a resource-building and tooling paper whose claims remain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that Reddit data splits by demographics and topics can serve as a valid proxy for sociocultural linguistic variation; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Demographic and topical splits on Reddit posts can capture and replicate sociocultural linguistic phenomena.
    This premise underpins both the dataset construction and the validation by self-identification and known SLP replication.

pith-pipeline@v0.9.0 · 5724 in / 1289 out tokens · 198719 ms · 2026-05-22T21:01:08.902621+00:00 · methodology

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 1 internal anchor

  1. [1]

    Neural Computing and Applications, 35:5113–5144

    A systematic review of machine learning techniques for stance detection and its applications. Neural Computing and Applications, 35:5113–5144. American Jewish Committee. 2012. 2012 AJC Survey of American Jewish Opinion: Data Summary. Data report. Lisa P. Argyle, Ethan C. Busby, Nancy Fulda, Joshua R. Gubler, Christopher Rytting, and David Wingate

  2. [2]

    Political Analysis, 31(3):337–351

    Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351. Sören Arlt, Carlos Ruiz-Gonzalez, and Mario Krenn

  3. [3]

    _eprint: 2210.09981

    Digital Discovery of a Scientific Concept at the Core of Experimental Quantum Optics. _eprint: 2210.09981. Lora Aroyo and Chris Welty. 2015. Truth is a lie: Crowd truth and the seven myths of human annotation. The AI Magazine, 36(1):15–24. Puput Puji Astuti. 2018. THE USE OF AFRICAN-AMERICAN VERNACULAR ENGLISH (AAVE) IN LOGIC’S EVERYBODY. Jinan C. Banna...

  4. [4]

    Ethnic English

    The influence of linguistic form and causal explanations on the development of social essentialism. Cognition, 229:105246. Sarah Bunin Benor and Steven M Cohen. Talking Jewish: The “Ethnic English” of American Jews. S.B. Benor. 2012. Becoming Frum: How Newcomers Learn the Language and Culture of Orthodox Judaism. Jewish Cultures of the World. Rutger...

  5. [5]

    In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online

    Interpretation of NLP models through input marginalization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online. Association for Computational Linguistics. Mingyang Li, Louis Hickman, Louis Tay, Lyle Ungar, and Sharath Chandra Guntuku. 2020. Studying Politeness across Cultures Using English Twitt...

  6. [6]

    A Unified Approach to Interpreting Model Predictions

    Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pages 2356–2362. Chen Cecilia Liu, Iryna Gurevych, and Anna Korhonen. 2025. Culturally aware and adapted ...

  7. [7]

    arXiv preprint

    Enhancing Creativity in Large Language Models through Associative Thinking Strategies. arXiv preprint. ArXiv:2405.06715 [cs]. Moran Mizrahi, Chen Shani, Gabriel Stanovsky, Dan Jurafsky, and Dafna Shahaf. 2025. Cooking Up Creativity: A Cognitively-Inspired Approach for Enhancing LLM Creativity through Structured Representations. arXiv preprint. ArXiv...

  8. [8]

    we demand justice!

    "we demand justice!": Towards social context grounding of political texts. Preprint, arXiv:2311.09106. Gal Raayoni, Shahar Gottlieb, Yahel Manor, George Pisha, Yoav Harris, Uri Mendlovic, Doron Haviv, Yaron Hadad, and Ido Kaminer. 2021. Generating conjectures on fundamental constants with the Ramanujan Machine. Nature, 590(7844):67–73. Publisher: Natu...

  9. [9]

    Why Should I Trust You?

    “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, San Diego, California. Association for Computational Linguistics. Anjali Roy. 2011. Meanings of Bhangra and Bollywood Dancing in India and the Dia...

  10. [10]

    Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A

    Whose opinions do language models reflect? Preprint, arXiv:2303.17548. Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. 2020. Social bias frames: Reasoning about social and power implications of language. Preprint, arXiv:1911.03891. Seth J. Schwartz, Byron L. Zamboanga, and Robert S. Weisskirch. 2008. Broadening t...

  11. [11]

    discussion of Christmas trees on the moon fighting over a purple golf club

    Inducing a lexicon of sociolinguistic variables from code-mixed text. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 1–6, Brussels, Belgium. Association for Computational Linguistics. Geneva Smitherman. 2007. African American English. GRIN Verlag. Ian Stewart. 2014. Now We Stronger than Ever: Africa...

  12. [12]

    What kinds of words/phrases might Demographic A use that Demographic B would not? Specifically, we care about such words/phrases that are not obvious, or unexpected

    First think about what you know about the two demographics A and B, especially when they talk about the given topic. What kinds of words/phrases might Demographic A use that Demographic B would not? Specifically, we care about such words/phrases that are not obvious, or unexpected. Example: When talking about "recipes", Indian people when contrasted wit...

  13. [13]

    promising

    Once you are ready, think about how these keywords compare to what you came up with. Were you surprised that the keywords worked in distinguishing the two groups? To measure the consistency and quality of the annotations, we calculated the Intraclass Correlation Coefficient. Using the ‘ICC(2,k)’ two-way random effects model, which assesses the reliabi...

  14. [15]

    Otherwise output fewer but high quality words

    Output as many words and phrases as you think are appropriate: – If there are many differentiating things, output many words. Otherwise output fewer but high quality words

  15. [17]

    core socialist values

    Don’t output reasoning with the words, just the words and phrases. ### Example Input: Target demographic: Chinese Contrast demographic: Russian ### Example Output: The target demographic is Chinese, and the contrast demographic is Russian. The goal is to find English-language words and phrases that appear in posts by people from the Chinese demographic but...

  16. [18]

    Spring Festival Gala

  17. [19]

    ### Task Input: Target demographic: {target} Contrast demographic: {contrast} ### Task Output: Figure 15: Prompt when 2 demographics are given

    C-pop ... ### Task Input: Target demographic: {target} Contrast demographic: {contrast} ### Task Output: Figure 15: Prompt when 2 demographics are given. ### Task Overview: You are a socio-linguistic scientist. You will answer questions posed by the user, taking into consideration every detail of their request. Format the output in the same way as the ...

  18. [23]

    aragalaya

    Don’t output reasoning with the words, just the words and phrases. ### Example Input: Target demographic: Sri Lankan ### Example Output: Sri Lankans are a South Asian demographic with diverse linguistic and cultural backgrounds, primarily Sinhala and Tamil speakers, but English is widely used in online posts, especially among urban youth and diaspora comm...

  19. [24]

    ### Task Input: Target demographic: {target} ### Task Output: Figure 16: Prompt when only target demographics is given

    Ranil ... ### Task Input: Target demographic: {target} ### Task Output: Figure 16: Prompt when only target demographics is given. ### Task Overview: You are a socio-linguistic scientist. You will answer questions posed by the user, taking into consideration every detail of their request. Format the output in the same way as the example provided. ### Task ...

  20. [25]

    they might use

    ONLY output words (and phrases) that can be directly searched. No theories like “they might use...”

  21. [28]

    Don’t output reasoning with the words, just the words and phrases. ### Example Input: Target demographic: Sri Lankan Topic: cricket ### Example Output: Sri Lankans are deeply passionate about cricket—it’s the most popular sport in the country and a major source of national pride. Sri Lankan cricket fans often reference legendary players, local teams, and ...

  22. [29]

    Lankan fighting spirit

  23. [30]

    ### Task Input: Target demographic: {target} Topic: {topic} ### Task Output Figure 17: Prompt when target demographic and topic is given

    Proud to be Sri Lankan ... ### Task Input: Target demographic: {target} Topic: {topic} ### Task Output Figure 17: Prompt when target demographic and topic is given. ### Task Overview: You are a socio-linguistic scientist. You will answer questions posed by the user, taking into consideration every detail of their request. Format the output in the same ...

  24. [31]

    they might use

    ONLY output words (and phrases) that can be directly searched. No theories like "they might use..."

  25. [32]

    Otherwise output less but high quality words

    Output as many words and phrases as you think are appropriate: - If there are many differentiating things, output many words. Otherwise output less but high quality words

  26. [33]

    Mention out loud what you know about them and then generate the words

    First think about the demographics given out loud. Mention out loud what you know about them and then generate the words

  27. [34]

    Don’t output reasoning with the words, just the words and phrases. ### Example Input: Target demographic: Chinese Contrast demographic: Russian Topic: cooking ### Example Output: Reasoning: Chinese cooking culture emphasizes diverse regional cuisines like Sichuan, Cantonese, Hunan, and Shanghainese. It includes techniques such as stir-frying, steaming, br...

  28. [35]

    genetics

    liangpi . . . ### Task Input: Target demographic: {target} Contrast demographic: {contrast} Topic: {topic} ### Task Output Figure 18: Prompt when both demographics and topic is given. ### Task Details: You are given a topic and two demographics; a target demographic and contrast demographic. Your task is to come up with 15 cultural, sociological, or lingu...

  29. [36]

    Theory 1: <your first theory> Keywords and Phrases: <word>, <phrase>,

  30. [37]

    ### Input: Target demographic: {target} Contrast demographic: {contrast} Topic: {topic} Figure 19: Prompt when both demographics and topic is given to generate creative lexicon

    Theory 15: <your last theory> Keywords and Phrases: <word>, <phrase>, ... ### Input: Target demographic: {target} Contrast demographic: {contrast} Topic: {topic} Figure 19: Prompt when both demographics and topic is given to generate creative lexicon.