Splits! Flexible Sociocultural Linguistic Investigation at Scale

Dan Goldwasser; Eylon Caplan; Tania Chakraborty

arxiv: 2504.04640 · v3 · submitted 2025-04-06 · 💻 cs.CL · cs.AI


Pith reviewed 2026-05-22 21:01 UTC · model grok-4.3


keywords sociocultural linguistics · Reddit dataset · SLP · language variation · demographic splits · computational sociolinguistics · sandbox · potential SLPs


The pith

A demographically and topically split Reddit dataset creates a reusable sandbox for scalable sociocultural linguistic research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to move beyond the need for tailored data collection for each new study of how language reflects sociocultural backgrounds. It does so by constructing Splits!, a Reddit dataset divided along demographic and topical lines. The dataset receives validation through user self-identification and successful replication of several known sociocultural linguistic phenomena reported in prior work. A two-stage filtering process then narrows large collections of potential SLPs down to the most promising candidates for qualitative follow-up. This approach supports quicker hypothesis exploration and prototyping without starting from specialized data gathering each time.

Core claim

The authors construct Splits!, a demographically and topically split Reddit dataset, and validate it by self-identification and replication of known SLPs. They show its utility through a scalable two-stage process that filters collections of potential SLPs to surface promising candidates for deeper investigation.

What carries the argument

The Splits! dataset of demographically and topically split Reddit posts, which supports flexible exploration of how sociocultural backgrounds shape language use.

If this is right

Researchers can explore hypotheses about language shaped by background without specialized data collection for each project.
Known SLPs from the literature can be replicated systematically across the dataset.
Large numbers of potential SLPs can be narrowed efficiently to those worth deeper qualitative work.
The sandbox supports scalable prototyping of analyses that link context and background to language patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar splitting methods could extend to other online text collections to study additional cultural contexts.
The splits might reveal when language models trained on mixed data overlook demographic-specific patterns.
Further replication tests on additional known phenomena would test the dataset's broader reliability.

Load-bearing premise

Demographic and topical splits on Reddit data accurately capture sociocultural linguistic phenomena in a way that permits replication of known patterns.

What would settle it

If replication of multiple known SLPs from the literature on the Splits! dataset produces results inconsistent with prior findings, the validation would fail.

Figures

Figures reproduced from arXiv: 2504.04640 by Dan Goldwasser, Eylon Caplan, Tania Chakraborty.

**Figure 1.** Visualization of the seed subreddit discovery process. Each bubble is a seed subreddit, sized by post volume and positioned by user overlap with other seeds; the clustering validates our iterative expansion method. Crucially, this plot shows raw user overlap between subreddits, not final demographic user groups, which are filtered to be nearly disjoint. We use Reddit data from (Chang et al., 2020), which…

**Figure 2.** Normalized self-identification rate vs. group-ness of the Catholic demographic.

**Figure 3.** SPLITS! dataset and our evaluation framework: two demographics discussing the same topic are combined, indexed, and reranked using the input PSLP lexicon. Triviality is computed to encourage unexpected PSLPs.

… and phrases more than demographic B when discussing topic t”. We represent this as $\mathrm{PSLP}_{L,A,B,t}$.
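The Figure 3 caption describes combining two demographics' posts on one topic, ranking them with a lexicon, and checking whether the target demographic is over-represented near the top. A minimal sketch of that idea follows. It rests on assumptions that are not in the source: `Post`, `lexicon_score`, and `lift_at` are hypothetical names, plain term counting stands in for whatever retrieval and reranking model the paper uses, and "lift" is read here as the target group's share of the top fraction of the ranking divided by its share of all posts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    demographic: str  # e.g. "A" (target) or "B" (contrast)
    text: str


def lexicon_score(text: str, lexicon: list[str]) -> int:
    """Count lexicon hits in a post (a crude stand-in for a real retrieval score)."""
    lowered = text.lower()
    return sum(lowered.count(term.lower()) for term in lexicon)


def lift_at(posts: list[Post], lexicon: list[str], target: str, top_frac: float) -> float:
    """Assumed definition: target share in the top `top_frac` of the ranking / overall target share."""
    if not posts:
        raise ValueError("need at least one post")
    # Rank the combined pool of both demographics by lexicon score, best first.
    ranked = sorted(posts, key=lambda p: lexicon_score(p.text, lexicon), reverse=True)
    k = max(1, round(len(ranked) * top_frac))
    top_share = sum(p.demographic == target for p in ranked[:k]) / k
    base_share = sum(p.demographic == target for p in posts) / len(posts)
    if base_share == 0:
        raise ValueError("target demographic has no posts")
    return top_share / base_share


# Toy data, invented for illustration only.
toy_posts = [
    Post("A", "rural jobs and rural distress dominate this election"),
    Post("A", "the weather today is nice"),
    Post("B", "polling station hours were extended"),
    Post("B", "turnout was high this year"),
]
toy_lift = lift_at(toy_posts, ["rural jobs", "rural distress"], target="A", top_frac=0.25)
```

Under this reading, a lift above 1 means the lexicon pulls the target demographic's posts toward the top of the ranking more often than chance would.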

**Figure 4.** All PSLPs including case studies (subsampled). Most promising PSLPs in upper left: lift > 1, low triviality.

… within [0, 1]. We precisely define triviality as $\mathrm{triv}(\mathrm{PSLP}_{L,A,B,t}) := R_{\mathrm{subspace}}(L, \ell_A)$. As such, the more words in the lexicon $L$ that are semantically similar to the target demographic $A$ as a whole, the more trivial it becomes. We note that the ‘triviality’ metric can be easily modified to fit spec…
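The Figure 4 caption places the most promising PSLPs at lift > 1 and low triviality. The selection step could be sketched as below, taking both scores as already computed. This is only an illustration: `ScoredPSLP`, `promising`, and the `max_triviality` cutoff of 0.5 are hypothetical, and the source does not state a specific triviality threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredPSLP:
    name: str
    lift: float        # > 1 means the lexicon favours the target demographic
    triviality: float  # in [0, 1]; lower means less obvious


def promising(candidates: list[ScoredPSLP], max_triviality: float = 0.5) -> list[ScoredPSLP]:
    """Keep candidates with lift > 1 and low triviality, least trivial first."""
    kept = [c for c in candidates if c.lift > 1.0 and c.triviality <= max_triviality]
    # Break triviality ties by preferring the higher lift.
    return sorted(kept, key=lambda c: (c.triviality, -c.lift))


# Toy scores, invented for illustration only.
toy_candidates = [
    ScoredPSLP("a", lift=1.4, triviality=0.2),
    ScoredPSLP("b", lift=0.9, triviality=0.1),  # dropped: lift <= 1
    ScoredPSLP("c", lift=1.2, triviality=0.8),  # dropped: too trivial
    ScoredPSLP("d", lift=2.0, triviality=0.2),
]
shortlist = [c.name for c in promising(toy_candidates)]
```

The shortlist is what would be handed to a researcher for qualitative follow-up; the cutoff is a knob to tune, not a value taken from the paper.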

**Figure 5.** …, to see if it had been studied before. The lexicon implies that Hindus/Jains/Sikhs discuss rural economic issues (rural jobs, rural distress, rural development) more than Jewish people when discussing elections. We found some work indicating that the South Asian community (predominantly Hindu, Sikh, Jain) is indeed more concerned with agricultural economic issues, whereas the Jewish community focuses on…

**Figure 6.** Normalized self-identification rate vs. group-ness of the remaining demographics.

| Category | Topics |
| --- | --- |
| Sports | basketball, soccer, football… |
| Entertainment | superheroes, sci-fi, fantasy… |
| Tech/Gaming | pc builds, coding, AI… |
| Careers | jobs, resumes, freelance… |
| Hobbies | gardening, cooking, crafts… |
| Finance | budgets, stocks, retiring… |
| Education | college, study tips, exams… |
| News | global, politics, e… |

**Figure 7.** User intersectionality in the Splits! dataset.

B Case Studies of Known SLPs. Jewish English: Benor and Cohen studied the vocabulary of American Jews, noting a difference in the usage of certain Yiddish and Hebrew words. Further, Benor (2012); McWhorter (2013) study…

**Figure 9.** Heatmap of combined indices by demographic.

**Figure 10.** Distributions of triviality by prompt type.

… higher use of Yiddish/Hebrew in Judaism than in Professional topics. This means that not only do Jewish people use Yiddish/Hebrew features more than non-Jewish users, but they use these features far more in certain contexts. These two results together show that the dataset captures the known SLPs of (1) Jewish Yiddish/Hebrew use and (2) Jewish code-switching.

**Figure 11.** Lift at 0.5 of the Black demographic when talking about Hip-Hop/Rap using AAVE lexicon, as contrasted with 4 other demographics.

Jewish: (avg. 1.124@0.5%), but are about average in triviality (0.746 Triviality)

**Figure 12.** Lift at 0.5 of the Jewish demographic when talking about Judaism using Yiddish/Hebrew, as contrasted with 3 other demographics.

**Figure 13.** Lift at 0.5 of the Hindu Jain Sikh demographic when talking about Personal Cultural Identity using "dance", as contrasted with 4 other demographics.

… annotators were presented with a Target Demographic (e.g., "Jewish"), a Contrast Demographic (e.g., "Catholic"), a Topic (e.g., "Elections"), and a Lexicon (e.g., "ballot access", "voter registration", "gerrymandering"). They were then instructed to rate t…

**Figure 15.** Prompt when 2 demographics are given.

**Figure 17.** Prompt when target demographic and topic is given.

**Figure 18.** Prompt when both demographics and topic is given.

### Task Details: You are given a topic and two demographics; a target demographic and contrast demographic. Your task is to come up with 15 cultural, sociological, or linguistic theories about how the target group talks about the topic, especially as opposed to the contrast demographic. Then for each theory, come up with keywords and phrases to help retri…

Original abstract

Variation in language use, shaped by speakers' sociocultural background and specific context of use, offers a rich lens into cultural perspectives, values, and opinions. For example, Chinese students discuss "healthy eating" with words like "timing," "regularity," and "digestion," whereas Americans use vocabulary like "balancing food groups" and "avoiding fat and sugar," reflecting distinct cultural models of nutrition. The computational study of these Sociocultural Linguistic Phenomena (SLP) has traditionally been done in NLP via tailored analyses of specific groups or topics, requiring specialized data collection and experimental operationalization--a process not well-suited to quick hypothesis exploration and prototyping. To address this, we propose constructing a "sandbox" designed for systematic and flexible sociolinguistic research. Using our method, we construct a demographically/topically split Reddit dataset, Splits!, validated by self-identification and by replicating several known SLPs from existing literature. We showcase the sandbox's utility with a scalable, two-stage process that filters large collections of "potential" SLPs (PSLPs) to surface the most promising candidates for deeper, qualitative investigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Splits! gives a practical split Reddit dataset and filtering pipeline for sociocultural linguistics, but replicating known patterns offers only weak support for finding new ones.


The main point is a new demographically and topically split Reddit dataset called Splits! built as a reusable sandbox, plus a two-stage process to filter potential SLPs down to promising ones for closer study. They validate the splits by matching self-identifications and by recovering several established SLPs from prior work. This setup moves away from custom data collection for each new question toward something more off-the-shelf for quick tests.

Referee Report

2 major / 2 minor

Summary. The paper introduces a 'sandbox' for scalable sociocultural linguistic research by constructing Splits!, a Reddit dataset with demographic and topical splits derived from self-identification, validates it by replicating known SLPs from the literature, and demonstrates utility via a two-stage filtering process that narrows large sets of potential SLPs (PSLPs) to candidates suitable for qualitative follow-up.

Significance. If the demographic/topical splits reliably preserve fine-grained sociocultural signals with low label noise, the sandbox could enable rapid, systematic hypothesis exploration in sociolinguistics, reducing reliance on bespoke data collection for each new group or topic.

major comments (2)

[Validation section] Validation section (and abstract): the claim that self-identification plus replication of known SLPs validates the splits for discovering new PSLPs lacks any reported quantitative assessment of label noise, precision/recall of the demographic partitions, or error analysis; voluntary Reddit disclosures are context-dependent and potentially performative, yet no evidence is given that the resulting partitions preserve signals for subtler, undocumented phenomena.
[§5] §5 (utility demonstration): the two-stage PSLP filter is presented as scalable, but the manuscript provides no ablation or comparison showing that the demographic/topical splits are necessary for surfacing candidates that would be missed by topic-only or un-split baselines; replication of already-known SLPs is a low bar and does not test the core utility claim.

minor comments (2)

[Abstract] Abstract and introduction: the example contrasting Chinese and American 'healthy eating' discourse is presented without citation to the source study or dataset.
[Introduction] Notation: 'PSLPs' and 'SLPs' are introduced without an explicit definition or distinction at first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major point below and agree that revisions are needed to strengthen the validation and utility sections.


Referee: [Validation section] Validation section (and abstract): the claim that self-identification plus replication of known SLPs validates the splits for discovering new PSLPs lacks any reported quantitative assessment of label noise, precision/recall of the demographic partitions, or error analysis; voluntary Reddit disclosures are context-dependent and potentially performative, yet no evidence is given that the resulting partitions preserve signals for subtler, undocumented phenomena.

Authors: We agree that the validation relies on self-identification and replication of known SLPs without quantitative metrics such as label noise estimates, precision/recall, or error analysis. Voluntary disclosures can indeed be context-dependent. In revision, we will add a dedicated limitations subsection in the validation section, include any available error analysis from the data construction, and moderate the abstract and validation claims to position the replication as supporting rather than conclusive evidence for new PSLPs. revision: yes
Referee: [§5] §5 (utility demonstration): the two-stage PSLP filter is presented as scalable, but the manuscript provides no ablation or comparison showing that the demographic/topical splits are necessary for surfacing candidates that would be missed by topic-only or un-split baselines; replication of already-known SLPs is a low bar and does not test the core utility claim.

Authors: We acknowledge that no ablation or baseline comparison is provided to isolate the contribution of the demographic/topical splits versus topic-only or unsplit data. The replication of known SLPs validates the dataset but does not fully test discovery of new phenomena. In the revised manuscript, we will add an ablation study comparing the two-stage filter with and without splits to demonstrate their necessity for surfacing unique candidates. revision: yes

Circularity Check

0 steps flagged

No circularity: dataset construction and external validation are independent

full rationale

The paper's central contribution is the construction of the Splits! Reddit dataset via demographic and topical partitions derived from self-identification, followed by a two-stage filtering process for potential SLPs. Validation occurs by checking replication of known SLPs drawn from prior external literature and by the self-identification labels themselves, but this does not reduce any claimed result to the inputs by construction; replication of established phenomena serves as an independent check rather than a tautology. No equations, parameter fitting, uniqueness theorems, or self-citation chains appear in the abstract or described method that would force outputs to equal inputs. The work is a resource-building and tooling paper whose claims remain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that Reddit data splits by demographics and topics can serve as a valid proxy for sociocultural linguistic variation; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Demographic and topical splits on Reddit posts can capture and replicate sociocultural linguistic phenomena.
This premise underpins both the dataset construction and the validation by self-identification and known SLP replication.



Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 1 internal anchor

[1]

Neural Computing and Applications, 35:5113–5144

A systematic review of machine learning techniques for stance detection and its applications. Neural Computing and Applications, 35:5113–5144. American Jewish Committee. 2012. 2012 AJC Survey of American Jewish Opinion: Data Summary. Data report. Lisa P. Argyle, Ethan C. Busby, Nancy Fulda, Joshua R. Gubler, Christopher Rytting, and David Wingate

work page 2012
[2]

Political Analysis, 31(3):337–351

Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351. Sören Arlt, Carlos Ruiz-Gonzalez, and Mario Krenn

work page
[3]

_eprint: 2210.09981

Digital Discovery of a Scientific Concept at the Core of Experimental Quantum Optics. _eprint: 2210.09981. Lora Aroyo and Chris Welty. 2015. Truth is a lie: Crowd truth and the seven myths of human annotation. The AI Magazine, 36(1):15–24. Puput Puji Astuti. 2018. THE USE OF AFRICAN-AMERICAN VERNACULAR ENGLISH (AAVE) IN LOGIC’S EVERYBODY. Jinan C. Banna...

work page arXiv 2015
[4]

Ethnic English

The influence of linguistic form and causal explanations on the development of social essentialism. Cognition, 229:105246. Sarah Bunin Benor and Steven M Cohen. Talking Jewish: The “Ethnic English” of American Jews. S.B. Benor. 2012. Becoming Frum: How Newcomers Learn the Language and Culture of Orthodox Judaism. Jewish Cultures of the World. Rutger...

work page doi:10.1080/14708470903267384 2012
[5]

In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online

Interpretation of NLP models through input marginalization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online. Association for Computational Linguistics. Mingyang Li, Louis Hickman, Louis Tay, Lyle Ungar, and Sharath Chandra Guntuku. 2020. Studying Politeness across Cultures Using English Twitt...

work page arXiv 2020
[6]

A Unified Approach to Interpreting Model Predictions

Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pages 2356–2362. Chen Cecilia Liu, Iryna Gurevych, and Anna Korhonen. 2025. Culturally aware and adapted ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1002/9780470754856.ch2 2021
[7]

arXiv preprint

Enhancing Creativity in Large Language Models through Associative Thinking Strategies. arXiv preprint. ArXiv:2405.06715 [cs]. Moran Mizrahi, Chen Shani, Gabriel Stanovsky, Dan Jurafsky, and Dafna Shahaf. 2025. Cooking Up Creativity: A Cognitively-Inspired Approach for Enhancing LLM Creativity through Structured Representations. arXiv preprint. ArXiv...

work page arXiv 2025
[8]

we demand justice!

"we demand justice!": Towards social context grounding of political texts. Preprint, arXiv:2311.09106. Gal Raayoni, Shahar Gottlieb, Yahel Manor, George Pisha, Yoav Harris, Uri Mendlovic, Doron Haviv, Yaron Hadad, and Ido Kaminer. 2021. Generating conjectures on fundamental constants with the Ramanujan Machine. Nature, 590(7844):67–73. Publisher: Natu...

work page arXiv 2021
[9]

Why Should I Trust You?

“Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, San Diego, California. Association for Computational Linguistics. Anjali Roy. 2011. Meanings of Bhangra and Bollywood Dancing in India and the Dia...

work page 2016
[10]

Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A

Whose opinions do language models reflect? Preprint, arXiv:2303.17548. Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. 2020. Social bias frames: Reasoning about social and power implications of language. Preprint, arXiv:1911.03891. Seth J. Schwartz, Byron L. Zamboanga, and Robert S. Weisskirch. 2008. Broadening t...

work page doi:10.1111/j.1751- 2020
[11]

discussion of Christmas trees on the moon fighting over a purple golf club

Inducing a lexicon of sociolinguistic variables from code-mixed text. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 1–6, Brussels, Belgium. Association for Computational Linguistics. Geneva Smitherman. 2007. African American English. GRIN Verlag. Ian Stewart. 2014. Now We Stronger than Ever: Africa...

work page 2018
[12]

What kinds of words/phrases might Demographic A use that Demographic B would not? Specifically, we care about such words/phrases that are not obvious, or unexpected

First think about what you know about the two demographics A and B, especially when they talk about the given topic. What kinds of words/phrases might Demographic A use that Demographic B would not? Specifically, we care about such words/phrases that are not obvious, or unexpected. Example: When talking about "recipes", Indian people when contrasted wit...

work page
[13]

promising

Once you are ready, think about how these keywords compare to what you came up with. Were you surprised that the keywords worked in distinguishing the two groups? To measure the consistency and quality of the annotations, we calculated the Intraclass Correlation Coefficient. Using the ‘ICC(2,k)’ two-way random effects model, which assesses the reliabi...

work page
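The excerpt above names ICC(2,k) as the annotation reliability measure. For readers unfamiliar with it, here is a minimal sketch of the standard two-way random effects, average-of-k-raters, absolute-agreement formula, ICC(2,k) = (MSR - MSE) / (MSR + (MSC - MSE) / n), computed from a complete items-by-raters table. The function name `icc2k` and the toy ratings are illustrative assumptions; the paper's own ratings and tooling are not shown in the source.

```python
def icc2k(ratings: list[list[float]]) -> float:
    """ICC(2,k): two-way random effects, average of k raters, absolute agreement.

    ratings[i][j] is the score rater j gave item i (no missing cells).
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a full n x k table with n >= 2 and k >= 2")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # Two-way ANOVA sums of squares: items (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    # Undefined (division by zero) when every rating in the table is identical.
    return (msr - mse) / (msr + (msc - mse) / n)


# Toy tables, invented for illustration only.
perfect = icc2k([[1, 1], [2, 2], [3, 3]])  # raters agree exactly
offset = icc2k([[1, 2], [2, 3], [3, 4]])   # second rater is always one point higher
```

Because ICC(2,k) measures absolute agreement, a constant offset between raters lowers the score even though their rankings of the items are identical.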
[15]

Otherwise output fewer but high quality words

Output as many words and phrases as you think are appropriate: – If there are many differentiating things, output many words. Otherwise output fewer but high quality words

work page
[17]

core socialist values

Don’t output reasoning with the words, just the words and phrases. ### Example Input: Target demographic: Chinese Contrast demographic: Russian ### Example Output: The target demographic is Chinese, and the contrast demographic is Russian. The goal is to find English-language words and phrases that appear in posts by people from the Chinese demographic but...

work page
[18]

Spring Festival Gala

work page
[19]

### Task Input: Target demographic: {target} Contrast demographic: {contrast} ### Task Output: Figure 15: Prompt when 2 demographics are given

C-pop ... ### Task Input: Target demographic: {target} Contrast demographic: {contrast} ### Task Output: Figure 15: Prompt when 2 demographics are given. ### Task Overview: You are a socio-linguistic scientist. You will answer questions posed by the user, taking into consideration every detail of their request. Format the output in the same way as the ...

work page
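The prompt excerpt above ends with `{target}` and `{contrast}` placeholders. A small sketch of how such a template tail might be filled is given below. The line breaks inside `PROMPT_TAIL` are an assumption (the extracted text is flattened), `fill_prompt` is a hypothetical helper, and only the visible tail of the prompt is reproduced, not the truncated instructions before it.

```python
# Tail of the two-demographic prompt as it appears in the excerpt; line breaks are assumed.
PROMPT_TAIL = (
    "### Task Input:\n"
    "Target demographic: {target}\n"
    "Contrast demographic: {contrast}\n"
    "### Task Output:"
)


def fill_prompt(target: str, contrast: str) -> str:
    """Substitute the two demographic names into the template tail."""
    return PROMPT_TAIL.format(target=target, contrast=contrast)


example_prompt = fill_prompt("Jewish", "Catholic")
```

Keeping the template as a single constant makes it easy to generate one prompt per demographic pair when sweeping over many PSLP candidates.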
[23]

aragalaya

Don’t output reasoning with the words, just the words and phrases. ### Example Input: Target demographic: Sri Lankan ### Example Output: Sri Lankans are a South Asian demographic with diverse linguistic and cultural backgrounds, primarily Sinhala and Tamil speakers, but English is widely used in online posts, especially among urban youth and diaspora comm...

work page
[24]

### Task Input: Target demographic: {target} ### Task Output: Figure 16: Prompt when only target demographics is given

Ranil ... ### Task Input: Target demographic: {target} ### Task Output: Figure 16: Prompt when only target demographics is given. ### Task Overview: You are a socio-linguistic scientist. You will answer questions posed by the user, taking into consideration every detail of their request. Format the output in the same way as the example provided. ### Task ...

work page
[25]

they might use

ONLY output words (and phrases) that can be directly searched. No theories like “they might use...”

work page
[28]

Don’t output reasoning with the words, just the words and phrases. ### Example Input: Target demographic: Sri Lankan Topic: cricket ### Example Output: Sri Lankans are deeply passionate about cricket—it’s the most popular sport in the country and a major source of national pride. Sri Lankan cricket fans often reference legendary players, local teams, and ...

work page
[29]

Lankan fighting spirit

work page
[30]

### Task Input: Target demographic: {target} Topic: {topic} ### Task Output Figure 17: Prompt when target demographic and topic is given

Proud to be Sri Lankan ... ### Task Input: Target demographic: {target} Topic: {topic} ### Task Output Figure 17: Prompt when target demographic and topic is given. ### Task Overview: You are a socio-linguistic scientist. You will answer questions posed by the user, taking into consideration every detail of their request. Format the output in the same ...

work page
[31]

they might use

ONLY output words (and phrases) that can be directly searched. No theories like "they might use..."

work page
[32]

Otherwise output less but high quality words

Output as many words and phrases as you think are appropriate: - If there are many differentiating things, output many words. Otherwise output less but high quality words

work page
[33]

Mention out loud what you know about them and then generate the words

First think about the demographics given out loud. Mention out loud what you know about them and then generate the words

work page
[34]

Don’t output reasoning with the words, just the words and phrases. ### Example Input: Target demographic: Chinese Contrast demographic: Russian Topic: cooking ### Example Output: Reasoning: Chinese cooking culture emphasizes diverse regional cuisines like Sichuan, Cantonese, Hunan, and Shanghainese. It includes techniques such as stir-frying, steaming, br...

work page
[35]

genetics

liangpi . . . ### Task Input: Target demographic: {target} Contrast demographic: {contrast} Topic: {topic} ### Task Output Figure 18: Prompt when both demographics and topic is given. ### Task Details: You are given a topic and two demographics; a target demographic and contrast demographic. Your task is to come up with 15 cultural, sociological, or lingu...

work page
[36]

Theory 1: <your first theory> Keywords and Phrases: <word>, <phrase>,

work page
[37]

### Input: Target demographic: {target} Contrast demographic: {contrast} Topic: {topic} Figure 19: Prompt when both demographics and topic is given to generate creative lexicon

Theory 15: <your last theory> Keywords and Phrases: <word>, <phrase>, ... ### Input: Target demographic: {target} Contrast demographic: {contrast} Topic: {topic} Figure 19: Prompt when both demographics and topic is given to generate creative lexicon.

work page

[1] [1]

Neu- ral Computing and Applications, 35:5113–5144

A systematic review of machine learning tech- niques for stance detection and its applications. Neu- ral Computing and Applications, 35:5113–5144. American Jewish Committee. 2012. 2012 AJC Survey of American Jewish Opinion: Data Summary. Data report. Lisa P. Argyle, Ethan C. Busby, Nancy Fulda, Joshua R. Gubler, Christopher Rytting, and David Wingate

work page 2012

[2] [2]

Political Analysis, 31(3):337–351

Out of one, many: Using language mod- els to simulate human samples. Political Analysis, 31(3):337–351. Sören Arlt, Carlos Ruiz-Gonzalez, and Mario Krenn

work page

[3] [3]

_eprint: 2210.09981

Digital Discovery of a Scientific Concept at the Core of Experimental Quantum Optics. _eprint: 2210.09981. Lora Aroyo and Chris Welty. 2015. Truth is a lie: Crowd truth and the seven myths of human annotation. The AI Magazine, 36(1):15–24. Puput Puji Astuti. 2018. THE USE OF AFRICAN- AMERICAN VERNACULAR ENGLISH (AA VE) IN LOGIC’S EVERYBODY. Jinan C. Banna...

work page arXiv 2015

[4] [4]

Ethnic English

The influence of linguistic form and causal ex- planations on the development of social essentialism. Cognition, 229:105246. Sarah Bunin Benor and Steven M Cohen. Talking Jew- ish: The “Ethnic English” of American Jews. S.B. Benor. 2012. Becoming Frum: How Newcom- ers Learn the Language and Culture of Orthodox Judaism. Jewish Cultures of the World. Rutger...

work page doi:10.1080/14708470903267384 2012

[5] [5]

In Proceedings of the 2020 Con- ference on Empirical Methods in Natural Language Processing (EMNLP), Online

Interpretation of NLP models through input marginalization. In Proceedings of the 2020 Con- ference on Empirical Methods in Natural Language Processing (EMNLP), Online. Association for Com- putational Linguistics. Mingyang Li, Louis Hickman, Louis Tay, Lyle Ungar, and Sharath Chandra Guntuku. 2020. Studying Po- liteness across Cultures Using English Twitt...

work page arXiv 2020

[6] [6]

A Unified Approach to Interpreting Model Predictions

Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pages 2356–2362. Chen Cecilia Liu, Iryna Gurevych, and Anna Korho- nen. 2025. Culturally aware and adapted ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1002/9780470754856.ch2 2021

[7] [7]

arXiv preprint

Enhancing Creativity in Large Language Mod- els through Associative Thinking Strategies. arXiv preprint. ArXiv:2405.06715 [cs]. Moran Mizrahi, Chen Shani, Gabriel Stanovsky, Dan Ju- rafsky, and Dafna Shahaf. 2025. Cooking Up Creativ- ity: A Cognitively-Inspired Approach for Enhancing LLM Creativity through Structured Representations. arXiv preprint. ArXiv...

work page arXiv 2025

[8] [8]

we demand justice!

"we demand justice!": Towards social context grounding of political texts. Preprint, arXiv:2311.09106. Gal Raayoni, Shahar Gottlieb, Yahel Manor, George Pisha, Yoav Harris, Uri Mendlovic, Doron Haviv, Yaron Hadad, and Ido Kaminer. 2021. Generating conjectures on fundamental constants with the Ra- manujan Machine. Nature, 590(7844):67–73. Pub- lisher: Natu...

work page arXiv 2021

[9] [9]

Why Should I Trust You?

“Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 2016 Conference of the North American Chap- ter of the Association for Computational Linguistics: Demonstrations, San Diego, California. Association for Computational Linguistics. Anjali Roy. 2011. Meanings of Bhangra and Bolly- wood Dancing in India and the Dia...

work page 2016

[10] [10]

Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Juraf- sky, Noah A

Whose opinions do language models reflect? Preprint, arXiv:2303.17548. Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Juraf- sky, Noah A. Smith, and Yejin Choi. 2020. Social bias frames: Reasoning about social and power im- plications of language. Preprint, arXiv:1911.03891. Seth J. Schwartz, Byron L. Zamboanga, and Robert S. Weisskirch. 2008. Broadening t...

work page doi:10.1111/j.1751- 2020

[11] [11]

discus- sion of Christmas trees on the moon fighting over a purple golf club

Inducing a lexicon of sociolinguistic variables from code-mixed text. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 1–6, Brussels, Bel- gium. Association for Computational Linguistics. Geneva Smitherman. 2007. African American English. GRIN Verlag. Ian Stewart. 2014. Now We Stronger than Ever: Africa...

work page 2018

[12] [12]

What kinds of words/phrases might Demographic A use that De- mographic B would not? Specifically, we care about such words/phrases that are not obvious, or unexpected

First think about what you know about the two demographics A and B, especially when they talk about the given topic. What kinds of words/phrases might Demographic A use that De- mographic B would not? Specifically, we care about such words/phrases that are not obvious, or unexpected. Example: When talking about "recipes", Indian people when contrasted wit...


[13] [13]

promising

Once you are ready, think about how these keywords compare to what you came up with. Were you surprised that the keywords worked in distinguishing the two groups? To measure the consistency and quality of the annotations, we calculated the Intraclass Correlation Coefficient. Using the ‘ICC(2,k)’ two-way random effects model, which assesses the reliabi...
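The ICC(2,k) statistic named in this passage can be computed from a subjects-by-raters score matrix using a two-way ANOVA decomposition. The following is a minimal sketch, assuming a complete matrix with no missing ratings and the Shrout and Fleiss (1979) formulation; the function name `icc_2k` and the input layout are illustrative choices, not the paper's code.

```python
import numpy as np

def icc_2k(ratings):
    """ICC(2,k): two-way random effects, absolute agreement, average of k raters.

    `ratings` is an (n_subjects, k_raters) array with no missing values
    (an assumption of this sketch).
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()

    # Sums of squares for subjects (rows), raters (columns), and residual.
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols

    # Mean squares from the two-way ANOVA table.
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Shrout and Fleiss (1979) ICC(2,k).
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)
```

Values near 1 indicate that the averaged ratings are highly reliable; the statistic is undefined when every rating in the matrix is identical, a case this sketch does not guard against.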


[14] [15]

Otherwise output fewer but high quality words

Output as many words and phrases as you think are appropriate: – If there are many differentiating things, output many words. Otherwise output fewer but high quality words


[15] [17]

core socialist values

Don’t output reasoning with the words, just the words and phrases. ### Example Input: Target demographic: Chinese Contrast demographic: Russian ### Example Output: The target demographic is Chinese, and the contrast demographic is Russian. The goal is to find English-language words and phrases that appear in posts by people from the Chinese demographic but...


[16] [18]

Spring Festival Gala


[17] [19]

### Task Input: Target demographic: {target} Contrast demographic: {contrast} ### Task Output: Figure 15: Prompt when 2 demographics are given

C-pop ... ### Task Input: Target demographic: {target} Contrast demographic: {contrast} ### Task Output: Figure 15: Prompt when 2 demographics are given. ### Task Overview: You are a socio-linguistic scientist. You will answer questions posed by the user, taking into consideration every detail of their request. Format the output in the same way as the ...


[18] [23]

aragalaya

Don’t output reasoning with the words, just the words and phrases. ### Example Input: Target demographic: Sri Lankan ### Example Output: Sri Lankans are a South Asian demographic with diverse linguistic and cultural backgrounds, primarily Sinhala and Tamil speakers, but English is widely used in online posts, especially among urban youth and diaspora comm...


[19] [24]

### Task Input: Target demographic: {target} ### Task Output: Figure 16: Prompt when only target demographics is given

Ranil ... ### Task Input: Target demographic: {target} ### Task Output: Figure 16: Prompt when only target demographics is given. ### Task Overview: You are a socio-linguistic scientist. You will answer questions posed by the user, taking into consideration every detail of their request. Format the output in the same way as the example provided. ### Task ...


[20] [25]

they might use

ONLY output words (and phrases) that can be directly searched. No theories like “they might use...”


[21] [28]

Don’t output reasoning with the words, just the words and phrases. ### Example Input: Target demographic: Sri Lankan Topic: cricket ### Example Output: Sri Lankans are deeply passionate about cricket—it’s the most popular sport in the country and a major source of national pride. Sri Lankan cricket fans often reference legendary players, local teams, and ...


[22] [29]

Lankan fighting spirit


[23] [30]

### Task Input: Target demographic: {target} Topic: {topic} ### Task Output Figure 17: Prompt when target demographic and topic is given

Proud to be Sri Lankan ... ### Task Input: Target demographic: {target} Topic: {topic} ### Task Output Figure 17: Prompt when target demographic and topic is given. ### Task Overview: You are a socio-linguistic scientist. You will answer questions posed by the user, taking into consideration every detail of their request. Format the output in the same ...


[24] [31]

they might use

ONLY output words (and phrases) that can be directly searched. No theories like "they might use..."


[25] [32]

Otherwise output less but high quality words

Output as many words and phrases as you think are appropriate: - If there are many differentiating things, output many words. Otherwise output less but high quality words


[26] [33]

Mention out loud what you know about them and then generate the words

First think about the demographics given out loud. Mention out loud what you know about them and then generate the words


[27] [34]

Don’t output reasoning with the words, just the words and phrases. ### Example Input: Target demographic: Chinese Contrast demographic: Russian Topic: cooking ### Example Output: Reasoning: Chinese cooking culture emphasizes diverse regional cuisines like Sichuan, Cantonese, Hunan, and Shanghainese. It includes techniques such as stir-frying, steaming, br...


[28] [35]

genetics

liangpi . . . ### Task Input: Target demographic: {target} Contrast demographic: {contrast} Topic: {topic} ### Task Output Figure 18: Prompt when both demographics and topic is given. ### Task Details: You are given a topic and two demographics; a target demographic and contrast demographic. Your task is to come up with 15 cultural, sociological, or lingu...


[29] [36]

Theory 1: <your first theory> Keywords and Phrases: <word>, <phrase>,


[30] [37]

### Input: Target demographic: {target} Contrast demographic: {contrast} Topic: {topic} Figure 19: Prompt when both demographics and topic is given to generate creative lexicon

Theory 15: <your last theory> Keywords and Phrases: <word>, <phrase>, ... ### Input: Target demographic: {target} Contrast demographic: {contrast} Topic: {topic} Figure 19: Prompt when both demographics and topic is given to generate creative lexicon.
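The `{target}`, `{contrast}`, and `{topic}` placeholders in the Figure 19 prompt, together with its "Theory N: ... Keywords and Phrases: ..." output format, suggest a simple fill-and-parse loop. The following is a minimal sketch, assuming the placeholders are filled by plain string substitution and that the model's reply follows the requested format exactly; `INPUT_TEMPLATE` is an abbreviated stand-in for the input section of the prompt, and the names `fill_prompt` and `parse_theories` are hypothetical, not taken from the paper's code.

```python
import re

# Abbreviated, hypothetical stand-in for the input section of the prompt.
INPUT_TEMPLATE = (
    "### Input:\n"
    "Target demographic: {target}\n"
    "Contrast demographic: {contrast}\n"
    "Topic: {topic}\n"
)

def fill_prompt(target, contrast, topic):
    """Substitute the three placeholders into the template."""
    return INPUT_TEMPLATE.format(target=target, contrast=contrast, topic=topic)

# One match per "Theory N: <text> Keywords and Phrases: <comma-separated terms>".
THEORY_RE = re.compile(
    r"Theory\s+(\d+):\s*(.*?)\s*Keywords and Phrases:\s*(.*?)\s*"
    r"(?=Theory\s+\d+:|\Z)",
    re.DOTALL,
)

def parse_theories(reply):
    """Split a reply into theories, each with its list of searchable terms."""
    results = []
    for number, theory, keywords in THEORY_RE.findall(reply):
        terms = [k.strip() for k in keywords.split(",") if k.strip()]
        results.append({"id": int(number), "theory": theory, "keywords": terms})
    return results
```

Keeping the keywords as a plain list per theory makes it straightforward to search each term directly in the corpus, which is what the prompts ask the model to make possible. A reply that deviates from the format (for example, keywords that themselves contain commas) would need a more forgiving parser than this one.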
