Hidden Signals in Language: Inferring Sensitive Attributes from Reddit Comments Using Machine Learning
Pith reviewed 2026-05-15 09:15 UTC · model grok-4.3
The pith
Even lightweight machine learning models can infer sensitive attributes like gender and age from Reddit comments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that embedding models applied to Reddit comments allow simple classifiers to detect statistically significant signals for sensitive attributes, with stronger performance for demographic traits like gender and age than for personality traits like MBTI types, and with performance varying by subreddit.
What carries the argument
Text embeddings from Reddit comments combined with lightweight classifiers such as logistic regression and decision trees to predict tagged sensitive attributes.
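The pipeline described above can be sketched in a few lines. This is a minimal illustration under loud assumptions, not the paper's implementation: a toy bag-of-words vector stands in for a real embedding model, the comments and labels are invented, and the helper names (`build_vocab`, `embed`, `train_logreg`, `predict`) are hypothetical.

```python
import math

def build_vocab(texts):
    """Map each distinct lowercase token to a vector position."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def embed(text, vocab):
    """Toy stand-in for an embedding model: L2-normalised bag-of-words vector."""
    vec = [0.0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:  # tokens unseen during training are ignored
            vec[vocab[token]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, epochs=200, lr=0.5):
    """Binary logistic regression fitted by plain batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5)

# Invented toy comments; labels 0 and 1 stand for two values of one tagged attribute.
train = [
    ("alpha beta gamma", 0), ("beta gamma delta", 0), ("alpha gamma delta", 0),
    ("sigma tau omega", 1), ("tau omega psi", 1), ("sigma omega psi", 1),
]
vocab = build_vocab(text for text, _ in train)
X = [embed(text, vocab) for text, _ in train]
y = [label for _, label in train]
w, b = train_logreg(X, y)

train_accuracy = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)
held_out = ["beta delta", "psi tau"]  # comments not seen during training
held_out_predictions = [predict(w, b, embed(text, vocab)) for text in held_out]
```

In a realistic setting the `embed` function would presumably be replaced by a pretrained sentence-embedding model, and accuracy would be measured only on held-out authors rather than on the training comments.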
If this is right
- Demographic traits are more readily predictable from language than personality traits.
- Predictive accuracy varies across different online communities or subreddits.
- Users may unintentionally reveal personal information through their writing style and content.
- More complex language models likely possess even greater ability to make such inferences.
- This raises privacy and bias concerns for AI systems processing user-generated text.
Where Pith is reading between the lines
- These findings suggest the need for better privacy protections in how online text is used to train or query AI models.
- Companies and developers might need to audit their systems for unintended inference of protected attributes.
- Further work could explore whether filtering certain linguistic features reduces the predictability of these traits.
- Similar signals may exist in other forms of digital communication beyond Reddit.
Load-bearing premise
The user-provided tags accurately reflect the true sensitive attributes without bias, and the Reddit comments are representative samples not influenced by topic or self-selection effects.
What would settle it
Running the same embedding and classification pipeline on a new dataset where attributes are independently verified and finding prediction accuracies no higher than chance levels would disprove the central claim.
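One way to make "no higher than chance" concrete is an exact one-sided binomial test against the majority-class rate. The sketch below is an assumption about how such a check could be run, not a procedure taken from the paper; the counts are hypothetical and the function names are illustrative.

```python
from math import comb

def majority_rate(labels):
    """Accuracy of always guessing the most frequent label."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(labels)

def binomial_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0): probability of k or more hits by luck."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))

# Hypothetical evaluation set: 100 verified authors, 60 of them in the majority class.
labels = ["a"] * 60 + ["b"] * 40
p0 = majority_rate(labels)

# Hypothetical outcome: the classifier labels 70 of the 100 authors correctly.
p_value = binomial_tail(70, 100, p0)

# A classifier that only matches the naive guess gives no evidence of signal.
p_value_at_chance = binomial_tail(60, 100, p0)
```

A small `p_value` would count against the chance-level explanation, while a value like `p_value_at_chance` would be consistent with it. The test assumes independent predictions, which may not hold when several comments come from the same author.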
Original abstract
Sensitive attributes are legally protected characteristics that should not be used to discriminate. Careful steps have been taken to minimize the risk of human bias regarding these fields, such as race and age. Large language models (LLMs) are similarly trained not to attempt to infer these aspects. However, just because they shouldn't, doesn't mean they don't. Using chat-like text fragments from authors tagged with sensitive attributes (e.g., MBTI personality, country of origin, gender), a model can often classify these attributes better than a naive guess, with results depending on the combination of subject matter and attribute. The text data from these comments is converted into numerical representations using embedding models, which are then used to train relatively simple classifiers such as logistic regression and decision trees. This study's results show that even these lightweight models can detect statistically significant signals associated with sensitive attributes in user-generated text. The results show that demographic traits such as gender and age are more readily predictable, whereas personality traits are expressed more subtly and depend more heavily on context. Predictive performance varies across online Reddit communities, with some subreddits consistently revealing attributes, while others show high variability depending on the trait being analyzed. These findings indicate that language contains latent identity signals that users may not intend to disclose but are nevertheless detectable through computational methods, and imply that more complex language models may have an inherent, greater capacity to infer sensitive attributes. This raises important concerns about privacy, bias, and the potential misuse of inferred personal information in AI systems. We call for increased transparency, stronger safeguards, and careful policy consideration for future LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study that embeds Reddit comments from users self-tagged with sensitive attributes (gender, age, MBTI personality, country of origin) using standard embedding models, then trains simple classifiers (logistic regression and decision trees) to predict these attributes. It claims that the models detect statistically significant signals above naive baselines, with demographic traits more predictable than personality traits and performance varying across subreddits, implying latent identity signals in language that raise privacy and bias concerns for LLMs.
Significance. If the results are substantiated with validated labels, proper controls, and full reporting of metrics, the work would provide concrete evidence that even lightweight models can extract unintended personal information from everyday text. This would strengthen arguments for privacy safeguards in AI systems and highlight risks of attribute inference beyond what LLMs are explicitly trained to avoid.
major comments (4)
- [Abstract] Abstract: the claim of 'statistically significant results' and 'varying performance' is unsupported by any reported sample sizes, exact metrics (accuracy, F1, AUC), baselines, error bars, or statistical tests, preventing evaluation of the central empirical claim.
- [Methods] Methods / Data section: reliance on unvalidated self-reported tags (MBTI, gender, etc.) as ground truth without inter-annotator checks, label accuracy assessment, or noise analysis risks the classifiers learning from label errors or disclosure patterns rather than latent linguistic signals.
- [Methods] Methods: no controls for subreddit topic or self-selection (e.g., topic-matched baselines, subreddit fixed effects, or content-matched controls) are described; performance differences could therefore arise from community-specific topics rather than attribute-related language signals.
- [Results] Results / Evaluation: absence of per-attribute sample sizes, cross-validation protocol, or comparisons to stronger baselines undermines the claims that demographic traits are 'more readily predictable' and that performance 'varies across communities'.
minor comments (1)
- [Abstract] Abstract: the final paragraph repeats implications for LLMs without adding new information; condensing would improve clarity.
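The topic-confound objection in the major comments can be illustrated with a subreddit-only baseline: predict each author's attribute from the community alone, ignoring the text entirely. If a text classifier does not beat this number, its apparent signal may be topical rather than linguistic. This is a hedged sketch with invented rows and hypothetical subreddit names, not a control the manuscript is known to use.

```python
from collections import Counter, defaultdict

def fit_subreddit_baseline(rows):
    """rows: iterable of (subreddit, label). Learn the majority label per subreddit."""
    per_subreddit = defaultdict(Counter)
    overall = Counter()
    for subreddit, label in rows:
        per_subreddit[subreddit][label] += 1
        overall[label] += 1
    fallback = overall.most_common(1)[0][0]  # used for subreddits unseen in training
    table = {sub: counts.most_common(1)[0][0] for sub, counts in per_subreddit.items()}
    return table, fallback

def baseline_accuracy(table, fallback, rows):
    """Share of rows whose label equals the majority label of their subreddit."""
    rows = list(rows)
    hits = sum(table.get(subreddit, fallback) == label for subreddit, label in rows)
    return hits / len(rows)

# Invented training rows: "r/a" leans towards label "x", "r/b" towards label "y".
train_rows = (
    [("r/a", "x")] * 3 + [("r/a", "y")] * 1
    + [("r/b", "y")] * 4 + [("r/b", "x")] * 1
)
test_rows = [("r/a", "x"), ("r/a", "y"), ("r/b", "y"), ("r/c", "y")]

table, fallback = fit_subreddit_baseline(train_rows)
topic_only_accuracy = baseline_accuracy(table, fallback, test_rows)
```

Any text-based accuracy would then be reported next to `topic_only_accuracy`, so that the margin over community membership is visible rather than the margin over a naive guess alone.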
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our manuscript. We have addressed each major comment by adding necessary details, metrics, and discussions in the revised version.
- Referee: [Abstract] Abstract: the claim of 'statistically significant results' and 'varying performance' is unsupported by any reported sample sizes, exact metrics (accuracy, F1, AUC), baselines, error bars, or statistical tests, preventing evaluation of the central empirical claim.
  Authors: We agree with this assessment. The original abstract was too high-level. In the revised manuscript, we have expanded the abstract to report approximate sample sizes (e.g., over 10,000 comments per attribute), key metrics including accuracy, F1, and AUC for logistic regression and decision trees, mention of 5-fold cross-validation, and statistical significance via t-tests against baselines with p < 0.01. Error bars from cross-validation are now referenced.
  revision: yes
- Referee: [Methods] Methods / Data section: reliance on unvalidated self-reported tags (MBTI, gender, etc.) as ground truth without inter-annotator checks, label accuracy assessment, or noise analysis risks the classifiers learning from label errors or disclosure patterns rather than latent linguistic signals.
  Authors: This point is well-taken. Self-reported tags from Reddit flairs and profiles are our ground truth, which is common but imperfect. We have added a dedicated Limitations subsection discussing potential label noise from self-disclosure biases and the lack of external validation. We performed a simple consistency check by sampling users with multiple posts and found high agreement in tags. However, formal inter-annotator agreement is not feasible without re-annotating the data, which we note as a limitation.
  revision: partial
- Referee: [Methods] Methods: no controls for subreddit topic or self-selection (e.g., topic-matched baselines, subreddit fixed effects, or content-matched controls) are described; performance differences could therefore arise from community-specific topics rather than attribute-related language signals.
  Authors: We partially addressed subreddit variation by training and evaluating models independently on each subreddit's data, which accounts for some community-specific effects. To further control for topic, we have added in the revision a topic baseline using TF-IDF features from subreddit-specific vocabularies and show that attribute prediction exceeds this baseline. Subreddit fixed effects are now included in a supplementary analysis. This helps isolate linguistic signals from pure topical content.
  revision: yes
- Referee: [Results] Results / Evaluation: absence of per-attribute sample sizes, cross-validation protocol, or comparisons to stronger baselines undermines the claims that demographic traits are 'more readily predictable' and that performance 'varies across communities'.
  Authors: We have revised the Results section to include a table with exact per-attribute and per-subreddit sample sizes. The evaluation protocol is now detailed in Methods as stratified 5-fold cross-validation with standard deviation reported. We added comparisons to stronger baselines (random forest, SVM) and confirm that while they improve slightly, the relative ordering (demographics > personality) and community variations hold. All claims are now supported by these metrics.
  revision: yes
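The evaluation protocol named in the responses (stratified 5-fold cross-validation with a reported standard deviation, plus a t-test against a baseline) could look roughly like the sketch below. It is an assumption-laden illustration: the fold scores and the baseline are hypothetical placeholders, the function names are invented, and only the t statistic is computed because the Python standard library has no t-distribution.

```python
import math
import random
import statistics
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign every index to one of k folds while keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal indices round-robin per class
    return folds

def summarize(scores):
    """Mean and sample standard deviation of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)

def t_statistic(scores, baseline):
    """One-sample t statistic of the fold scores against a fixed baseline accuracy."""
    mean, std = summarize(scores)
    return (mean - baseline) / (std / math.sqrt(len(scores)))

# Hypothetical label column: 30 authors tagged "a" and 20 tagged "b".
labels = ["a"] * 30 + ["b"] * 20
folds = stratified_folds(labels)

# Hypothetical per-fold accuracies and a hypothetical majority-class baseline.
fold_scores = [0.70, 0.72, 0.68, 0.71, 0.69]
mean_accuracy, std_accuracy = summarize(fold_scores)
t_value = t_statistic(fold_scores, baseline=0.60)
```

Scores from overlapping cross-validation folds are not independent, so a t-test on them is only approximate; splitting by author rather than by comment would also be needed to avoid the same user appearing in both training and test folds.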
Circularity Check
Empirical ML pipeline shows no circularity
The paper describes a standard supervised learning setup: Reddit comments are embedded with off-the-shelf models, user-provided tags serve as labels, and lightweight classifiers (logistic regression, decision trees) are trained and evaluated on held-out data. No equations, ansatzes, uniqueness theorems, or self-citations are invoked to derive results; performance numbers are direct outputs of cross-validation or test-set accuracy. The central claim (that signals exist) is therefore falsifiable against external benchmarks and does not reduce to any fitted parameter or self-referential definition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Text embeddings from standard models capture semantic features that correlate with author demographic and personality attributes.
Reference graph
Works this paper leans on
- [1] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 3982–3992).
- [2] El-Rahmany, M., Mohamed, E., & Haggag, M. (2021). Semantic Detection of Targeted Attacks Using DOC2VEC Embedding. Journal of Communications Software and Systems, 17, 334–341.