pith. machine review for the scientific record.

arxiv: 2601.06316 · v3 · submitted 2026-01-09 · 💻 cs.CL

Recognition: no theorem link

Annotating Dimensions of Social Perception in Text: A Sentence-Level Dataset of Warmth and Competence

Pith reviewed 2026-05-16 15:30 UTC · model grok-4.3

classification 💻 cs.CL
keywords warmth, competence, social perception, dataset, annotation, NLP, social media, trust

The pith

The first sentence-level dataset annotates warmth and competence in social media text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces W&C-Sent, a collection of over 1,600 English sentence-target pairs drawn from social media posts that express opinions about specific people or groups. Each sentence is labeled for trust and sociability, which together make up warmth, plus a separate competence score. The authors explain the full process of gathering the posts, running crowd annotations, and applying quality controls. They also test several large language models on the task of identifying these three dimensions from raw text. The resource moves beyond existing word-level lists by capturing how the constructs appear in full sentences and discourse.

Core claim

We introduce Warmth and Competence Sentences (W&C-Sent), the first sentence-level dataset annotated for warmth and competence. The dataset includes over 1,600 English sentence-target pairs annotated along three dimensions: trust and sociability (components of warmth), and competence. The sentences in W&C-Sent are social media posts that express attitudes and opinions about specific individuals or social groups.

What carries the argument

W&C-Sent, a sentence-level annotation resource that labels social media text for the psychological dimensions of trust, sociability, and competence.
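
To make the shape of such a resource concrete, here is a minimal Python sketch of one annotated record. Every name in it (`AnnotatedPair`, the field names, the integer ratings, the median aggregation) is a hypothetical illustration, not the released schema or the authors' aggregation method.

```python
from dataclasses import dataclass, field
from statistics import median

# The three annotated dimensions named in the paper's abstract.
DIMENSIONS = ("trust", "sociability", "competence")


@dataclass
class AnnotatedPair:
    """One sentence-target pair with per-annotator ratings (hypothetical schema)."""
    sentence: str
    target: str
    # Maps a dimension name to the ratings given by individual annotators.
    ratings: dict[str, list[int]] = field(default_factory=dict)

    def aggregate(self) -> dict[str, float]:
        """Return the median rating per dimension (an assumed aggregation rule)."""
        return {
            dim: median(self.ratings[dim])
            for dim in DIMENSIONS
            if self.ratings.get(dim)  # skip dimensions with no ratings
        }


# Toy example with an invented sentence and invented ratings; not from the dataset.
pair = AnnotatedPair(
    sentence="She always keeps her word.",
    target="Women",
    ratings={"trust": [2, 1, 2], "sociability": [0, 1, 0], "competence": [0, 0, 1]},
)
print(pair.aggregate())
```

Keeping the raw per-annotator ratings next to the aggregate, rather than storing only one label, would let later analyses choose a different aggregation rule or inspect disagreement directly.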

If this is right

  • NLP systems can now model contextual expression of social perceptions instead of relying only on word-level lexicons.
  • Large language models can be evaluated and improved on the specific task of detecting trust, sociability, and competence.
  • Computational social science gains a new tool for studying how language encodes attitudes toward individuals and groups.
  • The dataset supports development of applications that analyze social media for expressions of these dimensions.
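
The evaluation point above can be sketched as a small scoring routine. It assumes integer ordinal labels and reports mean absolute error, exact-match accuracy, and within-one accuracy. The function name, the metric selection, and the toy labels are illustrative assumptions, not the paper's evaluation protocol.

```python
def score_predictions(gold: list[int], pred: list[int]) -> dict[str, float]:
    """Compare model ratings with human ratings on an ordinal scale."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    n = len(gold)
    abs_errors = [abs(g - p) for g, p in zip(gold, pred)]
    return {
        # Average distance between predicted and human rating.
        "mae": sum(abs_errors) / n,
        # Share of predictions that match the human rating exactly.
        "accuracy": sum(e == 0 for e in abs_errors) / n,
        # Share of predictions at most one rating level away (near misses count).
        "within_one": sum(e <= 1 for e in abs_errors) / n,
    }


# Toy labels, invented for illustration only.
human = [-1, 0, 1, 1]
model = [-1, 1, 1, -1]
print(score_predictions(human, model))
```

Reporting a near-miss measure next to exact accuracy is useful for ordinal ratings, because predicting a neighboring level is a smaller error than predicting the opposite end of the scale.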

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models trained on the data could flag stereotypical language in online discussions about social groups.
  • The same annotation approach could be applied to longer texts or multi-turn conversations to study how perceptions evolve.
  • Cross-cultural or cross-platform extensions of the dataset would allow comparisons of how warmth and competence are expressed in different societies.

Load-bearing premise

Crowd-worker sentence annotations reliably and validly reflect the established psychological constructs of warmth and competence.

What would settle it

If inter-annotator agreement scores are low or if the labels show no correlation with independent psychological measures of warmth and competence, the dataset would fail to capture the intended constructs.
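
The first condition can be checked with a standard agreement coefficient. The sketch below computes Krippendorff's alpha with the interval (squared-difference) distance in plain Python. The choice of coefficient and distance, the function name, and the toy ratings are assumptions for illustration and are not taken from the paper.

```python
from itertools import permutations


def krippendorff_alpha_interval(units: list[list[float]]) -> float:
    """Krippendorff's alpha for interval data.

    `units` holds one list of ratings per annotated item; lists may differ
    in length when some annotators skipped an item.
    """
    # Items with fewer than two ratings carry no agreement information.
    pairable = [u for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one item with two or more ratings")

    # Observed disagreement: squared differences within each item.
    d_observed = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n

    # Expected disagreement: squared differences across all pooled ratings.
    d_expected = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_expected == 0:
        raise ValueError("all ratings are identical; alpha is undefined")

    return 1 - d_observed / d_expected


# Toy ratings, invented for illustration only.
print(krippendorff_alpha_interval([[1, 2], [3, 3]]))
```

A value of 1 means perfect agreement and values near 0 mean agreement no better than chance. The pooled-pairs loop is quadratic in the number of ratings, which is acceptable for a sketch but would call for a coincidence-matrix formulation on larger data.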

Figures

Figures reproduced from arXiv: 2601.06316 by Mutaz Ayesh, Nedjma Ousidhoum, Saif M. Mohammad.

Figure 1: Examples showing divergences between di…
Figure 2: An example illustrating how the SemEval-2016 Stance dataset was used to extract sentence–target pairs
Figure 3: The distribution of median scores of our…
Figure 4: Distribution of discretized mean-based labels, …
Figure 5: A figure showing the distribution of fine-grained unanimous judgments, across targets and dimensions. The figure shows how abundant the agreement is on sentences whose targets are Clinton and Trump in the trust dimension, and the complete absence of the target Environmentalists. H.1.2 Soft Unanimity: This refers to cases where annotators agreed on the overall polarity (low, neutral, or high) of a sentence…
Figure 6: The distribution of soft unanimous judgments across targets and dimensions. The figure shows the overwhelming soft unanimous agreement for the trust and sociability of Clinton and Trump. … across two factors to be quickly visualized as colors within a matrix. The colors in all three grids show strong co-occurrence between median neutral scores; meaning, when competence within a sentence is judged as neutra…
Figure 7: Heat maps of correlation matrices for each…
Figure 8: The SHR scores of each target, in each rel…
Original abstract

Warmth (W) (often further broken down into Trust (T) and Sociability (S)) and Competence (C) are central dimensions along which people evaluate individuals and social groups (Fiske, 2018). While these constructs are well established in social psychology, they are only starting to get attention in NLP research through word-level lexicons, which do not fully capture their contextual expression in larger text units and discourse. In this work, we introduce Warmth and Competence Sentences (W&C-Sent), the first sentence-level dataset annotated for warmth and competence. The dataset includes over 1,600 English sentence-target pairs annotated along three dimensions: trust and sociability (components of warmth), and competence. The sentences in W&C-Sent are social media posts that express attitudes and opinions about specific individuals or social groups (the targets of our annotations). We describe the data collection, annotation, and quality-control procedures in detail, and evaluate a range of large language models (LLMs) on their ability to identify trust, sociability, and competence in text. W&C-Sent provides a new resource for analyzing warmth and competence in language and supports future research at the intersection of NLP and computational social science.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces W&C-Sent, the first sentence-level dataset of over 1,600 English sentence-target pairs drawn from social media posts and annotated for trust, sociability (as components of warmth), and competence; it details the data collection, annotation, and quality-control procedures and reports evaluations of LLMs on identifying these dimensions in text.

Significance. If the annotations prove reliable, the dataset would constitute a useful new resource that moves beyond existing word-level lexicons to contextual sentence-level annotations of established social-psychology constructs, supporting future work at the intersection of NLP and computational social science on social perception and bias in language.

major comments (1)
  1. [Abstract] Abstract: the description of annotation and quality-control procedures provides no quantitative inter-annotator agreement scores (e.g., Cohen's kappa or Krippendorff's alpha), no validation against established psychological scales, and no statement on dataset release status; these omissions leave the central claim that the annotations reliably capture the warmth and competence constructs only moderately supported.
minor comments (1)
  1. [Abstract] Abstract: replace the approximate size 'over 1,600' with the exact count of sentence-target pairs.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and recommendation for minor revision. We address the comment on the abstract below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the description of annotation and quality-control procedures provides no quantitative inter-annotator agreement scores (e.g., Cohen's kappa or Krippendorff's alpha), no validation against established psychological scales, and no statement on dataset release status; these omissions leave the central claim that the annotations reliably capture the warmth and competence constructs only moderately supported.

    Authors: We agree that the abstract should include quantitative support for annotation reliability and a statement on data availability. The full manuscript reports inter-annotator agreement using Krippendorff's alpha for each dimension along with detailed quality-control procedures; we will summarize these scores in the revised abstract. We will also add that the dataset will be released publicly upon publication. Regarding validation against established psychological scales, the annotations follow the theoretical framework of warmth and competence from social psychology (Fiske, 2018), with guidelines developed to capture these constructs at sentence level. No additional empirical validation with scales was conducted in this work, as the contribution centers on creating the sentence-level resource and LLM evaluation; we will clarify this grounding in the abstract to strengthen the reliability claim. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces an annotated dataset (W&C-Sent) grounded in external social-psychology literature (Fiske 2018) and new crowd annotations on social-media sentences. No equations, parameter fitting, predictions, or self-citation chains appear; the central contribution is empirical data creation with described collection and quality-control steps. All load-bearing elements (construct definitions, annotation guidelines) reference independent prior work rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the established validity of warmth and competence dimensions from prior social psychology and on standard human annotation practices; no free parameters, invented entities, or ad-hoc axioms are introduced.

axioms (1)
  • Domain assumption: Warmth (trust and sociability) and competence are valid, measurable dimensions of social perception that can be reliably annotated at the sentence level.
    Invoked via citation to Fiske 2018 and the decision to annotate sentences rather than words.

pith-pipeline@v0.9.0 · 5529 in / 1141 out tokens · 32991 ms · 2026-05-16T15:30:27.032819+00:00 · methodology
