Counterfactual Probing for the Influence of Affect and Specificity on Intergroup Bias

David I. Beaver; Junyi Jessy Li; Kyle Mahowald; Venkata S Govindarajan

arxiv: 2305.16409 · v2 · submitted 2023-05-25 · 💻 cs.CL · cs.CY

Counterfactual Probing for the Influence of Affect and Specificity on Intergroup Bias

Venkata S Govindarajan , Kyle Mahowald , David I. Beaver , Junyi Jessy Li This is my paper

Pith reviewed 2026-05-24 08:16 UTC · model grok-4.3

classification 💻 cs.CL cs.CY

keywords intergroup biasaffectspecificitycounterfactual probingNLP biaspragmatic featuressocial context in language

0 comments

The pith

Affect shapes how language models classify intergroup relationships in text, while specificity remains inconclusive.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether two pragmatic features, affect and specificity, change systematically across different intergroup social contexts in language. It starts with modest observed correlations between these features and supervised intergroup relationship labels on tweets. Counterfactual probing of finetuned neural models then shows that the models depend on affect when making those classifications, but the role of specificity cannot be confirmed. This work links a revised framing of bias as intergroup social context to measurable language properties. A sympathetic reader would care because it moves bias analysis in NLP beyond pejorative word lists toward how everyday pragmatic choices reflect social groupings.

Core claim

The authors establish that affect and specificity show modest correlations with intergroup relationship labels, and that neural models finetuned to predict those labels reliably draw on affect during classification while the contribution of specificity stays inconclusive under counterfactual probing.

What carries the argument

Counterfactual probing of neural models finetuned on supervised intergroup relationship (IGR) labels, which isolates the contribution of affect and specificity by altering those features in input text.

If this is right

Affect functions as a reliable signal that models exploit when detecting intergroup social context.
Specificity does not produce a clear, consistent signal in the same models.
Pragmatic features beyond negative language can be connected to intergroup bias through controlled probing.
Finetuned models for IGR labels can be decomposed to reveal which linguistic properties drive their decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If affect is the dominant feature, debiasing methods could target affective language patterns rather than broad topic filters.
The inconclusive result for specificity suggests either that it is not used by these models or that current probing methods lack sensitivity to detect it.
Extending the same probing approach to other pragmatic features such as hedging or politeness could map a fuller set of linguistic markers for intergroup context.

Load-bearing premise

The supervised intergroup relationship labels accurately reflect intergroup social context without being shaped by the same affect and specificity features under study.

What would settle it

A test in which removing or neutralizing affect words from the input tweets causes no drop in a finetuned model's accuracy on the IGR prediction task.

read the original abstract

While existing work on studying bias in NLP focues on negative or pejorative language use, Govindarajan et al. (2023) offer a revised framing of bias in terms of intergroup social context, and its effects on language behavior. In this paper, we investigate if two pragmatic features (specificity and affect) systematically vary in different intergroup contexts -- thus connecting this new framing of bias to language output. Preliminary analysis finds modest correlations between specificity and affect of tweets with supervised intergroup relationship (IGR) labels. Counterfactual probing further reveals that while neural models finetuned for predicting IGR labels reliably use affect in classification, the model's usage of specificity is inconclusive. Code and data can be found at: https://github.com/venkatasg/intergroup-probing

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper extends the intergroup bias framing by probing affect and specificity in labeled tweets, but the evidence is thin and the label independence is untested.

read the letter

The paper extends the intergroup bias framing by probing affect and specificity in labeled tweets, but the evidence is thin and the label independence is untested. The new element is applying counterfactual probing to see how models use these two features when predicting intergroup relationship labels. They find modest correlations in the data and that affect comes through in the probes while specificity does not. This is a reasonable follow-up to the 2023 paper. Releasing the code and data is useful and lets others replicate the measurements. The soft spots are clear from the abstract. No sample sizes or statistical details are given, the specificity result is inconclusive, and there is no test of whether the IGR labels were created without influence from affect or specificity. That last point matters because if the labels already embed those signals, the probing results lose their force. The work stays empirical and does not have any formal derivations or machine-checked proofs. This paper is for people tracking developments in measuring bias through social context rather than just negative words. A reader working on similar probing methods could get some ideas from the setup. It deserves a serious referee because the extension is direct and the materials are public, though the authors should expect questions on the label construction process.

Referee Report

3 major / 2 minor

Summary. The paper investigates whether pragmatic features of specificity and affect systematically vary across intergroup social contexts in tweets, using supervised intergroup relationship (IGR) labels from prior work. Preliminary analysis reports modest correlations between these features and IGR labels. Counterfactual probing of neural models fine-tuned to predict IGR labels indicates reliable use of affect but inconclusive results for specificity. Code and data are released.

Significance. If the empirical measurements and probing results hold after addressing methodological gaps, the work would usefully connect a revised framing of intergroup bias to concrete language features, extending bias analysis beyond pejorative language. The release of code and data supports reproducibility and is a clear strength.

major comments (3)

[§3 and §4] §3 (Methods) and §4 (Results): The abstract and reported findings supply no sample sizes, statistical tests, p-values, or effect sizes for the 'modest correlations' or the counterfactual probing experiments. Without these, the reliability of the affect-usage claim and the inconclusive specificity result cannot be assessed.
[§2 and §4] §2 (Background) and §4: The manuscript does not test or discuss whether the supervised IGR labels themselves encode affect or specificity signals (e.g., via annotator heuristics). If present, this would make both the correlation analysis and the probing results circular, as the model would recover label-construction artifacts rather than independent intergroup context.
[§4] §4 (Counterfactual probing): No details are provided on how the counterfactual examples were constructed or validated (e.g., minimal edits, human verification of label preservation). This detail is load-bearing for interpreting the affect vs. specificity contrast.

minor comments (2)

[Abstract] Abstract: 'focues' is a typo for 'focuses'.
[References] The citation to Govindarajan et al. (2023) should be expanded with a full reference entry and a brief summary of their IGR labeling procedure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important methodological details needed to strengthen the paper. We address each major comment below and will incorporate revisions accordingly.

read point-by-point responses

Referee: [§3 and §4] §3 (Methods) and §4 (Results): The abstract and reported findings supply no sample sizes, statistical tests, p-values, or effect sizes for the 'modest correlations' or the counterfactual probing experiments. Without these, the reliability of the affect-usage claim and the inconclusive specificity result cannot be assessed.

Authors: We agree that the absence of these statistical details limits assessment of the findings. The revised manuscript will report sample sizes, statistical tests (e.g., correlation coefficients with p-values), and effect sizes for the preliminary correlation analyses as well as for the counterfactual probing results. revision: yes
Referee: [§2 and §4] §2 (Background) and §4: The manuscript does not test or discuss whether the supervised IGR labels themselves encode affect or specificity signals (e.g., via annotator heuristics). If present, this would make both the correlation analysis and the probing results circular, as the model would recover label-construction artifacts rather than independent intergroup context.

Authors: This concern is well-taken. The IGR labels are taken from prior work without additional validation in this manuscript for affect or specificity signals. The revision will add an explicit discussion of this potential circularity risk and include an analysis checking whether the labels encode such signals (e.g., via correlation with the measured features or annotator guideline review). revision: yes
Referee: [§4] §4 (Counterfactual probing): No details are provided on how the counterfactual examples were constructed or validated (e.g., minimal edits, human verification of label preservation). This detail is load-bearing for interpreting the affect vs. specificity contrast.

Authors: We acknowledge that the current manuscript lacks these construction and validation details. The revised version will provide a full description of the counterfactual generation process, including the criteria for minimal edits, the specific perturbations applied for affect and specificity, and any human validation steps performed to confirm that the intergroup relationship label is preserved. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurement and probing on given labels

full rationale

The paper conducts preliminary correlation analysis between IGR labels and pragmatic features, followed by counterfactual probing of finetuned classifiers. No equations, fitted parameters, or derivations are present that reduce to their own inputs by construction. IGR labels are treated as an external supervised input; the analysis measures model behavior relative to those labels rather than deriving the labels or claims from the measured features. Self-citation to Govindarajan et al. (2023) supplies the framing and labels but is not load-bearing for any reduction of the reported probing results. This is a standard empirical study whose central claims remain independent of the inputs under test.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all quantities appear to be measured from data or model outputs.

pith-pipeline@v0.9.0 · 5676 in / 925 out tokens · 38329 ms · 2026-05-24T08:16:04.413779+00:00 · methodology

Counterfactual Probing for the Influence of Affect and Specificity on Intergroup Bias

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)