
arxiv: 2602.17262 · v2 · submitted 2026-02-19 · 💻 cs.CL · stat.ME

Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study

Pith reviewed 2026-05-15 21:21 UTC · model grok-4.3

classification 💻 cs.CL stat.ME
keywords socially desirable responding · LLM questionnaire evaluation · graded forced-choice · Big Five inventory · response bias · psychometric assessment · synthetic personas · instruction following

The pith

Desirability-matched graded forced-choice questionnaires reduce socially desirable responding in LLMs while preserving recovery of intended persona profiles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard Likert questionnaires used to assess LLMs for personality traits, bias, and safety produce large socially desirable responding effects when models are given fake-good instructions. The paper quantifies this bias as a standardized effect size from IRT latent scores by comparing honest and faking conditions on synthetic personas with known targets. It then builds a graded forced-choice Big Five inventory by pairing items from different domains through constrained optimization so that each pair has matched desirability. Across nine instruction-following LLMs, the forced-choice format substantially lowers SDR compared with Likert scales and still recovers the target profiles for most models. The results point to a model-specific trade-off between bias reduction and profile fidelity.
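The effect-size idea in the paragraph above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's estimator: the pooled-standard-deviation form, the function name `sdr_effect_size`, the `desirable_direction` sign convention, and the toy latent scores are all assumptions made here.

```python
from math import sqrt
from statistics import mean, stdev


def sdr_effect_size(honest, fake_good, desirable_direction=1):
    """Direction-corrected standardized mean difference (Cohen's d style).

    honest, fake_good: latent scores (e.g. IRT theta estimates) for one trait
        under the two instruction conditions.
    desirable_direction: +1 if higher scores are the socially desirable pole,
        -1 if lower scores are. This sign convention is an assumption.
    """
    n_h, n_f = len(honest), len(fake_good)
    # Pooled variance across the two conditions (assumed form).
    pooled_var = (
        (n_h - 1) * stdev(honest) ** 2 + (n_f - 1) * stdev(fake_good) ** 2
    ) / (n_h + n_f - 2)
    d = (mean(fake_good) - mean(honest)) / sqrt(pooled_var)
    # Flip the sign so that positive always means "shifted toward desirable".
    return desirable_direction * d


# Toy latent scores, invented for illustration only.
honest_scores = [0.0, 0.2, -0.1, 0.1]
fake_good_scores = [1.0, 1.2, 0.9, 1.1]
sdr = sdr_effect_size(honest_scores, fake_good_scores)
```

With the direction correction, a trait whose desirable pole is the low end gives a positive value when scores drop under fake-good instructions, so effect sizes stay comparable across traits.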

Core claim

When the same personality inventory is given under honest versus fake-good instructions, Likert-style items produce consistently large SDR effect sizes, whereas a desirability-matched graded forced-choice version attenuates those effects while still allowing recovery of the synthetic target profiles in the majority of the nine tested models.

What carries the argument

A graded forced-choice Big Five inventory whose 30 cross-domain item pairs were chosen by constrained optimization to equalize social desirability.
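To make the pairing idea concrete, here is a small Python sketch that selects cross-domain pairs with closely matched desirability. It is a greedy heuristic standing in for the paper's constrained optimization, whose objective and constraints are not given here. The squared-gap criterion, the function name `select_matched_pairs`, and the toy ratings are assumptions for illustration.

```python
from itertools import combinations


def select_matched_pairs(items, n_pairs):
    """Greedy heuristic: pick cross-domain pairs with the smallest desirability gap.

    items: list of (item_id, domain, desirability) tuples.
    Returns up to n_pairs (item_id_a, item_id_b) tuples; no item is used twice.
    """
    candidates = [
        ((a[2] - b[2]) ** 2, a[0], b[0])
        for a, b in combinations(items, 2)
        if a[1] != b[1]  # cross-domain constraint
    ]
    candidates.sort()  # smallest squared desirability gap first
    used, pairs = set(), []
    for _, id_a, id_b in candidates:
        if id_a in used or id_b in used:
            continue  # each item may appear in at most one pair
        pairs.append((id_a, id_b))
        used.update((id_a, id_b))
        if len(pairs) == n_pairs:
            break
    return pairs


# Toy item pool; the ratings are made up and are not data from the paper.
pool = [
    ("E1", "E", 4.1), ("E2", "E", 2.0),
    ("A1", "A", 4.0), ("A2", "A", 3.1),
    ("C1", "C", 3.0), ("C2", "C", 2.2),
]
pairs = select_matched_pairs(pool, n_pairs=3)
```

A greedy pass is not guaranteed to find the globally best set of pairs. An exact approach would treat the selection as an integer program or a matching problem, and could add further constraints such as domain balance.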

If this is right

  • SDR can be measured and compared across different questionnaire constructs using direction-corrected IRT effect sizes.
  • Desirability-matched forced-choice formats lower SDR in LLM evaluations relative to standard Likert scales.
  • Profile recovery remains largely intact for most models under the forced-choice format.
  • Benchmarking and auditing of LLMs with questionnaires should include SDR-aware controls or alternative formats.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar forced-choice designs could be adapted for safety or bias audits that currently rely on self-report items.
  • The observed model-dependent trade-off implies that no single format will be optimal for every LLM and may require per-model calibration.
  • Extending the method beyond the Big Five to other inventories would test whether the SDR reduction generalizes.

Load-bearing premise

The assumption that synthetic personas with known targets accurately represent how LLMs respond in real evaluation settings and that the optimization truly equalizes desirability without creating new response artifacts.

What would settle it

Re-administer the GFC inventory to the same LLMs using only natural prompts without any persona instruction and check whether SDR remains as low as in the synthetic tests.

Original abstract

Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding; in evaluative contexts, LLMs can instead gravitate toward socially preferred answers, a form of socially desirable responding (SDR), biasing questionnaire-derived scores and downstream conclusions. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs. To quantify SDR, the same inventory is administered under HONEST versus FAKE-GOOD instructions, and SDR is computed as a direction-corrected standardized effect size from item response theory (IRT)-estimated latent scores. This enables comparisons across constructs and response formats, as well as against human instructed-faking benchmarks. For mitigation, we construct a graded forced-choice (GFC) Big Five inventory by selecting 30 cross-domain pairs from an item pool via constrained optimization to match desirability. Across nine instruction-following LLMs evaluated on synthetic personas with known target profiles, Likert-style questionnaires show consistently large SDR, whereas desirability-matched GFC substantially attenuates SDR while largely preserving the recovery of the intended persona profiles. These results highlight a model-dependent SDR-recovery trade-off and motivate SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to introduce a psychometric framework for quantifying SDR in LLMs via IRT-based latent scores from honest vs. fake-good instruction conditions on questionnaires, and to mitigate it by constructing a desirability-matched graded forced-choice (GFC) Big Five inventory through constrained optimization over cross-domain item pairs. Across nine instruction-following LLMs tested on synthetic personas with known target profiles, Likert formats exhibit large SDR while the GFC format substantially reduces it with largely preserved profile recovery, highlighting a model-dependent trade-off.

Significance. If the central results hold after addressing the noted gaps, the work would be significant for NLP evaluation practices: it supplies a reusable, cross-construct metric for SDR and a concrete mitigation format that could improve the validity of questionnaire-based auditing for persona consistency, safety, and bias. The explicit comparison to human instructed-faking benchmarks and the emphasis on model-dependent recovery trade-offs are strengths that could influence reporting standards in LLM benchmarking.

major comments (3)
  1. [Abstract / Methods] Abstract and Methods section: the exact IRT model (e.g., graded response model parameters, estimation procedure) is not specified, nor are the statistical significance tests or error bars on the reported SDR effect sizes; without these, the claims of 'consistently large SDR' and 'substantial attenuation' under GFC cannot be fully evaluated.
  2. [Methods (GFC construction)] GFC inventory construction: the constrained optimization used to select the 30 cross-domain pairs and enforce desirability matching is described only at a high level; the specific objective function, constraints, and any post-selection validation against response artifacts are missing, which directly affects the weakest assumption that the format equalizes desirability without introducing new biases.
  3. [Experiments / Results] Evaluation on synthetic personas: the headline result that GFC attenuates SDR while preserving target-profile recovery rests entirely on personas whose target profiles are explicitly instructed in the prompt; no ablation or comparison against open-ended persona descriptions or downstream-task prompts is reported, leaving open whether the observed SDR reduction generalizes beyond the injection method.
minor comments (2)
  1. [Results] The paper should include a table or figure reporting per-model SDR effect sizes with confidence intervals to allow readers to assess the model-dependent trade-off quantitatively.
  2. [Methods] Clarify the exact number of items per domain in the original pool and the precise desirability-matching criterion (e.g., absolute difference threshold) used in the optimization.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for improving the clarity and rigor of our presentation. We address each major point below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and Methods section: the exact IRT model (e.g., graded response model parameters, estimation procedure) is not specified, nor are the statistical significance tests or error bars on the reported SDR effect sizes; without these, the claims of 'consistently large SDR' and 'substantial attenuation' under GFC cannot be fully evaluated.

    Authors: We agree that the IRT specification was insufficiently detailed. In the revised Methods section we will explicitly describe the use of the graded response model (GRM) with item parameters estimated via marginal maximum likelihood in the mirt R package, including the specific discrimination and threshold parameters retained after calibration on the item pool. We will also add standard-error bars to all SDR effect-size plots and report the results of paired t-tests (with Bonferroni correction) comparing HONEST vs. FAKE-GOOD latent scores, thereby allowing readers to evaluate the statistical reliability of the 'large' and 'substantial attenuation' claims. revision: yes

  2. Referee: [Methods (GFC construction)] GFC inventory construction: the constrained optimization used to select the 30 cross-domain pairs and enforce desirability matching is described only at a high level; the specific objective function, constraints, and any post-selection validation against response artifacts are missing, which directly affects the weakest assumption that the format equalizes desirability without introducing new biases.

    Authors: We accept that the optimization procedure requires fuller specification. The revised manuscript will state the exact objective: minimize the sum of squared differences in mean desirability ratings (from an independent rater pool) across the 30 selected pairs while maximizing facet coverage under the constraint that each pair contains one item from each of two distinct Big Five domains and that no item is reused. We will also report the post-selection checks performed: verification that response-option distributions in a small pilot sample showed no extreme-response bias and that item-total correlations remained within acceptable bounds. These additions directly address the concern about unintended biases. revision: yes

  3. Referee: [Experiments / Results] Evaluation on synthetic personas: the headline result that GFC attenuates SDR while preserving target-profile recovery rests entirely on personas whose target profiles are explicitly instructed in the prompt; no ablation or comparison against open-ended persona descriptions or downstream-task prompts is reported, leaving open whether the observed SDR reduction generalizes beyond the injection method.

    Authors: The referee is correct that the current evaluation is limited to explicitly instructed target profiles. This design choice was deliberate to obtain ground-truth recovery metrics, yet it does leave open questions of generalization. In the revision we will add a new Limitations subsection that explicitly discusses the scope of the synthetic-persona paradigm and outlines planned follow-up experiments using open-ended descriptions and downstream-task prompts. Because the requested ablations would require new data collection and analysis beyond the present study, we will not perform them for this revision but will commit to addressing them in subsequent work. revision: partial

Circularity Check

0 steps flagged

No significant circularity in SDR quantification or GFC construction

full rationale

The paper's core quantification of SDR is defined directly as the direction-corrected standardized effect size between IRT latent scores obtained under HONEST versus FAKE-GOOD instructions; this difference is computed from the data and does not reduce to any fitted parameter, self-citation, or ansatz by the paper's equations. The GFC inventory is assembled by constrained optimization that selects cross-domain pairs to equalize desirability, yet the subsequent empirical finding that GFC attenuates SDR while preserving persona-profile recovery is measured on held-out synthetic personas rather than being true by construction. No load-bearing self-citations, uniqueness theorems imported from prior author work, or renamings of known results appear in the derivation chain. The overall procedure remains self-contained against external human faking benchmarks and does not collapse the central claims into their inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard IRT assumptions for scoring and the premise that desirability can be matched via optimization without side effects; no new entities are postulated.

free parameters (1)
  • optimization constraints for pair selection
    The constrained optimization that selects 30 cross-domain pairs to match desirability likely involves tunable thresholds or weights.
axioms (1)
  • domain assumption: Item response theory model assumptions hold for LLM responses under different instructions
    Used to estimate latent scores from which SDR effect size is derived.
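
To show what the axiom above takes for granted, the following Python sketch computes category probabilities under a graded response model, the IRT family the referee report names as an example. Whether the paper actually uses this model is not confirmed here. The function name `grm_category_probs` and the parameter values are illustrative assumptions.

```python
from math import exp


def grm_category_probs(theta, discrimination, thresholds):
    """Category probabilities under a graded response model (Samejima form).

    theta: latent trait level of the respondent.
    discrimination: item slope parameter a.
    thresholds: increasing threshold parameters b_1 < ... < b_{K-1}.
    Returns a list of K probabilities, one per ordered response category.
    """
    def logistic(x):
        return 1.0 / (1.0 + exp(-x))

    # Cumulative P(X >= k) for k = 1..K-1, padded with 1 (k = 0) and 0 (k = K).
    cumulative = (
        [1.0]
        + [logistic(discrimination * (theta - b)) for b in thresholds]
        + [0.0]
    )
    # Each category probability is the difference of adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


# Hypothetical four-category item; parameter values are invented.
probs_mid = grm_category_probs(0.0, 1.5, [-1.0, 0.0, 1.0])
probs_high = grm_category_probs(2.0, 1.5, [-1.0, 0.0, 1.0])
```

The model assumes that one latent trait drives each response and that the thresholds are ordered. If FAKE-GOOD instructions alter how an LLM uses the response scale itself, those assumptions may not hold equally in both conditions.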

pith-pipeline@v0.9.0 · 5540 in / 1226 out tokens · 22048 ms · 2026-05-15T21:21:38.399899+00:00 · methodology
