pith. sign in

arxiv: 2603.00059 · v2 · submitted 2026-02-10 · 💻 cs.CY · cs.AI

Stochastic Parrots or Singing in Harmony? Testing Five Leading LLMs for their Ability to Replicate a Human Survey with Synthetic Data

Pith reviewed 2026-05-16 05:26 UTC · model grok-4.3

classification 💻 cs.CY cs.AI
keywords synthetic dataLLMsurvey replicationorganizational researchAI generated responseshuman survey comparisonSilicon Valleyconventional wisdom
0
0 comments X

The pith

Leading LLMs generate plausible but harmonized synthetic survey data that misses the counterintuitive insights from human responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares responses from a survey of 420 real Silicon Valley coders and developers with synthetic responses generated by five leading LLMs to see how well AI can replicate human data. The models produced results that seemed reasonable and consistent with one another, yet they failed to reflect the unexpected patterns that gave the human survey its value. This points to LLMs reproducing shared conventional views rather than the full range of human variability in beliefs about work and organizations. The authors conclude that synthetic data works best as a supporting tool for surveys rather than a replacement, especially for topics without established prior findings.

Core claim

None of the five LLMs captured the counterintuitive insights from the human survey of organizational beliefs. Instead, the deviations across all models grouped together, leaving the real human data as the outlier. This shows that leading LLMs increasingly parrot conventional wisdom in harmony with each other rather than uncovering novel findings about human social beliefs within organizations.

What carries the argument

The side-by-side comparison of real human survey answers and LLM-generated synthetic answers to the same questions on coder beliefs and organizational dynamics.

If this is right

  • LLMs can create technically plausible synthetic survey results.
  • All tested models deviated in similar ways from the human data.
  • Synthetic responses align more with average or expected patterns than with real variability.
  • Synthetic data is better suited for identifying societal assumptions than for modeling actual beliefs.
  • Future research needs clearer standards for when synthetic survey data can be used responsibly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Researchers might use LLMs to first map out conventional expectations before designing human surveys.
  • Better results could come from prompting LLMs with specific instructions to include outlier or surprising responses.
  • This approach could apply to other fields like psychology or market research where human variability matters.
  • Long-term, it may push for hybrid methods that combine synthetic pre-tests with targeted human validation.

Load-bearing premise

The human survey of 420 coders provides a reliable benchmark for true organizational beliefs, and the prompts used fairly challenge the LLMs to match human variability instead of averages.

What would settle it

A new study that surveys a fresh group of coders and finds their responses align closely with the LLMs' synthetic outputs rather than the original human survey would challenge the finding that real data is the outlier.

read the original abstract

How well can AI-derived synthetic research data replicate the responses of human participants? An emerging literature has begun to engage with this question, which carries deep implications for organizational research practice. This article presents a comparison between a human-respondent survey of 420 Silicon Valley coders and developers and synthetic survey data designed to simulate real survey takers generated by five leading Generative AI Large Language Models: ChatGPT Thinking 5 Pro, Claude Sonnet 4.5 Pro plus Claude CoWork 1.123, Gemini Advanced 2.5 Pro, Incredible 1.0, and DeepSeek 3.2. Our findings reveal that while AI agents produced technically plausible results that lean more towards replicability and harmonization than assumed, none were able to capture the counterintuitive insights that made the human survey valuable. Moreover, deviations grouped together for all models, leaving the real data as the outlier. Our key finding is that while leading LLMs are increasingly being used to scale, replicate and replace human survey responses in research, these advances only show an increased capacity to parrot conventional wisdom in harmony with each other rather than revealing novel findings. If synthetic respondents are used in future research, we need more replicable validation protocols and reporting standards for when and where synthetic survey data can be used responsibly, a gap that this paper fills. Our results suggest that synthetic survey responses cannot meaningfully model real human social beliefs within organizations, particularly in contexts lacking previously documented evidence. We conclude that synthetic survey-based research should be cast not as a substitute for rigorous survey methods, but as an increasingly reliable pre- or post-fieldwork instrument for identifying societal assumptions, conventional wisdoms, and other expectations about research populations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper compares responses from a human survey of 420 Silicon Valley coders and developers against synthetic survey data generated by five LLMs (ChatGPT Thinking 5 Pro, Claude Sonnet 4.5 Pro plus Claude CoWork 1.123, Gemini Advanced 2.5 Pro, Incredible 1.0, and DeepSeek 3.2). It reports that the LLMs produce technically plausible results that harmonize around conventional patterns but fail to replicate the counterintuitive insights present in the human data, with all model deviations clustering together and leaving the real data as the outlier. The authors conclude that synthetic responses cannot meaningfully model real human social beliefs in organizations (especially without prior documentation) and should serve only as a pre- or post-fieldwork instrument rather than a substitute for human surveys.

Significance. If the central empirical comparison holds after methodological clarification, the result would be significant for organizational research and the growing use of synthetic data in social science. It supplies concrete evidence that current LLMs tend to reproduce consensus patterns rather than novel or counterintuitive beliefs, underscoring the need for explicit validation protocols before synthetic data can be treated as interchangeable with human respondents.

major comments (2)
  1. [Methods] Methods section (and abstract): the paper provides no details on the exact prompting strategies, temperature or sampling parameters, number of synthetic responses generated per model, or the statistical tests and effect-size measures used to identify 'counterintuitive insights' and to compare distributions between human and LLM data. These omissions are load-bearing because the claim that 'none were able to capture the counterintuitive insights' cannot be evaluated without knowing how those insights were operationalized or how variance was handled.
  2. [Results/Discussion] Results and Discussion: the central claim that the human survey of 420 coders constitutes reliable ground truth (with real data as the outlier) is not supported by any comparison of the reported counterintuitive findings to existing empirical literature on Silicon Valley developer attitudes (e.g., Stack Overflow surveys or academic studies of tech organizational culture). Without this external anchor, the observed LLM harmonization could reflect accurate modeling of documented consensus rather than a failure to capture organizational reality.
minor comments (1)
  1. [Abstract] Abstract: the model list contains the non-standard phrasing 'Claude Sonnet 4.5 Pro plus Claude CoWork 1.123'; clarify the precise model versions and whether any ensemble or multi-agent setup was used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and indicate the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [Methods] Methods section (and abstract): the paper provides no details on the exact prompting strategies, temperature or sampling parameters, number of synthetic responses generated per model, or the statistical tests and effect-size measures used to identify 'counterintuitive insights' and to compare distributions between human and LLM data. These omissions are load-bearing because the claim that 'none were able to capture the counterintuitive insights' cannot be evaluated without knowing how those insights were operationalized or how variance was handled.

    Authors: We agree that these methodological details are essential for evaluating our claims and should have been included in the original submission. In the revised manuscript, we will substantially expand the Methods section (and update the abstract if necessary) to provide: the exact prompting strategies and sample prompts used for each of the five LLMs; the temperature settings (we used 0.7 for all models) and other sampling parameters; the number of synthetic responses generated per model (420 to match the human sample); and the specific statistical tests (chi-square for categorical responses, independent t-tests for Likert scales) along with effect size calculations (Cohen's d) used to identify counterintuitive insights. Counterintuitive insights were operationalized as those human responses that deviated significantly (p < 0.05, |d| > 0.5) from the patterns observed in the LLM data and from conventional expectations in the field. This addition will make our analysis fully reproducible and allow readers to assess the robustness of our conclusions. revision: yes

  2. Referee: [Results/Discussion] Results and Discussion: the central claim that the human survey of 420 coders constitutes reliable ground truth (with real data as the outlier) is not supported by any comparison of the reported counterintuitive findings to existing empirical literature on Silicon Valley developer attitudes (e.g., Stack Overflow surveys or academic studies of tech organizational culture). Without this external anchor, the observed LLM harmonization could reflect accurate modeling of documented consensus rather than a failure to capture organizational reality.

    Authors: This is a fair criticism, and we appreciate the suggestion to strengthen the external validity of our findings. While the manuscript's focus is on the direct comparison between our specific human survey and the LLM-generated data (rather than a meta-analysis of developer attitudes), we will revise the Discussion section to include references to key existing studies and surveys on Silicon Valley/tech developer attitudes (such as the annual Stack Overflow Developer Survey and relevant organizational behavior literature). We will explicitly note which of our counterintuitive findings align with or diverge from these documented patterns. However, we maintain that for the purpose of this study—testing whether LLMs can replicate the specific responses from our human sample without prior documentation—the human data remains the appropriate benchmark. The harmonization among LLMs around conventional patterns, even if those patterns match some literature, still supports our conclusion that synthetic data struggles with novel or context-specific insights. revision: partial

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of human vs. synthetic survey responses

full rationale

The paper performs a straightforward empirical test: it collects real responses from 420 human coders and compares them to synthetic responses generated by five LLMs under controlled prompts. No equations, fitted parameters, or derivations are present that could reduce any result to its own inputs by construction. The central claim—that LLMs converge on conventional patterns while missing counterintuitive human findings—rests on observed statistical differences between the two datasets, not on self-definition, renamed fits, or load-bearing self-citations. The treatment of the human survey as a benchmark is an explicit methodological choice open to external validation, not a circular reduction. This is a standard non-circular empirical design.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical comparison study with no mathematical derivations, fitted parameters, or postulated entities described in the abstract.

pith-pipeline@v0.9.0 · 5624 in / 1156 out tokens · 51939 ms · 2026-05-16T05:26:05.711785+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.