Generating Public Health Responses using Survey-Augmented Large Language Models

Alina Hyk; Chunyang Liao; Julia Rezvani; Konstantinos Mitsopoulos; Leonardo Marciaga; Raffaele Vardavas; Thuyen Pham

arXiv: 2606.21820 · v1 · pith:DMEVUVG2new · submitted 2026-06-20 · 💻 cs.SI · cs.AI · cs.CL

Leonardo Marciaga, Thuyen Pham, Julia Rezvani, Alina Hyk, Chunyang Liao, Konstantinos Mitsopoulos, Raffaele Vardavas

Pith reviewed 2026-06-26 11:22 UTC · model grok-4.3

classification 💻 cs.SI · cs.AI · cs.CL

keywords synthetic survey data · large language models · epidemic modeling · vaccination attitudes · public health responses · data augmentation · cluster-informed prompting · FluPaths survey

The pith

Large language models generate synthetic survey data that matches real marginal distributions of demographics and health attitudes but not their joint variation within individuals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs can create synthetic responses to health surveys that mirror real population patterns in vaccination attitudes and behaviors. This matters because real surveys are expensive and limited in scope, so synthetic data could help build epidemiological models without repeated large-scale collection. The authors first cluster real FluPaths survey respondents by attitudes, then use those groupings to prompt several LLMs across epidemic waves. The generated responses reproduce overall distributions of characteristics, beliefs, perceptions, and behaviors, and some models track group-level vaccination trends across waves. However, the synthetic records remain distinguishable from real ones, and within-person correlations are not captured well.
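
The gap between matching marginals and matching joint variation can be made concrete with a toy example. The sketch below is not taken from the paper: the two-column records (a belief flag and a vaccination flag) and all counts are hypothetical, chosen so that both datasets have identical marginals while only the "real" one links belief to behavior.

```python
from collections import Counter

# Hypothetical toy records: (believes_vaccine_works, vaccinated), both coded 0/1.
real = [(1, 1)] * 40 + [(0, 0)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 10
synthetic = [(1, 1)] * 25 + [(0, 0)] * 25 + [(1, 0)] * 25 + [(0, 1)] * 25

def marginal(records, column):
    """Share of each value in one column, ignoring every other column."""
    counts = Counter(row[column] for row in records)
    return {value: counts[value] / len(records) for value in sorted(counts)}

def pearson(xs, ys):
    """Plain Pearson correlation; assumes neither column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def belief_behavior_correlation(records):
    """Correlation between the two columns across respondents."""
    return pearson([row[0] for row in records], [row[1] for row in records])

for name, data in (("real", real), ("synthetic", synthetic)):
    print(name, marginal(data, 0), marginal(data, 1),
          round(belief_behavior_correlation(data), 2))
```

Both datasets report a 50/50 split on each column, yet the belief-behavior correlation is 0.6 in the toy "real" data and 0 in the toy "synthetic" data. A marginal-only check would treat them as equivalent.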

Core claim

Using a cluster-informed prompting approach derived from longitudinal FluPaths survey data, LLMs produce synthetic responses whose marginal distributions of demographic characteristics, vaccination-related beliefs, risk perceptions, and health behaviors align with observed survey data, with varying reliability across models and waves in reproducing group-level vaccination trends, although a trained classifier can still identify the records as synthetic.

What carries the argument

Cluster-informed prompting strategy that identifies groups with positive or negative vaccination attitudes from real survey data and incorporates those groupings into LLM prompts to generate new responses.
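
A minimal sketch of what such a strategy might look like follows. It is an illustration under stated assumptions, not the authors' pipeline: the respondent fields (`vaccine_works`, `vaccine_safe`, `vaccinated`), the one-dimensional two-means clustering, and the helper names are all hypothetical. The "Generate data for exactly ..." instruction and the sentence after it follow wording from the paper's prompts; the rest of the prompt text is invented for illustration.

```python
from statistics import mean

# Hypothetical respondents: two belief items on a 1-5 agreement scale plus a behavior flag.
respondents = [
    {"id": 1, "vaccine_works": 5, "vaccine_safe": 4, "vaccinated": 1},
    {"id": 2, "vaccine_works": 4, "vaccine_safe": 5, "vaccinated": 1},
    {"id": 3, "vaccine_works": 2, "vaccine_safe": 1, "vaccinated": 0},
    {"id": 4, "vaccine_works": 1, "vaccine_safe": 2, "vaccinated": 0},
    {"id": 5, "vaccine_works": 4, "vaccine_safe": 4, "vaccinated": 0},
    {"id": 6, "vaccine_works": 2, "vaccine_safe": 2, "vaccinated": 1},
]

def attitude_score(person):
    """Average of the two belief items; higher means more favorable."""
    return mean([person["vaccine_works"], person["vaccine_safe"]])

def two_means(scores, iterations=20):
    """Tiny 1-D k-means with k=2. Returns one label per score (1 = higher centre)."""
    low, high = min(scores), max(scores)
    labels = [0] * len(scores)
    for _ in range(iterations):
        labels = [int(abs(s - high) < abs(s - low)) for s in scores]
        lows = [s for s, lab in zip(scores, labels) if lab == 0]
        highs = [s for s, lab in zip(scores, labels) if lab == 1]
        if lows:
            low = mean(lows)
        if highs:
            high = mean(highs)
    return labels

def describe_cluster(label, members):
    """Turn one cluster into a natural-language summary for the prompt."""
    stance = "broadly positive" if label == 1 else "broadly negative"
    rate = mean(p["vaccinated"] for p in members)
    return (f"Cluster {label}: {stance} attitudes toward vaccination; "
            f"{len(members)} respondents, observed vaccination rate {rate:.0%}.")

def build_prompt(descriptions, n_outputs, wave):
    """Assemble the cluster descriptions and instructions into one prompt string."""
    lines = [f"You are generating synthetic survey responses for wave {wave}.",
             "Cluster descriptions derived from real respondents:"]
    lines += [f"- {d}" for d in descriptions]
    lines.append(f"Generate data for exactly {n_outputs} individuals "
                 "conforming to the given metadata and examples.")
    lines.append("Use the provided examples and cluster descriptions to "
                 "recreate those patterns to the best of your ability.")
    return "\n".join(lines)

scores = [attitude_score(p) for p in respondents]
labels = two_means(scores)
descriptions = [
    describe_cluster(label, [p for p, lab in zip(respondents, labels) if lab == label])
    for label in (0, 1)
]
prompt = build_prompt(descriptions, n_outputs=10, wave=3)
print(prompt)
```

The resulting string would then be sent to whichever LLM is under evaluation. That call is omitted here because it depends on the provider.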

If this is right

The synthetic data can serve as a tool for exploratory data augmentation in agent-based epidemic modeling.
Performance in matching group-level vaccination trends varies by model and across different epidemic waves.
Generated responses should not replace human survey data without further methodological improvements and validation.
Some LLMs reproduce the observed group-level trends more reliably than others.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Improving capture of within-respondent correlations could allow synthetic data to support more detailed simulations of individual health decisions during outbreaks.
The same prompting approach might be tested on surveys about other behaviors, such as economic choices or social attitudes.
Periodic comparison against newly collected real surveys would be required to check whether the method remains useful as population patterns shift.

Load-bearing premise

That prompting LLMs with groupings taken from real survey clusters will make the generated responses statistically close enough to unobserved population patterns to be useful for exploratory epidemic modeling.

What would settle it

A fresh, independent survey under comparable conditions whose joint distributions or within-respondent correlations differ markedly from those in the synthetic data, or where a real-versus-synthetic classifier reaches near-perfect accuracy on new records.
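
The real-versus-synthetic classifier test can be sketched as follows. The paper's actual classifier is not described on this page, so the lookup classifier below (majority label per observed answer combination) is a stand-in chosen for brevity, and the toy records are hypothetical. Held-out accuracy near 0.5 would mean the two sources are hard to tell apart; accuracy well above 0.5 means the synthetic records remain identifiable.

```python
import random
from collections import Counter, defaultdict

def real_vs_synthetic_accuracy(real, synthetic, test_fraction=0.3, seed=0):
    """Held-out accuracy of a simple classifier that separates real from synthetic rows."""
    labelled = [(row, 0) for row in real] + [(row, 1) for row in synthetic]
    rng = random.Random(seed)
    rng.shuffle(labelled)
    n_test = int(len(labelled) * test_fraction)
    test, train = labelled[:n_test], labelled[n_test:]

    # "Training": count how often each answer combination came from each source.
    votes = defaultdict(Counter)
    for row, label in train:
        votes[row][label] += 1
    fallback = Counter(label for _, label in train).most_common(1)[0][0]

    def predict(row):
        # Majority source for this combination; fall back to the overall majority.
        return votes[row].most_common(1)[0][0] if row in votes else fallback

    correct = sum(predict(row) == label for row, label in test)
    return correct / len(test)

# Hypothetical toy data: identical marginals, different joint structure.
real = [(1, 1)] * 400 + [(0, 0)] * 400 + [(1, 0)] * 100 + [(0, 1)] * 100
synthetic = [(1, 1)] * 250 + [(0, 0)] * 250 + [(1, 0)] * 250 + [(0, 1)] * 250

accuracy = real_vs_synthetic_accuracy(real, synthetic)
print(f"held-out accuracy: {accuracy:.2f}")
```

Because the classifier sees whole answer combinations, it can exploit differences in joint structure even when every single-column distribution matches.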

Figures

Figures reproduced from arXiv: 2606.21820 by Alina Hyk, Chunyang Liao, Julia Rezvani, Konstantinos Mitsopoulos, Leonardo Marciaga, Raffaele Vardavas, Thuyen Pham.

**Figure 1.** Step 1: Data preparation and generation of cluster descriptions. An example of cluster descriptions and cluster samples. responses for each question. The objective of this step is to transform original metadata (inconsistently formatted) into standardized natural-language summaries. Specifically, we standardize dataset information into a structured format consisting of the dataset’s context, target varia…

**Figure 2.** Step 2: Formatted metadata generation. An example of metadata and formatted metadata. Evaluation: We evaluate the quality of the generated data by measuring its similarity in data spread and distribution shape compared to the original data. We adopt four metrics to evaluate data similarity, which are the shape, the trend, the Classifier Two-Sample Test (CS2T), and the overall score. Those four metrics are w…

**Figure 3.** Step 3: Synthetic data prompting. We use the outputs produced in the previous stages of the workflow as inputs to prompt the synthetic data generation. was applied. For each respondent in each wave, we build another variable, already_vacc, which pools the information from already_vaccinated_flu and already_vaccinated_flu_rec. Namely, it will equal True if at least one of already_vaccinated_flu and al…

**Figure 4.** Real and synthetic vaccination rates across waves. We then separately compare the real and synthetic vaccination rates within each cluster. For Waves 3 and 5, Claude Sonnet 4 overestimates Cluster 1 and underestimates Cluster 2, while GPT 3.5 Turbo does the opposite. GPT 4o Mini and Gemini 1.5 Pro follow a similar trend. In Wave 3, they underestimate both clusters, while they underestimate Cluster 1 but ov…

**Figure 5.** Vaccination rates generated by LLMs using the zero-shot prompting approach. A.2.2. Few-shot Prompting: We present an example of prompts for the few-shot prompting technique. You are tasked with generating realistic synthetic data for flu vaccination patterns using few-shot learning. You have been provided with real examples from human participants across different flu seasons. IMPORTANT: These examples are prov…

**Figure 6.** Vaccination rates generated by LLMs using the few-shot prompting approach.
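
The pooled `already_vacc` variable and the per-cluster comparison described in the captions for Figures 3 and 4 can be sketched together. This is a hedged illustration only: `already_vaccinated_flu` and `already_vaccinated_flu_rec` are the paper's column names, while the `wave` and `cluster` keys, the treatment of missing flags as False, the assumption that synthetic rows share the same schema, and all toy rows are assumptions made for the example.

```python
from collections import defaultdict

def pool_already_vacc(row):
    """True if at least one source flag is set. Missing flags count as False (an assumption)."""
    return bool(row.get("already_vaccinated_flu")) or bool(row.get("already_vaccinated_flu_rec"))

def vaccination_rates(rows):
    """Vaccination rate keyed by (wave, cluster)."""
    totals = defaultdict(lambda: [0, 0])  # [vaccinated, respondents]
    for row in rows:
        key = (row["wave"], row["cluster"])
        totals[key][0] += pool_already_vacc(row)
        totals[key][1] += 1
    return {key: vaccinated / n for key, (vaccinated, n) in totals.items()}

def compare(real_rows, synthetic_rows):
    """Label each (wave, cluster) cell by the direction of the synthetic-minus-real gap."""
    real, synth = vaccination_rates(real_rows), vaccination_rates(synthetic_rows)
    report = {}
    for key in sorted(real.keys() & synth.keys()):
        gap = synth[key] - real[key]
        report[key] = ("overestimates" if gap > 0
                       else "underestimates" if gap < 0 else "matches")
    return report

def row(wave, cluster, flu, flu_rec):
    """Build one hypothetical record."""
    return {"wave": wave, "cluster": cluster,
            "already_vaccinated_flu": flu, "already_vaccinated_flu_rec": flu_rec}

# Hypothetical toy rows for a single wave and two clusters.
real_rows = [row(3, 1, True, False), row(3, 1, False, False),
             row(3, 2, False, True), row(3, 2, False, True)]
synthetic_rows = [row(3, 1, True, True), row(3, 1, False, True),
                  row(3, 2, False, False), row(3, 2, True, False)]

report = compare(real_rows, synthetic_rows)
print(report)
```

Running the same comparison once per model would yield the kind of over- and underestimation pattern the Figure 4 caption describes.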

Original abstract

Epidemiological models often rely on survey data to represent how individuals make health-related decisions, such as whether to vaccinate or adopt protective behaviors. However, repeated large-scale surveys are costly, time-consuming, and limited in the range of scenarios they can capture. In this work, we investigate whether large language models (LLMs) can generate synthetic survey responses that reproduce patterns observed in real populations. Using longitudinal data from the FluPaths surveys, we first identify groups associated with broadly positive or negative attitudes toward vaccination through clustering analysis. We then evaluate several LLMs using a cluster-informed prompting approach to generate synthetic survey responses across multiple epidemic waves. Across models, the synthetic data generally reproduce the distributions of demographic characteristics, vaccination-related beliefs, risk perceptions, and health behaviors observed in the survey data. However, they are less successful at capturing how these factors vary together within respondents. Some models reproduce group-level vaccination trends more reliably than others, although performance varies across waves. We also trained a classifier to distinguish real from synthetic records and found that the generated responses remained identifiable as synthetic. Overall, our findings suggest that LLM-generated survey data may provide a useful tool for exploratory data augmentation and we hope that it could support agent-based epidemic modeling approaches. However, the generated data should not be treated as a substitute for human survey data without further methodological improvements and validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Cluster-informed LLM prompting matches marginals on FluPaths data but fails on joints and stays distinguishable, limiting its value for agent-based models.

The core result here is that several LLMs, prompted with cluster-derived attitude groups from the FluPaths longitudinal surveys, can reproduce the separate distributions of demographics, beliefs, risk perceptions, and behaviors, yet they do not recover how those variables move together inside individual respondents. A classifier still picks out the synthetic records, and the authors state this directly.

What the work actually does is take an established synthetic-data technique, add a clustering step on real survey groupings, and run it across multiple epidemic waves and models. The evaluation is empirical and reports both the marginal successes and the joint failures without claiming the outputs are ready for substitution. That balance is the main thing the paper contributes.

The limitation that matters most is the one the abstract already flags: agent-based models need coherent individual bundles, not just correct averages. If the generated records mis-specify the correlations between, say, negative vaccination beliefs and protective behaviors, they cannot stand in for survey data in the intended downstream use. Variation across waves and models is noted but not analyzed in depth, so it is hard to tell how stable the approach is.

This is the kind of paper that belongs in a methods or data-augmentation track rather than a core epidemiology venue. Readers who are already experimenting with LLMs for survey augmentation will find the concrete failure modes useful. It is worth sending to referees because the evaluation is honest and the dataset is real; the authors do not overstate what they have shown.

Referee Report

1 major / 1 minor

Summary. The manuscript investigates whether LLMs can generate synthetic survey responses that reproduce patterns from the FluPaths longitudinal surveys on vaccination attitudes and health behaviors. Using clustering to identify groups with positive or negative vaccination attitudes, the authors apply cluster-informed prompting to several LLMs to produce synthetic records across epidemic waves. They report that the synthetic data generally match the marginal distributions of demographics, vaccination beliefs, risk perceptions, and behaviors, but are less successful at reproducing within-respondent covariation; classifiers can still distinguish real from synthetic records; and some models better capture group-level trends. The conclusion is that such data may support exploratory augmentation for epidemic modeling but should not substitute for real survey data without further validation.

Significance. If the empirical results hold, the work supplies a transparent, balanced assessment of LLM performance on synthetic public-health survey generation, documenting both marginal-distribution successes and joint-distribution failures. This is directly relevant to researchers exploring data augmentation for agent-based models, as it quantifies the gap between marginal calibration and the individual-level attribute bundles required by such models. The use of real survey clusters for prompting, multi-wave evaluation, and explicit classifier tests are strengths that make the negative findings on joints particularly informative.

major comments (1)

[Abstract] Abstract: The suggestion that the generated data 'could support agent-based epidemic modeling approaches' sits in tension with the explicit finding that the synthetic records are 'less successful at capturing how these factors vary together within respondents.' Agent-based models require consistent per-respondent attribute bundles (e.g., correlated beliefs and behaviors), not merely marginal or group-level agreement; the manuscript should either supply evidence that marginal matching is sufficient for the intended exploratory uses or qualify the modeling claim more narrowly.

minor comments (1)

[Methods] The description of how the identified clusters are translated into prompt text could be expanded with an example prompt template to improve reproducibility of the cluster-informed strategy.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for this constructive observation on the abstract. The comment correctly identifies a tension between our empirical findings on joint distributions and the phrasing of potential applications. We will revise the abstract to qualify the modeling claim more narrowly, consistent with the manuscript body.

Referee: [Abstract] The suggestion that the generated data 'could support agent-based epidemic modeling approaches' sits in tension with the explicit finding that the synthetic records are 'less successful at capturing how these factors vary together within respondents.' Agent-based models require consistent per-respondent attribute bundles (e.g., correlated beliefs and behaviors), not merely marginal or group-level agreement; the manuscript should either supply evidence that marginal matching is sufficient for the intended exploratory uses or qualify the modeling claim more narrowly.

Authors: We agree that the abstract phrasing risks overstating applicability. Our results explicitly document weaker performance on within-respondent covariation, which is a substantive limitation for ABMs that depend on realistic individual-level bundles. No additional evidence is provided that marginal matching alone suffices for such models. We will revise the abstract to state that the synthetic data may aid exploratory augmentation but should not be assumed sufficient for ABMs without further validation of joint distributions, aligning the claim with the paper's conclusions and the referee's point.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons to external survey data

The paper performs clustering on real FluPaths survey data, uses the resulting groups to construct prompts for LLMs, generates synthetic records, and evaluates them via direct statistical comparison of marginal distributions and a trained classifier against the held-out real survey responses. No derivation, equation, or prediction reduces to a fitted parameter or self-defined quantity by construction; no load-bearing self-citation chain or imported uniqueness theorem appears; all performance claims rest on external empirical benchmarks rather than internal re-labeling of inputs.
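
For a categorical survey item, one common way to run such a marginal comparison is the total variation distance between answer shares. The paper's exact "shape" metric is not specified on this page, so the function below is an assumed stand-in, and the gender-coded answers are hypothetical.

```python
from collections import Counter

def total_variation_distance(real_values, synthetic_values):
    """Half the summed absolute difference in category shares (0 = identical marginals)."""
    real_counts, synth_counts = Counter(real_values), Counter(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / len(real_values) - synth_counts[c] / len(synthetic_values))
        for c in categories
    )

# Hypothetical answers to one categorical item, coded 1 or 2.
real_gender = [1] * 48 + [2] * 52
synthetic_gender = [1] * 50 + [2] * 50

distance = total_variation_distance(real_gender, synthetic_gender)
print(f"TVD = {distance:.2f}, similarity = {1 - distance:.2f}")
```

A score like this is computed one column at a time, which is why it cannot detect the within-respondent mismatches the paper reports.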

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions about LLM prompt-following capabilities and the representativeness of the FluPaths survey sample without introducing fitted parameters, new entities, or ad-hoc axioms beyond domain-standard ones.

axioms (1)

Domain assumption: LLMs prompted with cluster-derived information can produce outputs whose marginal statistics align with real population patterns in health decision-making.
This underpins the prompting strategy and the interpretation of distribution-matching results.
