Large Language Models for Market Research: A Data-augmentation Approach
Large Language Models (LLMs) have transformed artificial intelligence by excelling in complex natural language processing tasks. Their ability to generate human-like text has opened new possibilities for market research, particularly in conjoint analysis, where understanding consumer preferences is essential but often resource-intensive. Traditional survey-based methods face limitations in scalability and cost, making LLM-generated data a promising alternative. However, while LLMs have the potential to simulate real consumer behavior, recent studies highlight a significant gap between LLM-generated and human data, with biases introduced when substituting between the two. In this paper, we address this gap by proposing a novel statistical data augmentation approach that efficiently integrates LLM-generated data with real data in conjoint analysis. This results in statistically robust estimators with consistent and asymptotically normal properties, in contrast to naive approaches that simply substitute human data with LLM-generated data, which can exacerbate bias. We further present a finite-sample performance bound on the estimation error. We validate our framework through an empirical study on COVID-19 vaccine preferences, demonstrating its superior ability to reduce estimation error and save data and costs by 24.9% to 79.8%. In contrast, naive approaches fail to save data due to the inherent biases in LLM-generated data compared to human data. Another empirical study on sports car choices validates the robustness of our results. Our findings suggest that while LLM-generated data is not a direct substitute for human responses, it can serve as a valuable complement when used within a robust statistical framework.
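The contrast the abstract draws between naive substitution and statistically robust augmentation can be illustrated with a minimal sketch. This is not the paper's actual estimator; it is a generic debiasing scheme in the spirit of prediction-powered inference, applied to a toy mean-estimation problem where a small paired human sample is used to correct the bias of a large LLM-generated sample. All names and numbers here are hypothetical assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: estimate a mean preference score theta.
# Human responses are unbiased but scarce; LLM-generated responses
# are plentiful but systematically biased.
theta_true = 0.6
n_human, n_llm = 200, 5000
llm_bias = 0.15  # assumed systematic offset in LLM responses

human = rng.normal(theta_true, 1.0, n_human)
# LLM answers paired with the same questions the humans answered,
# plus a large unpaired LLM-only sample.
llm_paired = human + rng.normal(llm_bias, 0.3, n_human)
llm_extra = rng.normal(theta_true + llm_bias, 1.0, n_llm)

# Naive substitution: use LLM data alone -> estimator inherits the bias.
naive = llm_extra.mean()

# Debiased augmentation: large LLM sample mean, minus a bias
# correction estimated from the paired human/LLM subsample.
augmented = llm_extra.mean() - (llm_paired - human).mean()
```

Under these assumptions the augmented estimator stays centered on `theta_true` while the naive one does not, mirroring the abstract's claim that LLM data complements rather than substitutes for human responses.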
Forward citations
Cited by 3 Pith papers
- Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys: A method using predicted rectification difficulty for optimal human sample allocation in LLM-augmented surveys captures 61-79% of theoretical efficiency gains and reduces MSE by 11% on two datasets without pilot data.
- Adaptive Budget Allocation in LLM-Augmented Surveys: An adaptive budget allocation algorithm for LLM-augmented surveys learns question-level LLM reliability on the fly from human labels and reduces labeling waste from 10-12% to 2-6% compared to uniform allocation.
- Generative Augmented Inference: GAI uses orthogonal moment conditions to integrate arbitrary AI-generated auxiliary data into human-label models, delivering consistent estimates, asymptotic normality, and a safe-default efficiency improvement over h...