arxiv: 2606.04592 · v1 · pith:R2MYTA5Snew · submitted 2026-06-03 · 💻 cs.CY · cs.AI · cs.HC

Synthetic Personalities: How Well Can LLMs Mimic Individual Respondents Using Socio-Economic Microdata?

Pith reviewed 2026-06-28 04:14 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.HC
keywords digital twins · LLM respondents · socio-economic panel · market research · synthetic respondents · information entropy · SOEP · held-out prediction

The pith

LLM-based individual twins from existing socio-economic panel data reach 78.8 percent accuracy on held-out questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether LLMs can generate detailed synthetic individuals, or twins, that mimic real respondents using only pre-existing heterogeneous microdata from panels such as the German Socio-Economic Panel (SOEP), rather than purpose-built surveys. It tests this across a grid of three open-weight models, five levels of cumulative information ranked by normalized Shannon entropy, two embedding approaches, and two reasoning modes, scoring more than 2.1 million responses from 500 participants on 183 held-out questions. Twin performance improves steadily with added information depth but shows clear diminishing returns past the 75 percent entropy quartile, which the paper identifies as a cost-efficient balance point. The strongest configurations reach 78.8 percent accuracy and a Fisher-z correlation of 0.590, which the authors take to suggest that firms' accumulated CRM and loyalty records can support operationally useful twins without new primary data collection.

Core claim

Detailed individual-level twins constructed from SOEP microdata achieve best-cell accuracy of 78.8 percent and Fisher-z correlation of r = 0.590 on a held-out set of 183 questions, with quality rising alongside information depth but exhibiting diminishing returns past the 75 percent entropy quartile that functions as a cost-efficient Pareto point relative to full-data cells; raw dialog-history embeddings outperform narrative summaries at full depth while explicit thinking improves rank-order correlation without raising accuracy.

What carries the argument

The 3 × 5 × 2 × 2 construction-method grid of three LLMs, five cumulative information depths ranked by normalized Shannon entropy, two embedding methods, and two reasoning modes.
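To make the size of this grid concrete, the sketch below enumerates it as a Cartesian product. The axis sizes come from the paper's description and the labels are taken from the figure captions on this page; the exact label strings and the `construction_grid` helper are illustrative assumptions, not the authors' code.

```python
from itertools import product

# Axis labels are illustrative; only the axis sizes (3, 5, 2, 2) and the
# model / depth / embedding / reasoning names are taken from the page text.
MODELS = ["Qwen 3", "Ministral 3", "Gemma 4"]
DEPTHS = ["Basic Demographic", "25%", "50%", "75%", "100%"]
EMBEDDINGS = ["Persona summary", "Dialog"]
REASONING = ["Non-Thinking", "Thinking"]


def construction_grid():
    """Return every (model, depth, embedding, reasoning) cell of the grid."""
    return list(product(MODELS, DEPTHS, EMBEDDINGS, REASONING))


grid = construction_grid()
cells_per_model = len(grid) // len(MODELS)
print(len(grid), cells_per_model)  # 60 cells in total, 20 per model
```

The 20 cells per model match the "20 construction cells per model" mentioned in the Figure 7 caption.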

If this is right

  • Market researchers can construct detailed individual twins from existing panel, CRM, and loyalty data without designing new primary surveys.
  • The 75 percent entropy quartile supplies a concrete stopping rule for data inclusion that preserves most performance at lower collection cost.
  • Raw dialog-history embeddings raise hold-out accuracy across all tested models and reasoning modes at full information depth.
  • Explicit chain-of-thought reasoning improves rank-order correlation between twin and human responses without increasing raw accuracy.
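One way to read the entropy-based stopping rule is sketched below. It assumes that each question's Shannon entropy is normalized by the log of its number of response options, and that a depth level keeps the highest-entropy questions until a given share of total entropy is covered. Both choices, and the names `normalized_entropy` and `select_depth`, are assumptions for illustration; the page does not specify the paper's exact procedure.

```python
import math
from collections import Counter


def normalized_entropy(responses, n_options):
    """Shannon entropy of the observed answers, divided by log(n_options)."""
    if n_options < 2:
        return 0.0
    counts = Counter(responses)
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(n_options)


def select_depth(entropies, share):
    """Pick question ids, highest entropy first, until `share` of the
    summed entropy is covered (assumed reading of a cumulative depth level)."""
    ranked = sorted(entropies.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(entropies.values())
    chosen, running = [], 0.0
    for qid, h in ranked:
        if total == 0 or running >= share * total - 1e-12:
            break
        chosen.append(qid)
        running += h
    return chosen


# Toy data: three binary questions answered by four hypothetical respondents.
answers = {
    "q1": ["a", "b", "a", "b"],  # evenly split, maximal entropy
    "q2": ["a", "a", "a", "b"],  # skewed
    "q3": ["a", "a", "a", "a"],  # constant, zero entropy
}
entropies = {q: normalized_entropy(r, 2) for q, r in answers.items()}
print(select_depth(entropies, 0.75))
```

Under this reading, questions that everyone answers the same way carry no entropy and are the last to enter a twin's context, which is why a cut-off below full depth can retain most of the useful information.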

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same grid could be applied to commercial CRM datasets to test whether the entropy-quartile pattern and accuracy ceiling generalize beyond national panels.
  • If the diminishing-returns curve holds, organizations could reduce new respondent recruitment by recycling historical response sequences for twin construction.
  • The approach invites direct comparison of twin predictions against actual future behavior on the same individuals to measure predictive validity beyond cross-sectional hold-out accuracy.

Load-bearing premise

The 183 held-out questions are representative of the operational questions market researchers would actually pose to these individuals, and the SOEP microdata contain no systematic gaps or selection effects that would make the twins unrepresentative for new items.

What would settle it

Collect a fresh battery of 183 questions from the same 500 SOEP participants and test whether the reported accuracy and correlation levels hold for the best-performing twin configurations.

Figures

Figures reproduced from arXiv: 2606.04592 by Jochen Hartmann, Leonard Kinzinger.

Figure 1: Cumulative distribution of normalized Shannon entropy across the 728 persona-context questions, sorted by descending entropy. Dashed vertical lines mark the boundaries of the four cumulative information depth levels. The dashed diagonal represents perfect equality (uniform entropy contribution). The distribution of normalized entropy across the 728 persona-context questions is non-uniform (see cumulative d…
Figure 2: Accuracy (left) and Fisher-z correlation (right) by information depth, with ±1 standard-error bands per model computed across the four construction cells (embedding × reasoning) at each depth. Across the cells, the bulk of the gain landed in the two middle transitions (25 → 50 % and 50 → 75 %): together they accounted for +4.3 of the +5.2 pp accuracy lift and +16.7 of the +21.9 pp correlation lift. The ope…
Figure 3: Per-model accuracy by information depth, with all four construction methods overlaid. To bound how much of the accuracy and correlation levels came from persona content rather than from instruction-following, we ran an empty-persona ablation where Qwen 3 Dialog Non-Thinking was re-evaluated at all five depths with the persona content replaced by an empty block (…
Figure 4: Empty-persona ablation. Personalized versus empty-persona accuracy by information depth on Qwen 3 Dialog Non-Thinking (n = 500 participants, 183 held-out questions). Empty-persona accuracy is depth-flat at 0.65–0.66; the personalization delta widens from +4.2 pp at Basic Demographic to +10.8 pp at the 100 % Quartile. We also found that the entropy-based ranking of persona-context questions had less effect…
Figure 5: Personalized accuracy by information depth (Qwen 3 Dialog Non-Thinking, n = 500 participants, 183 held-out items), split by difficulty tercile with 95 % confidence bands. Difficulty is defined by empty-persona accuracy at the 100 % Quartile. The hard tercile (red) absorbed 3.9× the lift of the easy tercile (blue) over the depth range. To better understand which questions drive the increase in accuracy and …
Figure 6: Z-scored composite fingerprint for the three models across the three twin-quality metrics. Each cell is the across-model z-score of the model's per-cell-averaged metric value (accuracy / Fisher-z correlation / closeness to human variance, encoded as −|σtwin/σhuman − 1| so higher = closer to parity = better). Higher and greener = better on all three columns. The per-question variance-ratio distributions (…
Figure 7: Per-question variance-ratio distribution σtwin/σhuman by model on a log x-axis. Each per-(model, question) value is the mean across the 20 construction cells per model of the per-cell winsorised SD ratio on that question; per-cell ratios are capped at 5 before averaging. Restricted to ordinal- and metric-scale items. Dashed grey line at ratio = 1; solid black line at the per-model median. … and +7.2 percentage…
Figure 8: Persona summary (left endpoint) versus Dialog input (right endpoint) per model, pooled across reasoning modes and information depths. Left: accuracy. Center: Fisher-z correlation. Right: variance ratio. Ministral 3 Non-Thinking (+9.8 pp), Gemma 4 Thinking (+8.5 pp), and Ministral 3 Thinking (+7.8 pp); Qwen 3, by contrast, picked up at most +2.4 pp from the Dialog embedding at 100 % depth. The accuracy gain…
Figure 9: Non-Thinking (left endpoint) versus Thinking (right endpoint) per model, pooled across embedding methods and information depths. Left: accuracy. Right: Fisher-z correlation. Accuracy barely moved between reasoning modes for any model (all |∆acc| ≤ 2 percentage points across cells; only Ministral 3's +1.6 percentage-point shift reached a paired-t p-value below 0.05). Fisher-z correlation rose under Thinking…
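The variance-ratio metric named in the Figure 6 and Figure 7 captions can be sketched as follows. The cap at 5 and the closeness encoding −|σtwin/σhuman − 1| are taken from those captions; the use of the sample standard deviation, the omission of the winsorisation step, and the helper names `sd_ratio` and `closeness` are assumptions made for illustration.

```python
import statistics

RATIO_CAP = 5.0  # per-cell cap stated in the Figure 7 caption


def sd_ratio(twin_answers, human_answers):
    """Capped ratio of twin SD to human SD on one question.

    Assumption: plain sample SD; the captions mention a winsorised SD,
    which this sketch does not reproduce.
    """
    sd_human = statistics.stdev(human_answers)
    if sd_human == 0:
        return None  # ratio undefined when humans all give the same answer
    return min(statistics.stdev(twin_answers) / sd_human, RATIO_CAP)


def closeness(ratio):
    """Closeness to human variance: 0 at parity, more negative when further away."""
    return -abs(ratio - 1.0)


# Toy example on a 5-point scale: the twin answers cluster around the midpoint.
human = [1, 2, 3, 4, 5]
twin = [2, 3, 3, 3, 4]
ratio = sd_ratio(twin, human)
print(round(ratio, 3), round(closeness(ratio), 3))
```

A ratio below 1, as in the toy example, means the twin's answers are less dispersed than the human answers on that item.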
Original abstract

LLM-based digital twins promise to scale and accelerate market research, but most published twins are either coarse persona bots conditioned on a few demographic questions or detailed individual-level twins built on purpose-collected surveys and interview transcripts. Neither setup speaks to the operationally most relevant case for marketing practice: building detailed individual twins from the pre-existing heterogeneous panel data that firms already accumulate through CRM systems, loyalty programs, and repeat surveys. We construct detailed individual-level twins from the German Socio-Economic Panel (SOEP) and evaluate them across a 3 × 5 × 2 × 2 construction-method grid that covers three open-weights LLMs, five cumulative information depths ranked by normalized Shannon entropy, two embedding methods, and two reasoning modes, scoring over 2.1 million twin responses on 500 participants and 183 held-out questions. Twin quality rises with information depth but with diminishing returns past the 75 percent entropy quartile, which acts as a cost-efficient Pareto point relative to the best-performing 100 percent cells. Switching the embedding from a narrative persona summary to a raw dialog history of past responses raises hold-out accuracy in every model-by-reasoning cell at the 100 percent depth, while an explicit thinking mode raises rank-order correlation without moving accuracy. Best-cell accuracy reaches 78.8 percent and Fisher-z correlation reaches r = 0.590 on the SOEP held-out evaluation set. The findings suggest that twin-based market research is no longer gated by data design, but by item volume, model selection, and a small set of construction-level decisions that this paper now maps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper constructs detailed LLM-based individual digital twins from German Socio-Economic Panel (SOEP) microdata for 500 participants. It evaluates them on 183 held-out SOEP questions across a 3×5×2×2 grid (three open-weight LLMs, five cumulative information depths by normalized Shannon entropy, two embedding methods, two reasoning modes), scoring over 2.1 million responses. Key findings are that quality rises with information depth but shows diminishing returns past the 75% entropy quartile (a proposed Pareto-efficient point), raw dialog-history embeddings outperform narrative summaries, explicit thinking improves rank-order correlation but not accuracy, and best-cell performance reaches 78.8% accuracy and Fisher-z r=0.590.

Significance. If the results hold, the systematic grid and large response volume provide a useful empirical map of construction choices for individual-level twins from pre-existing heterogeneous panel data, identifying an efficient information-depth threshold and embedding effects. This is a concrete contribution to the digital-twin literature. The evaluation remains internal to the SOEP instrument, however, so the abstract's claim that twin-based market research is now gated only by item volume, model choice, and construction decisions rests on an untested transfer assumption.

major comments (1)
  1. Abstract: the concluding suggestion that twin-based market research is no longer gated by data design but by item volume, model selection, and construction decisions rests on accuracy and correlation measured exclusively on 183 held-out SOEP questions that share the same survey instrument, response scales, topic distribution, and selection process as the conditioning data. No experiment tests whether the same construction grid retains 78.8% accuracy or r=0.590 when target items are drawn from typical operational market-research instruments (brand preference, willingness-to-pay, ad recall). This directly affects transfer of the Pareto claim at the 75% entropy quartile.
minor comments (2)
  1. Abstract and methods: the reported metrics from 2.1 million responses lack error bars, exact prompting templates, participant selection details, and statistical tests on the diminishing-returns and Pareto claims.
  2. Notation: the abstract refers to 'Fisher-z correlation' without specifying the exact transformation or baseline used for the r=0.590 value.
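For context on the notation point, one common reading of a "Fisher-z correlation" is a mean of per-participant correlations taken in Fisher z-space and transformed back to the r scale. The sketch below shows that reading only; whether the paper aggregates this way, and at which level, is not stated on this page, and the helper names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Plain Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def fisher_z_mean(correlations, clip=0.999999):
    """Average correlations via z = atanh(r), then map back with tanh.

    Values are clipped away from ±1 because atanh diverges there.
    """
    zs = [math.atanh(max(-clip, min(clip, r))) for r in correlations]
    return math.tanh(sum(zs) / len(zs))


# Two hypothetical participants with twin-human correlations of 0.0 and 0.8.
print(round(fisher_z_mean([0.0, 0.8]), 3))  # 0.5, versus an arithmetic mean of 0.4
```

The gap between 0.5 and 0.4 in the toy case shows why the exact aggregation matters when interpreting a headline value such as r = 0.590.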

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed report and for identifying the scope limitation in our evaluation. We respond to the single major comment below and agree that a revision to the abstract is warranted.

Point-by-point responses
  1. Referee: Abstract: the concluding suggestion that twin-based market research is no longer gated by data design but by item volume, model selection, and construction decisions rests on accuracy and correlation measured exclusively on 183 held-out SOEP questions that share the same survey instrument, response scales, topic distribution, and selection process as the conditioning data. No experiment tests whether the same construction grid retains 78.8% accuracy or r=0.590 when target items are drawn from typical operational market-research instruments (brand preference, willingness-to-pay, ad recall). This directly affects transfer of the Pareto claim at the 75% entropy quartile.

    Authors: We agree that the evaluation uses only held-out SOEP items and that no transfer experiments were performed on instruments containing brand preference, willingness-to-pay, or ad-recall questions. The abstract's suggestion therefore rests on the untested premise that performance patterns observed within the SOEP instrument will generalize to other survey-based market-research tasks. While SOEP covers a broad range of socio-economic topics that overlap with many market-research domains, this does not constitute direct evidence of transfer. We will revise the abstract to qualify the claim, making clear that the identified construction decisions and Pareto threshold apply to prediction within similar panel-survey instruments and that external validation on operational market-research items remains an open question for future work. revision: yes

Circularity Check

0 steps flagged

No significant circularity; evaluation uses external held-out benchmark

full rationale

The paper constructs LLM-based twins from SOEP microdata and measures performance via accuracy and Fisher-z correlation on 183 held-out questions from the same panel. These held-out responses constitute an external benchmark independent of any fitted parameters or construction choices inside the paper. No equations, self-citations, or ansatzes are shown to reduce the reported 78.8% accuracy or r=0.590 to quantities defined by the inputs themselves. The 3×5×2×2 grid varies construction methods but evaluates them against real respondent answers not used for conditioning, keeping the central claims falsifiable and non-circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract describes an empirical benchmarking study without introducing new mathematical constructs, free parameters, or postulated entities. All elements are standard LLM evaluation practices applied to socio-economic data.

pith-pipeline@v0.9.1-grok · 5837 in / 1322 out tokens · 51387 ms · 2026-06-28T04:14:10.455839+00:00 · methodology

