arxiv: 2401.07345 · v3 · submitted 2024-01-14 · econ.GN · q-fin.EC

Can an LLM Learn Preferences from Choice Data?

Pith reviewed 2026-05-24 04:52 UTC · model grok-4.3

classification econ.GN · q-fin.EC
keywords large language models · preference learning · choice under uncertainty · disappointment aversion · revealed preference · recommendation systems · heterogeneous performance

The pith

Large language models improve at generating recommendations consistent with learned preferences as they observe more choices, though success varies by model and preference type.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs can infer preferences from choice data and use them to recommend in new situations under uncertainty. It applies a Simulate-Recommend-Evaluate framework to the disappointment aversion model, generating optimal choices as benchmarks for comparison. Recommendation accuracy rises with more observed choices. Different LLMs show distinct strengths, with GPT performing better on risk aversion, Gemini on high disappointment aversion, and Claude across broader regions.

Core claim

The paper proposes a portable Simulate-Recommend-Evaluate framework that tests preference learning from revealed-choice data by comparing LLM recommendations with optimal choices implied by known preference primitives of the disappointment aversion model. Recommendation accuracy improves as models observe more choices, but learning is heterogeneous across preference types and LLMs: GPT learns risk aversion better than disappointment aversion, Gemini performs best in high disappointment-aversion regions, and Claude shows the broadest effective learning across parameter regions.

What carries the argument

Simulate-Recommend-Evaluate framework that benchmarks LLM outputs against exact optimal choices from the disappointment aversion model.
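
The three stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the scalar encoding of situations and choices, the `toy_optimal` rule, and the `nearest_neighbour_recommender` that stands in for an LLM call are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical encodings, chosen only to keep the sketch short.
Situation = float                       # e.g. a relative price
Choice = float                          # e.g. share of wealth in one asset
Observation = Tuple[Situation, Choice]
Recommender = Callable[[List[Observation], Situation], Choice]


def simulate(optimal: Callable[[Situation], Choice],
             situations: Sequence[Situation]) -> List[Observation]:
    """Simulate: generate choices from known preference primitives."""
    return [(s, optimal(s)) for s in situations]


def evaluate(optimal: Callable[[Situation], Choice],
             recommend: Recommender,
             observed: Sequence[Situation],
             unseen: Sequence[Situation]) -> float:
    """Recommend and Evaluate: mean absolute gap to the optimal choice."""
    history = simulate(optimal, observed)
    gaps = [abs(recommend(history, s) - optimal(s)) for s in unseen]
    return sum(gaps) / len(gaps)


def toy_optimal(s: Situation) -> Choice:
    """Placeholder decision rule; NOT the disappointment aversion model."""
    return 1.0 / (1.0 + s)


def nearest_neighbour_recommender(history: List[Observation],
                                  s: Situation) -> Choice:
    """Stand-in for an LLM: copy the choice from the closest observed case."""
    _, choice = min(history, key=lambda obs: abs(obs[0] - s))
    return choice
```

Because `evaluate` only needs a callable for the benchmark and a callable for the recommender, swapping in a different preference model or a different LLM leaves the loop unchanged, which is one way to read the "portable" claim.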

If this is right

  • Recommendation accuracy increases with the number of observed choices provided to the LLM.
  • Different LLMs exhibit distinct patterns of effective learning across risk aversion and disappointment aversion parameters.
  • LLMs can produce recommendations in unseen choice situations that align more closely with the underlying preference model after exposure to sufficient data.
  • Heterogeneity implies that model selection may matter for applications involving specific preference structures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be applied to other preference models beyond disappointment aversion to map LLM strengths more broadly.
  • If heterogeneity persists, hybrid systems that route tasks to particular LLMs based on expected preference type might improve overall performance.
  • Economic experiments using LLMs as stand-ins for human subjects would need to account for these model-specific biases in learning.

Load-bearing premise

The optimal choices implied by the disappointment aversion model's preference primitives can be computed exactly to provide an objective ground-truth benchmark.
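
To make the premise concrete, the sketch below computes a benchmark choice for the simplest case: two states and a linear budget line. It uses the two-outcome form of Gul's (1991) representation, in which the better outcome with probability q gets weight q / (1 + β(1 - q)). The CRRA utility with parameter `rho`, the budget-line setup, and the grid search are assumptions of this sketch; the page does not say how the paper computes its optima, and a grid search is only approximate.

```python
import math
from typing import Tuple


def crra(x: float, rho: float) -> float:
    """CRRA utility (assumed functional form); rho is relative risk aversion."""
    if x <= 0:
        raise ValueError("consumption must be positive")
    return math.log(x) if rho == 1.0 else x ** (1.0 - rho) / (1.0 - rho)


def da_value_binary(x1: float, x2: float, p1: float,
                    beta: float, rho: float) -> float:
    """Disappointment-averse value of a lottery paying x1 w.p. p1, else x2."""
    hi, lo = max(x1, x2), min(x1, x2)
    q = p1 if x1 >= x2 else 1.0 - p1     # probability of the better outcome
    num = q * crra(hi, rho) + (1.0 + beta) * (1.0 - q) * crra(lo, rho)
    return num / (1.0 + beta * (1.0 - q))


def best_allocation(price1: float, price2: float, income: float, p1: float,
                    beta: float, rho: float, n: int = 1000) -> Tuple[float, float]:
    """Approximate optimum on price1*x1 + price2*x2 = income by grid search."""
    best = None
    for i in range(1, n):                # interior grid points only
        x1 = (i / n) * income / price1
        x2 = (income - price1 * x1) / price2
        value = da_value_binary(x1, x2, p1, beta, rho)
        if best is None or value > best[0]:
            best = (value, x1, x2)
    return best[1], best[2]
```

With `beta = 0` the value reduces to expected utility. With `beta > 0` and nearly equal prices, the sketch selects an almost riskless allocation (x1 close to x2), the kink behaviour usually associated with disappointment aversion.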

What would settle it

LLM recommendation accuracy that stays flat or falls as the number of observed choices increases would contradict the reported improvement.
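
One simple way to operationalise this check is the least-squares slope of accuracy on the number of observed choices: a slope at or below zero would be the contradicting pattern. The helper below is a hypothetical sketch, not the paper's test; it returns a point estimate only, so a real check would also need standard errors.

```python
from typing import Sequence


def ols_slope(n_observed: Sequence[float], accuracy: Sequence[float]) -> float:
    """Least-squares slope of accuracy on the number of observed choices."""
    if len(n_observed) != len(accuracy) or len(n_observed) < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(n_observed) / len(n_observed)
    mean_y = sum(accuracy) / len(accuracy)
    sxx = sum((x - mean_x) ** 2 for x in n_observed)
    if sxx == 0:
        raise ValueError("number of observed choices must vary")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(n_observed, accuracy))
    return sxy / sxx
```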

Figures

Figures reproduced from arXiv: 2401.07345 by Euncheol Shin, Hector Tzavellas, Jeongbin Kim, Kyu-Min Lee, Matthew Kovach.

Figure 1: Illustration of CCEI and the deviation from EUT index
Figure 2: Illustration of disappointment and risk aversion parameters
Figure 3: Illustration of disappointment and risk aversion parameters in comparison to representative …
Figure 4: Illustration of the results for β
Figure 5: Illustration of the results for ρ
    Adjacent text from the paper: "… complete table of estimates can be found in SI Appendix D. When only one choice sample is provided to GPT, no meaningful personalization is observed, as expected. The estimate regression coefficient, 0.002, is not significantly different from zero (p-value = 0.960). However, aside from this extreme case, we observe that all estimates are strictly and significantly greater …"

Abstract

Can large language models (LLMs) learn a decision maker's preferences from observed choices and generate preference-consistent recommendations in new situations? We propose a portable Simulate-Recommend-Evaluate framework that tests preference learning from revealed-choice data by comparing LLM recommendations with optimal choices implied by known preference primitives. We apply the framework to choice under uncertainty using the disappointment aversion model. Recommendation accuracy improves as models observe more choices, but learning is heterogeneous across preference types and LLMs: GPT learns risk aversion better than disappointment aversion, Gemini performs best in high disappointment-aversion regions, and Claude shows the broadest effective learning across parameter regions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a Simulate-Recommend-Evaluate framework to test whether LLMs can infer preferences from revealed choice data and produce consistent recommendations in new settings. It applies the framework to choice under uncertainty using the disappointment aversion (DA) model of Gul (1991), comparing LLM outputs against optimal choices derived from known DA parameters. The central empirical claim is that recommendation accuracy rises with the number of observed choices, but learning is heterogeneous: GPT learns risk aversion better than disappointment aversion, Gemini excels in high-DA regions, and Claude shows the broadest learning across parameter space.

Significance. If the ground-truth benchmark is reliable and the heterogeneity patterns are robust, the portable framework could provide a replicable method for evaluating preference learning in LLMs against explicit economic primitives, with potential applications in AI-assisted decision making. The approach is notable for its use of independently computed optima rather than fitted parameters.

major comments (2)
  1. [Methods (benchmark computation)] Methods section describing the benchmark computation: the paper must specify the numerical procedure (fixed-point solver, optimization routine, or closed-form approximation) used to obtain the 'optimal choices implied by known preference primitives' for the DA model. Because DA certainty equivalents generally require solving a fixed-point equation involving the reference point and disappointment parameter β, any unreported tolerance, convergence criterion, or sensitivity analysis risks misattributing small benchmark errors to LLM performance differences, particularly in the heterogeneous regions highlighted for GPT, Gemini, and Claude.
  2. [Results (heterogeneity analysis)] Results on heterogeneity (e.g., the GPT vs. Gemini vs. Claude comparisons): without reported sample sizes per parameter region, choice-generation procedure, prompting details, or statistical tests for the accuracy differences, it is impossible to assess whether the reported patterns (GPT better on risk aversion, Gemini on high-DA) are statistically distinguishable from noise or from variation in the numerical ground truth.
minor comments (2)
  1. [Abstract] The abstract states results on accuracy and heterogeneity but supplies no information on choice generation, prompting, sample sizes, or controls; the full paper should ensure these details appear in the main text or a dedicated methods subsection.
  2. [Model description] Notation for the DA parameters (e.g., β, reference point) should be defined at first use and kept consistent with Gul (1991) to avoid ambiguity when describing the ground-truth computation.
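
The fixed-point issue raised in major comment 1 can be illustrated for a finite lottery. In Gul's (1991) model the value V satisfies Σ p_i (u_i - V)(1 + β·1[u_i < V]) = 0, so the set of disappointing outcomes depends on V itself. The left-hand side is continuous and strictly decreasing in V when β > -1, which makes bisection a safe choice. The sketch below is one possible solver under these assumptions, not the procedure used in the paper, and `tol` is a free parameter of the sketch.

```python
from typing import Sequence


def da_value(utils: Sequence[float], probs: Sequence[float],
             beta: float, tol: float = 1e-6) -> float:
    """Solve for the disappointment-averse value of a lottery by bisection.

    utils are utility levels u(x_i); the result is on the utility scale,
    so the certainty equivalent is the inverse utility of the return value.
    """
    if len(utils) != len(probs) or not utils:
        raise ValueError("utils and probs must be non-empty and aligned")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to one")
    if beta <= -1.0:
        raise ValueError("beta must exceed -1")

    def excess(v: float) -> float:
        # Outcomes below v are disappointing and get extra weight (1 + beta).
        return sum(p * (u - v) * (1.0 + beta if u < v else 1.0)
                   for u, p in zip(utils, probs))

    lo, hi = min(utils), max(utils)      # the root lies in this bracket
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Reporting `tol` alongside the results, and rerunning with a tighter value, is the kind of sensitivity check the comment asks for.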

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive comments on the benchmark computation and heterogeneity analysis. We address each below.

  1. Referee: Methods section describing the benchmark computation: the paper must specify the numerical procedure (fixed-point solver, optimization routine, or closed-form approximation) used to obtain the 'optimal choices implied by known preference primitives' for the DA model. Because DA certainty equivalents generally require solving a fixed-point equation involving the reference point and disappointment parameter β, any unreported tolerance, convergence criterion, or sensitivity analysis risks misattributing small benchmark errors to LLM performance differences, particularly in the heterogeneous regions highlighted for GPT, Gemini, and Claude.

    Authors: We agree that the numerical procedure must be fully specified to avoid potential misattribution of errors. The revised manuscript will add an explicit description of the fixed-point solver (including the iterative algorithm, tolerance of 1e-6, convergence criterion, and any sensitivity checks), ensuring the benchmark is transparent and reproducible. revision: yes

  2. Referee: Results on heterogeneity (e.g., the GPT vs. Gemini vs. Claude comparisons): without reported sample sizes per parameter region, choice-generation procedure, prompting details, or statistical tests for the accuracy differences, it is impossible to assess whether the reported patterns (GPT better on risk aversion, Gemini on high-DA) are statistically distinguishable from noise or from variation in the numerical ground truth.

    Authors: We acknowledge the need for these details to support statistical assessment. The revision will report sample sizes per region, the full choice-generation procedure, prompting templates, and formal statistical tests (e.g., pairwise proportion tests) for accuracy differences across LLMs and parameter regions. revision: yes
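
A pairwise proportion test of the kind mentioned in the second response could look like the sketch below, a pooled two-sample z-test with a two-sided p-value. It assumes accuracy is recorded as a hit or miss per recommendation, which the page does not confirm, and it makes no correction for multiple comparisons across LLMs and parameter regions. The function name and arguments are hypothetical.

```python
import math
from typing import Tuple


def two_proportion_z_test(hits_a: int, n_a: int,
                          hits_b: int, n_b: int) -> Tuple[float, float]:
    """Pooled two-sample z-test for equal proportions; returns (z, p_value)."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("pooled proportion is 0 or 1; test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

The inputs would be the per-region hit counts and sample sizes that the revision promises to report.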

Circularity Check

0 steps flagged

No significant circularity; evaluation uses external model-derived ground truth

full rationale

The paper's Simulate-Recommend-Evaluate framework generates synthetic choice data from fixed disappointment-aversion parameters, supplies it to LLMs, and scores LLM outputs against optimal choices computed directly from those same known parameters. This structure treats the preference model as an independent benchmark rather than deriving the benchmark from LLM behavior or fitting LLM outputs back into the evaluation metric. No equations redefine predictions as inputs, no self-citations carry the central claim, and the comparison chain remains self-contained against the external model primitives.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages

  1. [1] Afriat, S. N. (1967): The Construction of Utility Functions from Expenditure Data, International Economic Review, 8, 67-77
  2. [2] Afriat, S. N. (1973): On a System of Inequalities in Demand Analysis: An Extension of the Classical Method, International Economic Review, 14, 460-472
  3. [3] Akata, E., L. Schulz, J. Coda-Forno, S. J. Oh, M. Bethge, and E. Schulz (2023): Playing repeated games with Large Language Models, Working Paper
  4. [4] Brookins, P. and J. DeBacker (2023): Playing games with GPT: What can we learn about a large language model from canonical strategic games, Working Paper
  5. [5] Chambers, C. P. and F. Echenique (2016): Revealed Preference Theory, Cambridge University Press
  6. [6] Charness, G., B. Jabarian, and J. A. List (2023): Generation Next: Experimentation with AI, Working Paper 31679, National Bureau of Economic Research
  7. [7] Chen, L., M. Zaharia, and J. Zou (2023a): How Is ChatGPT’s Behavior Changing over Time, Working Paper
  8. [8] Chen, Y., T. X. Liu, Y. Shan, and S. Zhong (2023b): The Emergence of Economic Rationality of GPT, Working Paper
  9. [9] Choi, S., R. Fisman, D. Gale, and S. Kariv (2007): Consistency and Heterogeneity of Individual Behavior under Uncertainty, American Economic Review, 97, 1921-1938
  10. [10] Choi, S., S. Kariv, W. Müller, and D. Silverman (2014): Who Is (More) Rational? American Economic Review, 104, 1518-1550
  11. [11] Chow, A. R. (2023): How ChatGPT Managed to Grow Faster Than TikTok or Instagram, New York Times, accessed: 2023-12-15
  12. [12] D’Acunto, F. and A. G. Rossi (2023): Robo-Advice: Transforming Households into Rational Economic Agents, Annual Review of Financial Economics, 15, 543-563
  13. [13] Echenique, F., T. Imai, and K. Saito (2023): Approximate Expected Utility Rationalization, Journal of the European Economic Association (Forthcoming)
  14. [14] Eloundou, T., S. Manning, P. Mishkin, and D. Rock (2023): GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models, Working Paper
  15. [15] Gul, F. (1991): A Theory of Disappointment Aversion, Econometrica, 59, 667-686
  16. [16] Guo, F. (2023): GPT Agents in Game Theory Experiments, Working Paper
  17. [17] Halevy, Y., D. Persitz, and L. Zrill (2018): Parametric Recoverability of Preferences, Journal of Political Economy, 126, 1558-1593
  18. [18] Horton, J. J. (2023): Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus? Working Paper 31122, National Bureau of Economic Research
  19. [19] Jansen, B. J., S.-g. Jung, and J. Salminen (2023): Employing large language models in survey research, Natural Language Processing Journal, 4, 100020
  20. [20] Ko, H. and J. Lee (2023): Can Chatgpt Improve Investment Decision? From a Portfolio Management Perspective, Working Paper
  21. [21] Le Mens, G., B. Kovács, M. T. Hannan, and G. Pros (2023): Uncovering the semantics of concepts using GPT-4, Proceedings of the National Academy of Sciences, 120, e2309350120
  22. [22] Li, C., J. Wang, Y. Zhang, K. Zhu, W. Hou, J. Lian, F. Luo, Q. Yang, and X. Xie (2023): Large Language Models Understand and Can be Enhanced by Emotional Stimuli, Working Paper
  23. [23] Lopez-Lira, A. and Y. Tang (2023): Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models, Working Paper
  24. [24] Lu, F., L. Huang, and S. Li (2023): ChatGPT, Generative AI, and Investment Advisory, Working Paper
  25. [26] Niszczota, P. and S. Abbas (2023): GPT has become financially literate: Insights from financial literacy tests of GPT and a preliminary test of how people use it as a source of advice, Finance Research Letters, 58, 104333
  26. [27] Nori, H., Y. T. Lee, S. Zhang, D. Carignan, R. Edgar, N. Fusi, N. King, J. Larson, Y. Li, W. Liu, et al. (2023): Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine, arXiv preprint arXiv:2311.16452
  27. [28] Noy, S. and W. Zhang (2023): Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence, Science, 381, 187-192
  28. [29] OpenAI (2023a): OpenAI API Reference, https://platform.openai.com/docs/api-reference/, accessed: 2023-12-15
  29. [30] OpenAI (2023b): OpenAI Customer Stories, https://openai.com/customer-stories/, accessed: 2023-12-15
  30. [31] OpenAI (2023c): OpenAI Documentation, https://platform.openai.com/docs/, accessed: 2023-12-15
  31. [32] Rabin, M. (2000): Risk Aversion and Expected-Utility Theory: A Calibration Theorem, Econometrica, 68, 1281-1292
  32. [33] Romanko, O., A. Narayan, and R. H. Kwon (2023): ChatGPT-based Investment Portfolio Selection, Working Paper
  33. [34] Routledge, B. R. and S. E. Zin (2010): Generalized Disappointment Aversion and Asset Prices, The Journal of Finance, 65, 1303-1332
  34. [35] Schreiner, M. (2023): GPT-4 architecture, datasets, costs and more leaked, The Decoder
  35. [36] Susarla, A., R. Gopal, J. B. Thatcher, and S. Sarker (2023): The Janus Effect of Generative AI: Charting the Path for Responsible Conduct of Scholarly Activities in Information Systems, Information Systems Research, 34, 399-408
  36. [37] Webb, T., K. Holyoak, and H. Lu (2023): Emergent analogical reasoning in large language models, Nature Human Behavior
  37. [38] Wiles, E. and J. J. Horton (2023): The Impact of AI Writing Assistance on Job Posts and the Supply of Jobs on an Online Labor Market, Working Paper