Can an LLM Learn Preferences from Choice Data?
Pith reviewed 2026-05-24 04:52 UTC · model grok-4.3
The pith
Large language models improve at generating recommendations consistent with learned preferences as they observe more choices, though success varies by model and preference type.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper proposes a portable Simulate-Recommend-Evaluate framework that tests preference learning from revealed-choice data by comparing LLM recommendations with optimal choices implied by known preference primitives of the disappointment aversion model. Recommendation accuracy improves as models observe more choices, but learning is heterogeneous across preference types and LLMs: GPT learns risk aversion better than disappointment aversion, Gemini performs best in high disappointment-aversion regions, and Claude shows the broadest effective learning across parameter regions.
What carries the argument
Simulate-Recommend-Evaluate framework that benchmarks LLM outputs against exact optimal choices from the disappointment aversion model.
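The three stages can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses binary lotteries, CRRA utility, the closed-form DA value for two-outcome lotteries, and a placeholder recommender standing where an LLM call would go. All function names and parameter values are hypothetical.

```python
import random

def da_value(lottery, beta, rho):
    """DA value (in utility units) of a binary lottery (low, high, prob_high).

    Assumes CRRA utility with rho != 1 and the two-outcome closed form of the
    disappointment aversion model, where the low outcome is the disappointing one.
    """
    low, high, prob_high = lottery
    u = lambda x: x ** (1 - rho) / (1 - rho)
    numerator = prob_high * u(high) + (1 - prob_high) * (1 + beta) * u(low)
    return numerator / (1 + (1 - prob_high) * beta)

def optimal_choice(lottery_a, lottery_b, beta, rho):
    """Ground truth: the option with the higher DA value (ties go to A)."""
    return "A" if da_value(lottery_a, beta, rho) >= da_value(lottery_b, beta, rho) else "B"

def random_problem(rng):
    """Draw a pair of binary lotteries with positive payoffs."""
    def lottery():
        low, high = sorted(rng.uniform(1.0, 100.0) for _ in range(2))
        return (low, high, rng.uniform(0.1, 0.9))
    return lottery(), lottery()

def simulate(n, beta, rho, rng):
    """Simulate: record n choices made by an agent with known (beta, rho)."""
    problems = [random_problem(rng) for _ in range(n)]
    return [(a, b, optimal_choice(a, b, beta, rho)) for a, b in problems]

def expected_value_recommender(history, problem):
    """Recommend: placeholder for an LLM call; ignores history, maximizes EV."""
    ev = lambda lot: lot[2] * lot[1] + (1 - lot[2]) * lot[0]
    a, b = problem
    return "A" if ev(a) >= ev(b) else "B"

def evaluate(recommender, history, test_problems, beta, rho):
    """Evaluate: share of recommendations that match the optimal choice."""
    hits = sum(recommender(history, (a, b)) == optimal_choice(a, b, beta, rho)
               for a, b in test_problems)
    return hits / len(test_problems)

rng = random.Random(0)
BETA, RHO = 0.5, 0.5  # hypothetical preference parameters
history = simulate(20, BETA, RHO, rng)
test_problems = [random_problem(rng) for _ in range(200)]
accuracy = evaluate(expected_value_recommender, history, test_problems, BETA, RHO)
print(f"placeholder accuracy: {accuracy:.2f}")
```

Swapping `expected_value_recommender` for a function that formats `history` into a prompt and parses the model's reply would give the LLM version. Repeating `evaluate` for growing history lengths would trace out the learning curve.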
If this is right
- Recommendation accuracy increases with the number of observed choices provided to the LLM.
- Different LLMs exhibit distinct patterns of effective learning across risk aversion and disappointment aversion parameters.
- LLMs can produce recommendations in unseen choice situations that align more closely with the underlying preference model after exposure to sufficient data.
- Heterogeneity implies that model selection may matter for applications involving specific preference structures.
Where Pith is reading between the lines
- The framework could be applied to other preference models beyond disappointment aversion to map LLM strengths more broadly.
- If heterogeneity persists, hybrid systems that route tasks to particular LLMs based on expected preference type might improve overall performance.
- Economic experiments using LLMs as stand-ins for human subjects would need to account for these model-specific biases in learning.
Load-bearing premise
The optimal choices implied by the disappointment aversion model's preference primitives can be computed exactly to provide an objective ground-truth benchmark.
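For reference, the benchmark in question is usually written as follows. This is the standard implicit statement of the disappointment aversion value for a finite lottery, following Gul (1991). The symbols $x_i$, $p_i$, $u$, $V$, and $D(V)$ are notation assumed here and may differ from the paper's own.

```latex
% Lottery with outcomes x_i and probabilities p_i, utility u, parameter \beta > -1.
% D(V) is the set of disappointing outcomes, those with utility below V.
\[
V \;=\; \frac{\sum_{i} p_i\,u(x_i) \;+\; \beta \sum_{i \in D(V)} p_i\,u(x_i)}
             {1 \;+\; \beta \sum_{i \in D(V)} p_i},
\qquad D(V) = \{\, i : u(x_i) < V \,\},
\qquad \mathrm{CE} = u^{-1}(V).
\]
```

Because $V$ appears on both sides through $D(V)$, the value is defined implicitly. Setting $\beta = 0$ reduces it to expected utility, and $\beta > 0$ overweights disappointing outcomes. The optimal choice in a menu is the option with the highest $V$.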
What would settle it
LLM recommendation accuracy that stays flat or falls as the number of observed choices increases would contradict the reported improvement.
Original abstract
Can large language models (LLMs) learn a decision maker's preferences from observed choices and generate preference-consistent recommendations in new situations? We propose a portable Simulate-Recommend-Evaluate framework that tests preference learning from revealed-choice data by comparing LLM recommendations with optimal choices implied by known preference primitives. We apply the framework to choice under uncertainty using the disappointment aversion model. Recommendation accuracy improves as models observe more choices, but learning is heterogeneous across preference types and LLMs: GPT learns risk aversion better than disappointment aversion, Gemini performs best in high disappointment-aversion regions, and Claude shows the broadest effective learning across parameter regions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a Simulate-Recommend-Evaluate framework to test whether LLMs can infer preferences from revealed choice data and produce consistent recommendations in new settings. It applies the framework to choice under uncertainty using the disappointment aversion (DA) model of Gul (1991), comparing LLM outputs against optimal choices derived from known DA parameters. The central empirical claim is that recommendation accuracy rises with the number of observed choices, but learning is heterogeneous: GPT learns risk aversion better than disappointment aversion, Gemini excels in high-DA regions, and Claude shows the broadest learning across parameter space.
Significance. If the ground-truth benchmark is reliable and the heterogeneity patterns are robust, the portable framework could provide a replicable method for evaluating preference learning in LLMs against explicit economic primitives, with potential applications in AI-assisted decision making. The approach is notable for its use of independently computed optima rather than fitted parameters.
Major comments (2)
- [Methods (benchmark computation)] Methods section describing the benchmark computation: the paper must specify the numerical procedure (fixed-point solver, optimization routine, or closed-form approximation) used to obtain the 'optimal choices implied by known preference primitives' for the DA model. Because DA certainty equivalents generally require solving a fixed-point equation involving the reference point and disappointment parameter β, any unreported tolerance, convergence criterion, or sensitivity analysis risks misattributing small benchmark errors to LLM performance differences, particularly in the heterogeneous regions highlighted for GPT, Gemini, and Claude.
- [Results (heterogeneity analysis)] Results on heterogeneity (e.g., the GPT vs. Gemini vs. Claude comparisons): without reported sample sizes per parameter region, choice-generation procedure, prompting details, or statistical tests for the accuracy differences, it is impossible to assess whether the reported patterns (GPT better on risk aversion, Gemini on high-DA) are statistically distinguishable from noise or from variation in the numerical ground truth.
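The fixed-point computation raised in the first major comment can be sketched as follows. This is one possible procedure, not the authors' reported method: it assumes CRRA utility, rewrites the DA condition as V = E[u] - β·E[max(V - u, 0)], and solves it by bisection in utility units. The function names and the default tolerance are illustrative assumptions.

```python
import math

def crra(x, rho):
    """CRRA utility; rho == 1 is the log case."""
    return math.log(x) if rho == 1 else x ** (1 - rho) / (1 - rho)

def crra_inverse(u, rho):
    """Inverse of crra, mapping a utility level back to a payoff."""
    return math.exp(u) if rho == 1 else (u * (1 - rho)) ** (1 / (1 - rho))

def da_certainty_equivalent(outcomes, probs, beta, rho, tol=1e-6):
    """Certainty equivalent of a finite lottery under disappointment aversion.

    Solves gap(V) = V - E[u] + beta * E[max(V - u, 0)] = 0 by bisection.
    For beta > -1 the gap is strictly increasing in V and changes sign on
    [min u, max u], so the root is unique. tol is in utility units.
    """
    if beta <= -1:
        raise ValueError("beta must be greater than -1")
    utils = [crra(x, rho) for x in outcomes]
    expected_u = sum(p * u for p, u in zip(probs, utils))

    def gap(v):
        shortfall = sum(p * max(v - u, 0.0) for p, u in zip(probs, utils))
        return v - expected_u + beta * shortfall

    lo, hi = min(utils), max(utils)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            hi = mid
        else:
            lo = mid
    return crra_inverse(0.5 * (lo + hi), rho)

# Linear utility (rho = 0), 50/50 lottery over 10 and 20, beta = 1:
# the closed form gives (0.5*20 + 0.5*2*10) / 1.5 = 13.33...
ce = da_certainty_equivalent([10.0, 20.0], [0.5, 0.5], beta=1.0, rho=0.0)
print(round(ce, 3))  # approximately 13.333
```

Rerunning the benchmark with a tighter `tol` and checking that no optimal choice flips is one simple form of the sensitivity analysis the comment asks for.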
Minor comments (2)
- [Abstract] The abstract states results on accuracy and heterogeneity but supplies no information on choice generation, prompting, sample sizes, or controls; the full paper should ensure these details appear in the main text or a dedicated methods subsection.
- [Model description] Notation for the DA parameters (e.g., β, reference point) should be defined at first use and kept consistent with Gul (1991) to avoid ambiguity when describing the ground-truth computation.
Simulated Author's Rebuttal
Thank you for the constructive comments on the benchmark computation and heterogeneity analysis. We address each below.
- Referee: Methods section describing the benchmark computation: the paper must specify the numerical procedure (fixed-point solver, optimization routine, or closed-form approximation) used to obtain the 'optimal choices implied by known preference primitives' for the DA model. Because DA certainty equivalents generally require solving a fixed-point equation involving the reference point and disappointment parameter β, any unreported tolerance, convergence criterion, or sensitivity analysis risks misattributing small benchmark errors to LLM performance differences, particularly in the heterogeneous regions highlighted for GPT, Gemini, and Claude.
  Authors: We agree that the numerical procedure must be fully specified to avoid potential misattribution of errors. The revised manuscript will add an explicit description of the fixed-point solver (including the iterative algorithm, tolerance of 1e-6, convergence criterion, and any sensitivity checks), ensuring the benchmark is transparent and reproducible.
  Revision: yes
- Referee: Results on heterogeneity (e.g., the GPT vs. Gemini vs. Claude comparisons): without reported sample sizes per parameter region, choice-generation procedure, prompting details, or statistical tests for the accuracy differences, it is impossible to assess whether the reported patterns (GPT better on risk aversion, Gemini on high-DA) are statistically distinguishable from noise or from variation in the numerical ground truth.
  Authors: We acknowledge the need for these details to support statistical assessment. The revision will report sample sizes per region, the full choice-generation procedure, prompting templates, and formal statistical tests (e.g., pairwise proportion tests) for accuracy differences across LLMs and parameter regions.
  Revision: yes
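A pairwise proportion test of the kind mentioned in the response could look like the sketch below. It is a standard pooled two-proportion z-test written with the Python standard library only. The function name and the counts in the example are hypothetical and are not results from the paper.

```python
import math

def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided pooled z-test for a difference between two accuracy rates.

    hits_* are counts of correct recommendations, n_* the number of test
    problems. Returns (z, p_value) under the normal approximation.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0.0:
        raise ValueError("test is undefined when pooled accuracy is 0 or 1")
    z = (p_a - p_b) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: model A correct on 80 of 100, model B on 60 of 100.
z, p_value = two_proportion_z_test(80, 100, 60, 100)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

With several model pairs and parameter regions compared at once, a multiple-comparison correction would also be needed. That step is outside this sketch.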
Circularity Check
No significant circularity; evaluation uses external model-derived ground truth
The paper's Simulate-Recommend-Evaluate framework generates synthetic choice data from fixed disappointment-aversion parameters, supplies it to LLMs, and scores LLM outputs against optimal choices computed directly from those same known parameters. This structure treats the preference model as an independent benchmark rather than deriving the benchmark from LLM behavior or fitting LLM outputs back into the evaluation metric. No equations redefine predictions as inputs, no self-citations carry the central claim, and the comparison chain remains self-contained against the external model primitives.
Reference graph
Works this paper leans on
- [1] Afriat, S. N. (1967): The Construction of Utility Functions from Expenditure Data, International Economic Review, 8, 67–77
- [2] Afriat, S. N. (1973): On a System of Inequalities in Demand Analysis: An Extension of the Classical Method, International Economic Review, 14, 460–472
- [4] Brookins, P. and J. DeBacker (2023): Playing games with GPT: What can we learn about a large language model from canonical strategic games, Working Paper
- [5] Chambers, C. P. and F. Echenique (2016): Revealed Preference Theory, Cambridge University Press
- [6] Charness, G., B. Jabarian, and J. A. List (2023): Generation Next: Experimentation with AI, Working Paper 31679, National Bureau of Economic Research
- [7] Chen, L., M. Zaharia, and J. Zou (2023a): How Is ChatGPT’s Behavior Changing over Time, Working Paper
- [8] Chen, Y., T. X. Liu, Y. Shan, and S. Zhong (2023b): The Emergence of Economic Rationality of GPT, Working Paper
- [11] Chow, A. R. (2023): How ChatGPT Managed to Grow Faster Than TikTok or Instagram, New York Times, accessed: 2023-12-15
- [12] D’Acunto, F. and A. G. Rossi (2023): Robo-Advice: Transforming Households into Rational Economic Agents, Annual Review of Financial Economics, 15, 543–563
- [13] Echenique, F., T. Imai, and K. Saito (2023): Approximate Expected Utility Rationalization, Journal of the European Economic Association (Forthcoming)
- [14] Eloundou, T., S. Manning, P. Mishkin, and D. Rock (2023): GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models, Working Paper
- [15] Gul, F. (1991): A Theory of Disappointment Aversion, Econometrica, 59, 667–686
- [16] Guo, F. (2023): GPT Agents in Game Theory Experiments, Working Paper
- [17] Halevy, Y., D. Persitz, and L. Zrill (2018): Parametric Recoverability of Preferences, Journal of Political Economy, 126, 1558–1593
- [18] Horton, J. J. (2023): Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus? Working Paper 31122, National Bureau of Economic Research
- [20] Ko, H. and J. Lee (2023): Can Chatgpt Improve Investment Decision? From a Portfolio Management Perspective, Working Paper
- [22] Li, C., J. Wang, Y. Zhang, K. Zhu, W. Hou, J. Lian, F. Luo, Q. Yang, and X. Xie (2023): Large Language Models Understand and Can be Enhanced by Emotional Stimuli, Working Paper
- [23] Lopez-Lira, A. and Y. Tang (2023): Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models, Working Paper
- [24] Lu, F., L. Huang, and S. Li (2023): ChatGPT, Generative AI, and Investment Advisory, Working Paper
- [26] Niszczota, P. and S. Abbas (2023): GPT has become financially literate: Insights from financial literacy tests of GPT and a preliminary test of how people use it as a source of advice, Finance Research Letters, 58, 104333
- [28] Noy, S. and W. Zhang (2023): Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence, Science, 381, 187–192
- [29] OpenAI (2023a): OpenAI API Reference, https://platform.openai.com/docs/api-reference/, accessed: 2023-12-15
- [30] OpenAI (2023b): OpenAI Customer Stories, https://openai.com/customer-stories/, accessed: 2023-12-15
- [31] OpenAI (2023c): OpenAI Documentation, https://platform.openai.com/docs/, accessed: 2023-12-15
- [32] Rabin, M. (2000): Risk Aversion and Expected-Utility Theory: A Calibration Theorem, Econometrica, 68, 1281–1292
- [33] Romanko, O., A. Narayan, and R. H. Kwon (2023): ChatGPT-based Investment Portfolio Selection, Working Paper
- [34] Routledge, B. R. and S. E. Zin (2010): Generalized Disappointment Aversion and Asset Prices, The Journal of Finance, 65, 1303–1332
- [35] Schreiner, M. (2023): GPT-4 architecture, datasets, costs and more leaked, The Decoder
- [37] Webb, T., K. Holyoak, and H. Lu (2023): Emergent analogical reasoning in large language models, Nature Human Behavior
- [38] Wiles, E. and J. J. Horton (2023): The Impact of AI Writing Assistance on Job Posts and the Supply of Jobs on an Online Labor Market, Working Paper