arxiv: 2602.01022 · v3 · submitted 2026-02-01 · econ.GN · cs.AI · q-fin.EC

Calibrating Behavioral Parameters with Large Language Models

Pith reviewed 2026-05-16 08:54 UTC · model grok-4.3

classification econ.GN · cs.AI · q-fin.EC
keywords behavioral finance, large language models, loss aversion, herding, extrapolation, agent-based models, calibration, asset pricing

The pith

Large language models prompted with behavioral profiles can serve as measurement instruments for asset pricing, yielding loss aversion, herding, and extrapolation parameters at or above human benchmark magnitudes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework that uses large language models as measurement instruments for behavioral parameters that are hard to observe directly in asset markets. It first documents that baseline LLM responses display too much rationality, with weaker loss aversion, herding, and disposition effects than human data. Profile-based prompting then produces large, stable shifts that bring the parameters for loss aversion, herding, extrapolation, and anchoring into or beyond observed human ranges. When the calibrated extrapolation parameter is inserted into an agent-based asset pricing model, the resulting price paths display short-horizon momentum and long-horizon reversal that match empirical patterns.
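The momentum-then-reversal mechanism can be illustrated with a toy impulse-response simulation. This is a minimal sketch, not the paper's agent-based model: the update rule and the values of `beta`, `theta`, and `kappa` are hypothetical choices made for illustration. Extrapolators chase a smoothed estimate of recent price changes, while fundamentalists pull the price back toward fundamental value.

```python
def simulate_impulse(beta=0.9, theta=0.5, kappa=0.05, steps=200):
    """Price deviation from fundamental value after a one-off unit shock.

    All parameter values are illustrative assumptions, not calibrated estimates.
    beta  : weight extrapolators put on their trend estimate
    theta : speed at which the trend estimate absorbs the latest price change
    kappa : strength of the fundamentalist pull back toward value
    """
    prices = [0.0]
    trend = 0.0
    for t in range(1, steps + 1):
        # Most recent observed price change (zero before two prices exist).
        last_change = prices[-1] - prices[-2] if len(prices) >= 2 else 0.0
        # Extrapolators update an exponentially weighted trend estimate.
        trend = (1 - theta) * trend + theta * last_change
        shock = 1.0 if t == 1 else 0.0
        prices.append((1 - kappa) * prices[-1] + beta * trend + shock)
    return prices


path = simulate_impulse()
returns = [b - a for a, b in zip(path, path[1:])]
peak = max(range(len(path)), key=path.__getitem__)

print("step of peak price deviation:", peak)
print("return in the step after the shock:", round(returns[1], 3))
print("most negative later return:", round(min(returns), 3))
```

With these illustrative values the price keeps rising for several steps after the shock (short-horizon continuation), then turns and oscillates back toward fundamental value with shrinking amplitude (long-horizon reversal). Lowering `beta` toward zero removes the overshoot, which is why the size of the extrapolation parameter matters for the dynamics.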

Core claim

Profile-based calibration of LLMs induces large, stable, and theoretically coherent shifts in behavioral parameters, with calibrated loss aversion, herding, extrapolation, and anchoring reaching or exceeding benchmark magnitudes, and calibrated extrapolation in an agent-based asset pricing model generates short-horizon momentum and long-horizon reversal patterns consistent with empirical evidence.

What carries the argument

Profile-based prompting that treats LLMs as calibrated measurement instruments for eight canonical behavioral biases.

If this is right

  • Calibrated parameters reach or exceed human benchmark magnitudes for loss aversion, herding, extrapolation, and anchoring.
  • Calibrated extrapolation in an agent-based asset pricing model produces short-horizon momentum and long-horizon reversal consistent with empirical evidence.
  • The framework supplies explicit measurement ranges and boundaries for eight canonical behavioral biases.
  • Baseline LLM behavior exhibits systematic rationality bias including attenuated loss aversion and near-zero disposition effects.
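For readers unfamiliar with how a disposition effect is usually quantified, the sketch below computes the proportion of gains realized (PGR) and proportion of losses realized (PLR) in the style of Odean (1998). Whether the paper uses exactly this measure is an assumption, and the trade counts are hypothetical.

```python
def disposition_measures(realized_gains, paper_gains, realized_losses, paper_losses):
    """Return (PGR, PLR, PGR - PLR, PGR / PLR) from counts of positions.

    PGR = realized gains / (realized gains + paper gains)
    PLR = realized losses / (realized losses + paper losses)
    A difference near 0 (ratio near 1) means no disposition effect.
    """
    pgr = realized_gains / (realized_gains + paper_gains)
    plr = realized_losses / (realized_losses + paper_losses)
    return pgr, plr, pgr - plr, pgr / plr


# Hypothetical counts, chosen only to illustrate the arithmetic.
pgr, plr, diff, ratio = disposition_measures(150, 850, 100, 900)
print(f"PGR={pgr:.2f} PLR={plr:.2f} difference={diff:.2f} ratio={ratio:.2f}")
```

An agent that sells winners and losers at the same rate would show a difference of zero and a ratio of one, which is the sense in which a "near-zero" disposition effect is close to the rational benchmark.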

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could allow researchers to generate large populations of heterogeneous agents with controlled bias profiles without new surveys or experiments.
  • If the calibration functions prove stable, they could be reused across different market models to study interactions among multiple biases simultaneously.
  • The method might extend to calibrating behavioral parameters in macroeconomic or policy simulation models where direct measurement is equally difficult.

Load-bearing premise

That prompting LLMs with behavioral profiles produces parameters that remain stable across models, scenarios, and time, and that inserting those parameters into agent-based models yields dynamics that reflect human behavior rather than artifacts of the prompting process.

What would settle it

Running the same profile prompts on multiple LLMs at different times and finding that the extracted parameters for loss aversion or extrapolation vary by more than the reported stability margin, or finding that the agent-based model with calibrated extrapolation fails to produce momentum and reversal patterns when tested against new market data.

Original abstract

Behavioral parameters such as loss aversion, herding, and extrapolation are central to asset pricing models but remain difficult to measure reliably. We develop a framework that treats large language models (LLMs) as calibrated measurement instruments for behavioral parameters. Using four models and 24,000 agent-scenario pairs, we document systematic rationality bias in baseline LLM behavior, including attenuated loss aversion, weak herding, and near-zero disposition effects relative to human benchmarks. Profile-based calibration induces large, stable, and theoretically coherent shifts in several parameters, with calibrated loss aversion, herding, extrapolation, and anchoring reaching or exceeding benchmark magnitudes. To assess external validity, we embed calibrated parameters in an agent-based asset pricing model, where calibrated extrapolation generates short-horizon momentum and long-horizon reversal patterns consistent with empirical evidence. Our results establish measurement ranges, calibration functions, and explicit boundaries for eight canonical behavioral biases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper develops a framework treating LLMs as calibrated instruments for measuring behavioral parameters (loss aversion, herding, extrapolation, anchoring) in asset pricing. Using four models and 24,000 agent-scenario pairs, it reports baseline LLM rationality biases relative to human benchmarks, shows profile-based calibration produces large, stable, theoretically coherent parameter shifts reaching or exceeding benchmarks, and validates by embedding calibrated parameters in an agent-based asset pricing model where extrapolation generates short-horizon momentum and long-horizon reversal consistent with empirical evidence. The work establishes measurement ranges and explicit boundaries for eight biases.

Significance. If the calibration functions prove robust, the approach could supply a scalable method for quantifying parameters that are otherwise difficult to measure directly, improving the micro-foundations of agent-based models in finance. The multi-model design and ABM embedding step are constructive elements that ground the claims in both measurement and dynamic implications.

major comments (3)
  1. [Abstract] Abstract: the central claim that profile-based calibration induces 'large, stable, and theoretically coherent shifts' rests on summarized outcomes; the abstract supplies no prompting protocols, statistical significance tests, robustness checks to model choice, or exclusion criteria, leaving the reliability of the calibration functions unevaluated.
  2. [Results] Results section: results are reported from four LLMs and 24,000 pairs but no cross-model variance statistics or consistency metrics for the calibration mappings are provided, so the invariance assumption required for treating LLMs as stable measurement instruments remains untested.
  3. [ABM validation] ABM validation section: calibrated extrapolation is shown to generate momentum and reversal patterns 'consistent with empirical evidence,' yet both the calibration targets (human benchmarks) and the validation targets (market patterns) derive from observed human behavior, creating a moderate circularity risk that weakens the external-validity interpretation.
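The first major comment asks for significance tests on the parameter shifts, and the rebuttal names t-tests and Wilcoxon tests. A minimal sketch of the two test statistics follows, assuming a paired design in which each agent is measured once at baseline and once after calibration. The pairing and the data are assumptions for illustration; the code computes test statistics only, not p-values.

```python
import math
import statistics


def paired_t_statistic(baseline, calibrated):
    """t statistic for the mean paired difference (calibrated - baseline)."""
    diffs = [c - b for b, c in zip(baseline, calibrated)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


def wilcoxon_w_plus(baseline, calibrated):
    """Wilcoxon signed-rank W+: sum of ranks of the positive differences.

    Zero differences are dropped; tied absolute differences share their
    average rank.
    """
    diffs = [c - b for b, c in zip(baseline, calibrated) if c != b]
    ordered = sorted(diffs, key=abs)
    w_plus = 0.0
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # ranks i+1..j share their mean
        w_plus += average_rank * sum(1 for d in ordered[i:j] if d > 0)
        i = j
    return w_plus


# Hypothetical loss-aversion estimates for six agents (not data from the paper).
baseline = [1.1, 1.0, 1.2, 0.9, 1.3, 1.0]
calibrated = [2.0, 2.3, 1.9, 2.4, 2.1, 2.2]

t_stat = paired_t_statistic(baseline, calibrated)
w_plus = wilcoxon_w_plus(baseline, calibrated)
print("paired t statistic:", round(t_stat, 2))
print("Wilcoxon W+:", w_plus)
```

Reporting both is a common choice because the t-test assumes roughly normal differences while the signed-rank test does not. When every agent shifts upward, as in this toy data, W+ takes its maximum value of n(n+1)/2.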
minor comments (2)
  1. [Abstract] The abstract's source writes the sample size as '24{,}000', which shows stray braces in plain-text listings; adopt consistent numeric formatting throughout.
  2. [Introduction] Define the eight canonical behavioral biases with explicit functional forms or references in the main text before presenting calibration results.
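As an illustration of what an explicit functional form looks like for one of the eight biases, the standard prospect-theory value function of Tversky and Kahneman (1992) is shown below. Whether the paper parameterizes loss aversion this way is an assumption, since this page does not reproduce its equations.

```latex
v(x) =
\begin{cases}
  x^{\alpha},               & x \ge 0, \\
  -\lambda\,(-x)^{\beta},   & x < 0,
\end{cases}
\qquad \lambda > 1 \ \text{indicates loss aversion.}
```

Tversky and Kahneman report median estimates of \alpha = \beta = 0.88 and \lambda = 2.25, so \lambda = 1 corresponds to the "rational" case in which losses and gains of equal size are weighted equally.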

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below. Revisions have been made to strengthen the presentation of methods, add cross-model statistics, and clarify the validation logic. We believe these changes improve the manuscript without altering its core claims.

point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that profile-based calibration induces 'large, stable, and theoretically coherent shifts' rests on summarized outcomes; the abstract supplies no prompting protocols, statistical significance tests, robustness checks to model choice, or exclusion criteria, leaving the reliability of the calibration functions unevaluated.

    Authors: We agree the abstract is highly condensed. Due to length limits, it summarizes rather than details protocols. Prompting templates, exact statistical tests (t-tests and Wilcoxon on parameter shifts), model-robustness tables, and exclusion rules (e.g., responses with <80% coherence) are fully reported in Sections 2.2, 3.1, and 4.1. We have revised the abstract to add one sentence noting the four-model design, 24,000-pair sample, and robustness across LLMs. revision: yes

  2. Referee: [Results] Results section: results are reported from four LLMs and 24,000 pairs but no cross-model variance statistics or consistency metrics for the calibration mappings are provided, so the invariance assumption required for treating LLMs as stable measurement instruments remains untested.

    Authors: We accept this point. The original draft reported only pooled results. We have added a new subsection (3.3) that computes (i) the standard deviation of each calibrated parameter across the four models, (ii) pairwise correlations of the calibration functions, and (iii) a consistency index (the fraction of parameters whose sign and significance agree across models). These metrics indicate low cross-model variance for loss aversion, herding, and extrapolation, supporting the invariance assumption. The revised tables are now included. revision: yes

  3. Referee: [ABM validation] ABM validation section: calibrated extrapolation is shown to generate momentum and reversal patterns 'consistent with empirical evidence,' yet both the calibration targets (human benchmarks) and the validation targets (market patterns) derive from observed human behavior, creating a moderate circularity risk that weakens the external-validity interpretation.

    Authors: We disagree that this constitutes circularity. Calibration targets are micro-level parameters recovered from controlled laboratory experiments (Kahneman & Tversky 1979; Barberis et al. 2016). Validation targets are macro-level return patterns documented in asset-pricing studies (Jegadeesh & Titman 1993; De Bondt & Thaler 1985). The ABM tests whether parameters fitted to individual experimental data can reproduce aggregate market regularities—an explicit micro-to-macro mapping that is not tautological. We have added a clarifying paragraph in Section 5.2 distinguishing the two data sources and noting that market patterns were never used in calibration. revision: partial
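The three cross-model metrics described in the second response can be sketched as follows. This is one possible reading under stated assumptions: a calibration function is represented as a list of parameter values over a shared grid of profile intensities, "agreement" means every model shows the same sign and the same significance verdict at `alpha`, and all numbers are hypothetical.

```python
import math
import statistics
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cross_model_sd(estimates):
    """(i) Sample standard deviation of each parameter across models."""
    return {name: statistics.stdev(values) for name, values in estimates.items()}


def pairwise_correlations(curves):
    """(ii) Correlation between each pair of per-model calibration curves."""
    return {(a, b): pearson(curves[a], curves[b])
            for a, b in combinations(sorted(curves), 2)}


def consistency_index(shifts, alpha=0.05):
    """(iii) Share of parameters on which all models agree in sign and significance."""
    agree = 0
    for results in shifts.values():  # one (shift, p_value) pair per model
        signs = {shift > 0 for shift, _ in results}
        verdicts = {p < alpha for _, p in results}
        agree += len(signs) == 1 and len(verdicts) == 1
    return agree / len(shifts)


# Hypothetical inputs for four models (not values from the paper).
estimates = {"loss_aversion": [2.0, 2.2, 2.1, 2.3], "herding": [0.60, 0.55, 0.65, 0.60]}
curves = {
    "model_a": [1.0, 1.5, 2.0, 2.5],
    "model_b": [1.1, 1.4, 2.1, 2.4],
    "model_c": [0.9, 1.6, 1.9, 2.6],
}
shifts = {
    "loss_aversion": [(0.9, 0.001), (1.1, 0.002), (1.0, 0.001), (1.2, 0.004)],
    "anchoring": [(0.3, 0.010), (-0.1, 0.400), (0.2, 0.030), (0.1, 0.200)],
}

sds = cross_model_sd(estimates)
correlations = pairwise_correlations(curves)
index = consistency_index(shifts)
print(sds)
print(correlations)
print("consistency index:", index)
```

In this toy input the models agree on loss aversion but disagree on anchoring, so the consistency index is one half. A reader assessing the invariance claim would want all three quantities reported per parameter rather than pooled.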

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper calibrates LLM outputs to match external human behavioral benchmarks for parameters such as loss aversion and extrapolation, then embeds the resulting values in a standard agent-based asset pricing model to check whether they reproduce known aggregate market patterns (short-horizon momentum, long-horizon reversal). These steps are independent: the calibration targets are micro-level individual biases drawn from separate human-subject studies, while the validation targets are macro-level price dynamics from market data. No equations, definitions, or self-citations reduce any claimed result to its own inputs by construction. The framework therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLMs can be made to proxy human behavioral parameters through prompting and profile calibration, with no new physical entities introduced and only one class of free parameters (the profile descriptors).

free parameters (1)
  • profile descriptors for calibration
    Short textual profiles added to prompts to shift LLM responses toward human benchmark magnitudes; their exact content and selection criteria are not specified in the abstract.
axioms (1)
  • domain assumption: LLMs can simulate human-like decision biases when appropriately prompted and calibrated
    Invoked when the authors treat baseline and calibrated LLM outputs as direct measurements of behavioral parameters.

pith-pipeline@v0.9.0 · 5445 in / 1456 out tokens · 34145 ms · 2026-05-16T08:54:10.811978+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

  1. Gati Aher, Rosa I Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. International Conference on Machine Learning, pages 337–371, 2023.
  2. Lisa R Anderson and Charles A Holt. Information cascades in the laboratory. American Economic Review, 87(5):847–862, 1997.
  3. Lisa P Argyle, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351, 2023.
  4. Nicholas Barberis and Ming Huang. Stocks as lotteries: The implications of probability weighting for security prices. American Economic Review, 98(5):2066–2100, 2008.
  5. Nicholas Barberis, Andrei Shleifer, and Robert Vishny. A model of investor sentiment. Journal of Financial Economics, 49(3):307–343, 1998.
  6. Nicholas Barberis, Robin Greenwood, Lawrence Jin, and Andrei Shleifer. Extrapolation and bubbles. Journal of Financial Economics, 129(2):203–227, 2018.
  7. Nicholas C Barberis. Thirty years of prospect theory in economics: A review and assessment. Journal of Economic Perspectives, 27(1):173–196, 2013.
  8. Victor L Bernard and Jacob K Thomas. Post-earnings-announcement drift: Delayed price response or risk premium? Journal of Accounting Research, 27:1–36, 1989.
  9. Sushil Bikhchandani, David Hirshleifer, and Ivo Welch. A theory of fads, fashion, custom, and cultural change as informational cascades. Journal of Political Economy, 100(5):992–1026, 1992.
  10. Robert Bloomfield and Jeffrey Hales. Predicting the next step of a random walk: Experimental evidence of regime-shifting beliefs. Journal of Financial Economics, 65(3):397–414, 2002.
  11. James Brand, Ayelet Israeli, and Donald Ngwe. Using GPT for market research. Marketing Unit Working Paper 23-062, Harvard Business School, 2023.
  12. William A Brock and Cars H Hommes. Heterogeneous beliefs and routes to chaos in a simple asset pricing model. Journal of Economic Dynamics and Control, 22(8-9):1235–1274, 1998.
  13. Colin F Camerer. The promise and success of lab-field generalizability in experimental economics: A critical reply to Levitt and List. Available at SSRN 1977749, 2011.
  14. Bogachan Celen and Shachar Kariv. Distinguishing informational cascades from herd behavior in the laboratory. American Economic Review, 94(3):484–498, 2004.
  15. Kent Daniel, David Hirshleifer, and Avanidhar Subrahmanyam. Investor psychology and security market under- and overreactions. Journal of Finance, 53(6):1839–1885, 1998.
  16. Thomas Dohmen, Armin Falk, David Huffman, Uwe Sunde, Jürgen Schupp, and Gert G Wagner. Individual risk attitudes: Measurement, determinants, and behavioral consequences. Journal of the European Economic Association, 9(3):522–550, 2011.
  17. Robin Greenwood and Andrei Shleifer. Expectations of returns and expected returns. Review of Financial Studies, 27(3):714–746, 2014.
  18. Charles A Holt and Susan K Laury. Risk aversion and incentive effects. American Economic Review, 92(5):1644–1655, 2002.
  19. John J Horton. Large language models as simulated economic agents: What can we learn from homo silicus? National Bureau of Economic Research Working Paper, (31122), 2023.
  20. John J Horton. Can large language models simulate human behavior in economic experiments? Working Paper, 2024.
  21. Narasimhan Jegadeesh and Sheridan Titman. Returns to buying winners and selling losers: Implications for stock market efficiency. Journal of Finance, 48(1):65–91, 1993.
  22. Daniel Kahneman and Amos Tversky. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–291, 1979.
  23. Daniel Kahneman, Jack L Knetsch, and Richard H Thaler. Experimental tests of the endowment effect and the Coase theorem. Journal of Political Economy, 98(6):1325–1348, 1990.
  24. Michael P Keane. Structural vs. atheoretic approaches to econometrics. Journal of Econometrics, 156(1):3–20, 2011.
  25. Josef Lakonishok, Andrei Shleifer, and Robert W Vishny. Contrarian investment, extrapolation, and risk. Journal of Finance, 49(5):1541–1578, 1994.
  26. Blake LeBaron. Empirical regularities from interacting long- and short-memory investors in an agent-based stock market. IEEE Transactions on Evolutionary Computation, 5(5):442–455, 2001.
  27. Thomas Lux and Michele Marchesi. Scaling and criticality in a stochastic multi-agent model of a financial market. Nature, 397(6719):498–500, 1999.
  28. Qing Mei et al. Quantifying and mitigating memorization in large language models. arXiv preprint, 2024.
  29. Don A Moore and Paul J Healy. The trouble with overconfidence. Psychological Review, 115(2):502–517, 2008.
  30. Thomas Mussweiler, Fritz Strack, and Tim Pfeiffer. Overcoming the inevitable anchoring effect: Considering the opposite compensates for selective accessibility. Personality and Social Psychology Bulletin, 26(9):1142–1150, 2000.
  31. Gregory B Northcraft and Margaret A Neale. Experts, amateurs, and real estate: An anchoring-and-adjustment perspective on property pricing decisions. Organizational Behavior and Human Decision Processes, 39(1):84–97, 1987.
  32. Nathan Novemsky and Daniel Kahneman. The boundaries of loss aversion. Journal of Marketing Research, 42(2):119–128, 2005.
  33. Terrance Odean. Are investors reluctant to realize their losses? Journal of Finance, 53(5):1775–1798, 1998.
  34. Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In UIST, 2023.
  35. Hersh Shefrin and Meir Statman. The disposition to sell winners too early and ride losers too long: Theory and evidence. Journal of Finance, 40(3):777–790, 1985.
  36. Amos Tversky and Daniel Kahneman. Judgment under uncertainty: Heuristics and biases. Science, 185(4157):1124–1131, 1974.
  37. Amos Tversky and Daniel Kahneman. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4):297–323, 1992.
  38. Martin Weber and Colin F Camerer. The disposition effect in securities trading: An experimental analysis. Journal of Economic Behavior & Organization, 33(2):167–184, 1998.
