Model Validation of Agentic AI Systems: A POMDP-Based Framework for Belief-State, Forecast, and Policy Validation

Matthew Francis Dixon

arXiv: 2606.17383 · v1 · pith:BBENICIMnew · submitted 2026-06-16 · q-fin.RM · cs.AI · cs.LG · stat.ML

Pith reviewed 2026-06-26 22:07 UTC · model grok-4.3

keywords: agentic AI, POMDP, model validation, model risk management, belief calibration, portfolio management, Black-Litterman

The pith

A POMDP framework decomposes agentic AI decisions into separate belief, forecast, and policy components for independent validation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a validation approach for autonomous AI agents that goes beyond checking prediction accuracy alone. It models decision processes as Partially Observable Markov Decision Processes to isolate information gathering, belief formation about hidden states, forecasting, action selection, and utility assessment. Each piece can then be examined on its own terms. Large language models are treated as approximate operators that update beliefs from new data. A portfolio-management example applies the approach to regime inference and Black-Litterman portfolio construction, showing that belief quality affects outcomes separately from policy rules.
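The Black-Litterman step mentioned above combines an equilibrium prior with a set of views. A minimal sketch of the standard posterior-mean formula follows. Treating belief-conditioned regime forecasts as the view vector `q` is an assumption made here for illustration, since this summary does not say how the paper connects beliefs to views. All names (`pi`, `sigma`, `P`, `q`, `omega`, `tau`) and all numbers are hypothetical.

```python
import numpy as np

def black_litterman_mean(pi, sigma, P, q, omega, tau=0.05):
    """Standard Black-Litterman posterior mean of expected returns.

    pi:    equilibrium (prior) expected returns, shape (n,)
    sigma: asset return covariance, shape (n, n)
    P:     view pick matrix, shape (k, n)
    q:     view values, shape (k,)
    omega: view uncertainty covariance, shape (k, k)
    tau:   prior scaling factor (illustrative default)
    """
    pi, sigma = np.asarray(pi, float), np.asarray(sigma, float)
    P, q, omega = np.asarray(P, float), np.asarray(q, float), np.asarray(omega, float)
    prior_precision = np.linalg.inv(tau * sigma)
    omega_inv = np.linalg.inv(omega)
    # Precision-weighted combination of the prior and the views.
    lhs = prior_precision + P.T @ omega_inv @ P
    rhs = prior_precision @ pi + P.T @ omega_inv @ q
    return np.linalg.solve(lhs, rhs)

# Hypothetical two-asset example (not data from the paper).
pi = np.array([0.04, 0.06])
sigma = np.array([[0.04, 0.01], [0.01, 0.09]])
P = np.array([[1.0, -1.0]])        # view on the spread: asset 1 minus asset 2
omega = np.array([[0.001]])

q_neutral = P @ pi                  # a view that simply restates the prior
q_regime = np.array([0.03])         # assumed belief-conditioned view

mu_neutral = black_litterman_mean(pi, sigma, P, q_neutral, omega)
mu_regime = black_litterman_mean(pi, sigma, P, q_regime, omega)
```

A view that restates the prior leaves the posterior at `pi`, while the second view pulls the spread partway toward its value. In this sketch the policy rule (the formula) is fixed and only the input view changes, which is one way to picture belief quality affecting outcomes separately from the policy.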

Core claim

The POMDP framework decomposes autonomous decision making into information, beliefs, forecasts, actions, and utility, allowing each component to be validated independently. Large language models are formalized as approximate Bayesian filtering operators, and a model-risk taxonomy is developed encompassing state-space, filtering, forecast, policy, utility-specification, and parameter risks. Empirical results in the portfolio case study indicate that latent-state inference contributes independently to decision quality and that the principal conclusions remain robust across a broad range of parameter values.

What carries the argument

POMDP decomposition of agentic processes into information acquisition, belief-state filtering, conditional forecasts, policy selection, and utility evaluation.

Load-bearing premise

Large language models can be treated as approximate Bayesian filtering operators that maintain and update beliefs over latent states.
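For reference, the operation this premise appeals to is the standard Bayesian filtering recursion for a hidden-state model. The sketch below uses generic symbols: the transition kernel $p(s_t \mid s_{t-1})$ and the observation likelihood $p(o_t \mid s_t)$ are assumed notation, not taken from the paper, and actions are omitted to match the posterior $p(s_t \mid o_{1:t})$ discussed in the referee report.

```latex
b_t(s_t) \;=\; p(s_t \mid o_{1:t})
\;=\; \frac{p(o_t \mid s_t)\,\sum_{s_{t-1}} p(s_t \mid s_{t-1})\, b_{t-1}(s_{t-1})}
           {\sum_{s'} p(o_t \mid s')\,\sum_{s_{t-1}} p(s' \mid s_{t-1})\, b_{t-1}(s_{t-1})}
```

Treating an LLM as an approximate filtering operator amounts to claiming that its output tracks $b_t$ for some such pair of transition and observation models.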

What would settle it

An ablation in the portfolio example that removes the belief-state inference step and produces no measurable change in out-of-sample portfolio performance or risk metrics would undermine the claim that separate belief validation adds value.

Figures

Figures reproduced from arXiv: 2606.17383 by Matthew Francis Dixon.

**Figure 2.** Portfolio drawdowns. The Forecasting POMDP exhibits the smallest drawdown among all strategies considered. This result suggests that the latent-state framework contributes useful information regarding downside risk and adverse market environments. The figure therefore provides validation evidence for the policy layer of the framework.

**Figure 3.** Cumulative wealth trajectories. Although the equal-weight portfolio achieves the highest terminal wealth, the Forecasting POMDP remains competitive while producing substantially superior risk-adjusted performance. This distinction illustrates the difference between maximizing return and maximizing utility.
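The drawdown comparison in Figure 2 and the terminal-wealth comparison in Figure 3 rest on two different summaries of the same wealth path. A minimal sketch of maximum drawdown, using the usual peak-to-trough definition (assumed here, since the captions do not state the exact definition used), shows how a path can end higher yet suffer a deeper drawdown. The wealth paths are hypothetical and are not data from the paper.

```python
import numpy as np

def max_drawdown(wealth):
    """Largest peak-to-trough decline of a wealth path, as a positive fraction."""
    wealth = np.asarray(wealth, dtype=float)
    running_peak = np.maximum.accumulate(wealth)
    drawdowns = 1.0 - wealth / running_peak
    return float(drawdowns.max())

# Hypothetical wealth paths for illustration only.
path_a = [1.00, 1.20, 0.90, 1.50]   # higher terminal wealth, deeper drawdown
path_b = [1.00, 1.05, 1.00, 1.10]   # lower terminal wealth, shallower drawdown

dd_a = max_drawdown(path_a)
dd_b = max_drawdown(path_b)
```

Here `path_a` finishes above `path_b` but loses a quarter of its value from its interim peak, which is the kind of return-versus-risk trade-off the captions describe.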

Original abstract

Agentic artificial intelligence systems introduce a new class of model risk. Unlike traditional predictive models, autonomous agents continuously acquire information, form beliefs regarding latent states of the environment, generate forecasts, select actions, and adapt their behavior over time. Existing validation methodologies focus primarily on predictive accuracy and therefore provide limited insight into the quality of the underlying decision process. This paper proposes a model validation framework for agentic AI based on Partially Observable Markov Decision Processes (POMDPs). The framework decomposes autonomous decision making into information, beliefs, forecasts, actions, and utility, allowing each component to be validated independently. Large language models (LLMs) are formalized as approximate Bayesian filtering operators, and a model-risk taxonomy is developed encompassing state-space, filtering, forecast, policy, utility-specification, and parameter risks. The model risk validation methodology is demonstrated through a portfolio-management case study in which an agent infers latent market regimes from market and macroeconomic information, generates belief-conditioned forecasts, and constructs portfolios using a Black-Litterman framework. Empirical validation combines performance analysis, belief calibration diagnostics, coverage tests, ablation studies, and parameter-sensitivity analysis. The results indicate that latent-state inference contributes independently to decision quality and that the principal conclusions remain robust across a broad range of parameter values. The principal contribution of the paper is a practical framework for extending established model risk management concepts to autonomous AI systems and providing a rigorous foundation for their validation, governance, and monitoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The POMDP framework organizes agentic AI validation into separate components and adds a model-risk taxonomy, but the LLM-as-Bayesian-filter step lacks the required mapping.

The paper's core move is to decompose agent validation into information, beliefs, forecasts, policies, and utility using POMDPs, then build a taxonomy of risks for each piece. It illustrates this on a portfolio agent that infers latent market regimes from data and applies Black-Litterman portfolios.

The useful part is the explicit separation of validation tasks and the inclusion of belief calibration diagnostics, coverage tests, ablations, and parameter sweeps in the case study. These checks go beyond return metrics and try to show that the latent-state step contributes on its own.

The soft spot is exactly the one flagged in the stress-test note. The paper states that LLMs act as approximate Bayesian filtering operators but supplies no derivation or verification that next-token probabilities correspond to a posterior over well-defined latent states. If the LLM output is pattern matching rather than an explicit belief update, the claimed independence between filtering risk and forecast risk does not follow, and component-wise validation loses its license. The portfolio results may still be informative as an example, but they do not rescue the general formalization.

This is for model-risk and governance teams working with autonomous systems in finance. It deserves peer review because the structure is practical and the empirical checks are more detailed than typical framework papers, even if the central modeling assumption needs tightening.

Referee Report

2 major / 1 minor

Summary. The paper proposes a POMDP-based model validation framework for agentic AI systems that decomposes autonomous decision processes into information acquisition, belief formation, forecasting, policy selection, and utility evaluation to enable component-wise validation. LLMs are formalized as approximate Bayesian filtering operators, a model-risk taxonomy (state-space, filtering, forecast, policy, utility-specification, and parameter risks) is introduced, and the approach is illustrated via a portfolio-management case study in which an agent infers latent market regimes, produces belief-conditioned forecasts, and constructs Black-Litterman portfolios. Empirical results from performance analysis, belief calibration, coverage tests, ablations, and sensitivity analysis are reported to show that latent-state inference contributes independently to decision quality and that conclusions are robust across parameter ranges.

Significance. If the LLM-as-Bayesian-filter formalization and the resulting POMDP decomposition can be rigorously established, the framework would supply a structured extension of existing model-risk-management practices to autonomous agents, supporting independent validation of each decision component and a taxonomy for ongoing governance and monitoring. The portfolio case study provides an initial demonstration that such decomposition can be operationalized and tested empirically.

major comments (2)

[Abstract] Abstract (and the central framework claim): the formalization of LLMs as approximate Bayesian filtering operators is asserted without an explicit observation model, likelihood function, or derivation showing that the next-token predictive distribution equals or approximates the posterior p(s_t | o_{1:t}). This mapping is load-bearing for the POMDP tuple (information, belief, forecast, policy, utility) and for the claim that filtering risk can be validated separately from forecast risk; absent the mapping, the component-wise validation license does not follow.
[Case study] Portfolio case study section: the latent market regime is inferred from market and macro data, yet the manuscript must demonstrate that the LLM's output is performing an explicit belief update rather than pattern-matched forecasting; if the latter, the claimed separation between filtering risk and forecast risk collapses and the independent-validation results no longer license the POMDP decomposition.

minor comments (1)

[Abstract] The abstract states that 'empirical validation combines performance analysis, belief calibration diagnostics, coverage tests, ablation studies, and parameter-sensitivity analysis' but does not indicate which specific diagnostics (e.g., PIT histograms, reliability diagrams, or proper scoring rules) are used for belief calibration; this should be clarified for reproducibility.
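To make the diagnostics named in this comment concrete, here is a minimal sketch of two of them for a binary regime indicator: the Brier score (a proper scoring rule) and the binned table behind a reliability diagram. Which diagnostics the paper actually uses is not stated in the abstract, so this is illustrative only. The function names and the sample forecasts are hypothetical.

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

def reliability_table(probs, outcomes, n_bins=5):
    """Rows of (mean forecast, observed frequency, count) for each non-empty bin."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use interior edges only, so a probability of exactly 1.0 falls in the last bin.
    bin_ids = np.digitize(probs, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rows.append((float(probs[mask].mean()),
                         float(outcomes[mask].mean()),
                         int(mask.sum())))
    return rows

# Hypothetical regime beliefs and realized regime indicators.
beliefs = [0.1, 0.1, 0.9, 0.9]
realized = [0, 0, 1, 1]

score = brier_score(beliefs, realized)
table = reliability_table(beliefs, realized, n_bins=5)
```

A well-calibrated belief sequence has observed frequencies close to the mean forecast in every bin. The Brier score condenses calibration and sharpness into a single number, so reporting both gives a fuller picture than either alone.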

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight important points regarding the rigor of the LLM formalization and the empirical separation of filtering from forecasting. We address each major comment below and commit to revisions that strengthen these aspects without altering the core claims of the manuscript.

Referee: [Abstract] Abstract (and the central framework claim): the formalization of LLMs as approximate Bayesian filtering operators is asserted without an explicit observation model, likelihood function, or derivation showing that the next-token predictive distribution equals or approximates the posterior p(s_t | o_{1:t}). This mapping is load-bearing for the POMDP tuple (information, belief, forecast, policy, utility) and for the claim that filtering risk can be validated separately from forecast risk; absent the mapping, the component-wise validation license does not follow.

Authors: We agree that an explicit derivation is needed to support the separation of filtering risk. The revised manuscript will add a dedicated subsection (new Section 2.3) that defines an implicit observation model for token-level predictions, specifies the likelihood as the next-token distribution conditioned on state, and derives the approximation to the posterior update p(s_t | o_{1:t}) via the standard Bayesian filtering recursion. This will make the POMDP decomposition and independent validation claims fully rigorous.
Revision: yes

Referee: [Case study] Portfolio case study section: the latent market regime is inferred from market and macro data, yet the manuscript must demonstrate that the LLM's output is performing an explicit belief update rather than pattern-matched forecasting; if the latter, the claimed separation between filtering risk and forecast risk collapses and the independent-validation results no longer license the POMDP decomposition.

Authors: The existing ablation studies and belief-calibration diagnostics already indicate that latent-state inference contributes independently to out-of-sample performance. To directly address the pattern-matching concern, the revised case-study section will add a controlled synthetic-data experiment in which the true posterior is known; we will compare the LLM's sequential outputs against an exact Bayesian filter and report the resulting divergence metrics. This will provide explicit evidence that the mechanism is an approximate update rather than pure pattern matching.
Revision: yes
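The synthetic-data check proposed in this response can be sketched for a small discrete hidden Markov model: compute the exact filtered posterior, then measure how far a candidate belief sequence sits from it. This is a sketch under stated assumptions, not the authors' experiment. The transition matrix `T`, emission matrix `E`, observation sequence, and the `candidate` beliefs (a stand-in for parsed LLM outputs) are all hypothetical, and mean KL divergence is one possible choice of divergence metric.

```python
import numpy as np

def exact_filter(T, E, prior, observations):
    """Forward filter for a discrete HMM; row t holds p(s_t | o_{1:t}).

    T[i, j] = p(s_t = j | s_{t-1} = i), E[j, k] = p(o_t = k | s_t = j).
    `prior` is the distribution of the state before the first observation.
    """
    T, E = np.asarray(T, float), np.asarray(E, float)
    belief = np.asarray(prior, float)
    beliefs = []
    for o in observations:
        predicted = belief @ T             # predict step
        unnorm = predicted * E[:, o]       # weight by observation likelihood
        belief = unnorm / unnorm.sum()     # normalize to a posterior
        beliefs.append(belief)
    return np.array(beliefs)

def mean_kl(reference, candidate, eps=1e-12):
    """Average KL(reference || candidate) over time steps."""
    ref = np.clip(reference, eps, 1.0)
    cand = np.clip(candidate, eps, 1.0)
    return float(np.mean(np.sum(ref * (np.log(ref) - np.log(cand)), axis=1)))

# Hypothetical two-regime model and observation sequence.
T = [[0.9, 0.1], [0.2, 0.8]]
E = [[0.8, 0.2], [0.3, 0.7]]
observations = [0, 0, 1, 1, 1]

exact = exact_filter(T, E, prior=[0.5, 0.5], observations=observations)

# Stand-in for LLM-reported beliefs: the exact posterior shrunk toward uniform.
candidate = 0.5 * exact + 0.5 * np.full_like(exact, 0.5)

divergence = mean_kl(exact, candidate)
```

A divergence near zero would indicate that the candidate tracks the exact posterior. A large divergence that persists as the observation sequence grows would be harder to reconcile with the approximate-filter interpretation.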

Circularity Check

0 steps flagged

No circularity: framework applies standard POMDP decomposition to new domain via explicit modeling choice

The paper proposes a validation framework by decomposing agentic decision processes into information, beliefs, forecasts, actions, and utility components using the established POMDP tuple, then formalizes LLMs as approximate Bayesian filtering operators as a direct modeling assumption. No derivation chain reduces a claimed result to a fitted parameter or self-citation by construction; the case study employs independent empirical checks (performance analysis, calibration diagnostics, coverage tests, ablation studies) whose validity does not presuppose the framework outputs. The central contribution is an application of existing concepts rather than a self-referential prediction or renamed input.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based solely on abstract; limited visibility into parameters or assumptions beyond the stated formalization of LLMs.

axioms (1)

Domain assumption: Large language models (LLMs) can be formalized as approximate Bayesian filtering operators.
This formalization is invoked in the abstract to connect LLMs to the POMDP belief-state component.

pith-pipeline@v0.9.1-grok · 5803 in / 1157 out tokens · 30316 ms · 2026-06-26T22:07:13.055155+00:00 · methodology

