Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation
Pith reviewed 2026-05-21 12:13 UTC · model grok-4.3
The pith
Conv-FinRe supplies multi-view references that separate what investors actually choose from what aligns with their own long-term risk preferences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Conv-FinRe is a conversational and longitudinal benchmark that evaluates LLMs on stock recommendation by providing multi-view references distinguishing descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether models follow rational analysis, mimic user noise, or are driven by market momentum.
What carries the argument
Multi-view references constructed from real market data and human decision trajectories that separately score descriptive user choices and normative utility based on investor risk preferences.
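To illustrate what scoring one ranking against two references can look like, here is a minimal sketch. It assumes NDCG@k as the ranking metric and uses made-up tickers (`AAA`, `BBB`, `CCC`) and gain values; neither the data format nor the function name `ndcg_at_k` is taken from the paper.

```python
from math import log2

def ndcg_at_k(ranking, relevance, k):
    """NDCG@k of a ranked list against graded relevance (dict: item -> gain)."""
    # Discounted gain of the items the model actually placed in the top k.
    dcg = sum(relevance.get(item, 0.0) / log2(pos + 2)
              for pos, item in enumerate(ranking[:k]))
    # Best achievable discounted gain for this reference.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / log2(pos + 2) for pos, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical model output and two hypothetical reference views.
model_ranking = ["AAA", "BBB", "CCC"]
normative_view = {"AAA": 3.0, "BBB": 2.0, "CCC": 1.0}  # utility-based gains
descriptive_view = {"CCC": 1.0}                        # the user picked CCC

scores = {
    "normative": ndcg_at_k(model_ranking, normative_view, k=3),
    "descriptive": ndcg_at_k(model_ranking, descriptive_view, k=3),
}
```

In this toy case the ranking agrees fully with the normative view while placing the user's actual pick last, which is the kind of divergence a multi-view reference is meant to expose.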
If this is right
- Models that achieve high utility-based rankings can be identified even when they diverge from recorded user selections.
- Behaviorally aligned models can be flagged when they reproduce short-term noise rather than stable preferences.
- Advisory systems can be tuned toward rational decision quality without requiring perfect imitation of every user action.
- Longitudinal evaluation becomes possible because the benchmark supplies context across multiple market steps and a fixed horizon.
Where Pith is reading between the lines
- The same multi-view reference approach could be applied to recommendation domains outside finance where short-term user actions conflict with stated long-term goals.
- Benchmark results could guide the design of hybrid systems that first elicit risk preferences and then generate recommendations conditioned on those preferences rather than raw history.
- Public release of the dataset allows repeated testing of whether newer models close the observed gap between utility alignment and behavioral imitation.
Load-bearing premise
The references built from market data and human trajectories cleanly isolate normative utility from observed behavior without major construction artifacts or selection biases.
What would settle it
A direct comparison in which one set of model outputs is scored only against the normative utility references and another only against the descriptive behavior references, then checked for whether the two rankings produce measurably different portfolio outcomes over the stated investment horizon.
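One way to make the outcome check concrete is sketched below, under stated assumptions: the tickers, per-step returns, and the equal-weight top-k rule are all hypothetical and not drawn from the paper. The sketch compounds the return of the top-k stocks from two rankings over the same horizon so the outcomes can be compared.

```python
from math import prod

def topk_cumulative_return(ranking, step_returns, k=2):
    """Cumulative return of an equal-weight portfolio of the top-k ranked stocks.

    ranking: list of tickers, best first (hypothetical model output).
    step_returns: list of dicts mapping ticker -> simple return at each step.
    """
    picks = ranking[:k]
    # Equal-weight portfolio return at each step, rebalanced every step.
    per_step = [sum(r[t] for t in picks) / len(picks) for r in step_returns]
    # Compound the step returns over the whole horizon.
    return prod(1.0 + x for x in per_step) - 1.0

# Hypothetical data: three stocks, two market steps.
returns = [
    {"AAA": 0.10, "BBB": 0.00, "CCC": -0.10},
    {"AAA": 0.00, "BBB": 0.10, "CCC": -0.10},
]
utility_ranking = ["AAA", "BBB", "CCC"]   # favored by a normative reference
behavior_ranking = ["CCC", "BBB", "AAA"]  # favored by a descriptive reference

utility_outcome = topk_cumulative_return(utility_ranking, returns)
behavior_outcome = topk_cumulative_return(behavior_ranking, returns)
outcome_gap = utility_outcome - behavior_outcome
```

A real test would replace the toy returns with the benchmark's market data and would need enough users and steps to tell a genuine gap from noise.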
Original abstract
Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user's long-term goals. Treating what users chose as the sole ground truth, therefore, conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching. Given an onboarding interview, step-wise market context, and advisory dialogues, models must generate rankings over a fixed investment horizon. Crucially, Conv-FinRe provides multi-view references that distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. We build the benchmark from real market data and human decision trajectories, instantiate controlled advisory conversations, and evaluate a suite of state-of-the-art LLMs. Results reveal a persistent tension between rational decision quality and behavioral alignment: models that perform well on utility-based ranking often fail to match user choices, whereas behaviorally aligned models can overfit short-term noise. The dataset is publicly released on Hugging Face, and the codebase is available on GitHub.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that moves beyond behavior imitation. It constructs the dataset from real market data and human decision trajectories, instantiates advisory dialogues, and supplies multi-view references: one capturing descriptive user choices and another grounding normative utility in investor-specific risk preferences. Evaluation of state-of-the-art LLMs reveals a persistent tension in which utility-aligned models often diverge from observed user actions while behaviorally aligned models overfit short-term noise. The dataset and codebase are released publicly.
Significance. If the multi-view references prove robust, the benchmark would provide a valuable tool for diagnosing whether LLMs perform rational analysis, mimic user noise, or follow market momentum in financial settings. This addresses a clear gap in recommendation evaluation by prioritizing long-term utility over pure behavioral matching. Explicit credit is due for the public release of the dataset on Hugging Face and the codebase on GitHub, which supports reproducibility.
major comments (2)
- [§3] Benchmark construction (likely §3): the paper must specify the exact procedure for inferring investor-specific risk preferences from human decision trajectories and demonstrate that this normative reference is constructed independently of the descriptive behavior trajectories. If the same trajectories are used for both, the claimed separation between normative utility and observed behavior risks circularity or selection bias, directly undermining the diagnostic power for rational analysis versus noise mimicry.
- [§5] Evaluation and results (likely §5): the reported tension between utility-based ranking and behavioral alignment lacks details on the precise metrics employed, any statistical significance testing, and controls for market volatility or cohort selection. Without these, it is unclear whether the observed performance gap is robust or an artifact of the reference construction.
minor comments (1)
- [§4] The abstract refers to 'step-wise market context' and 'advisory dialogues'; including one or two concrete example conversations in the main text would help readers understand the longitudinal and conversational structure.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments identify important areas where additional clarity and rigor will strengthen the manuscript. We address each major comment below and indicate the planned revisions.
- Referee: [§3] Benchmark construction (likely §3): the paper must specify the exact procedure for inferring investor-specific risk preferences from human decision trajectories and demonstrate that this normative reference is constructed independently of the descriptive behavior trajectories. If the same trajectories are used for both, the claimed separation between normative utility and observed behavior risks circularity or selection bias, directly undermining the diagnostic power for rational analysis versus noise mimicry.
Authors: We agree that explicit specification of the inference procedure and a clear demonstration of independence are necessary to substantiate the multi-view reference design. The current manuscript describes the high-level construction from real market data and human trajectories but does not provide the algorithmic details. In the revision we will add a dedicated subsection to §3 that (i) states the exact procedure: risk aversion parameters are estimated via maximum-likelihood fitting of a CRRA utility model to responses from a dedicated risk-elicitation questionnaire administered at onboarding, using only those survey answers; (ii) shows that the normative reference is computed solely from these elicited parameters and the subsequent market context, while the descriptive reference uses the actual longitudinal choice sequences; and (iii) includes a short validation that the two references are not mechanically identical (e.g., correlation between inferred risk aversion and raw choices is moderate and consistent with rational behavior rather than tautological). This separation uses disjoint data sources and will be illustrated with pseudocode.
Revision: yes
- Referee: [§5] Evaluation and results (likely §5): the reported tension between utility-based ranking and behavioral alignment lacks details on the precise metrics employed, any statistical significance testing, and controls for market volatility or cohort selection. Without these, it is unclear whether the observed performance gap is robust or an artifact of the reference construction.
Authors: We acknowledge that the evaluation section would benefit from greater methodological transparency and robustness checks. We will revise §5 to (i) define the metrics explicitly (Kendall's tau and NDCG for utility alignment; top-k accuracy and behavioral correlation for descriptive matching); (ii) report statistical significance via paired t-tests and Wilcoxon signed-rank tests on the per-model differences, with p-values and effect sizes; and (iii) add controlled analyses that stratify results by market-volatility regime (high vs. low VIX periods) and by participant cohort (novice vs. experienced investors). These new tables and figures will be included in the revised manuscript to demonstrate that the reported tension persists across these controls.
Revision: yes
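The first response above names maximum-likelihood fitting of a CRRA utility model to questionnaire answers without giving details. The sketch below shows one way such a fit could work, and it is not the authors' procedure: the safe-versus-lottery question format, the logit choice rule, the grid search, and every function name are assumptions made for illustration.

```python
from math import exp, log, log1p

def crra(w, gamma):
    """Normalized CRRA utility of wealth w > 0 with risk aversion gamma."""
    if abs(gamma - 1.0) < 1e-12:
        return log(w)  # limiting case gamma -> 1
    return (w ** (1.0 - gamma) - 1.0) / (1.0 - gamma)

def expected_utility(lottery, gamma):
    """lottery: list of (probability, payoff) pairs."""
    return sum(p * crra(x, gamma) for p, x in lottery)

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    return -log1p(exp(-z)) if z >= 0 else z - log1p(exp(z))

def log_likelihood(gamma, answers):
    """Assumed logit choice rule; answers: (safe, risky, chose_risky) triples."""
    total = 0.0
    for safe, risky, chose_risky in answers:
        diff = expected_utility(risky, gamma) - expected_utility(safe, gamma)
        total += log_sigmoid(diff if chose_risky else -diff)
    return total

def fit_gamma(answers, grid):
    """Grid-search maximum-likelihood estimate of gamma."""
    return max(grid, key=lambda g: log_likelihood(g, answers))

# Hypothetical questionnaire: a sure payoff of 2 versus three 50/50 lotteries.
SAFE = [(1.0, 2.0)]
QUESTIONS = [
    [(0.5, 1.0), (0.5, 4.0)],
    [(0.5, 1.0), (0.5, 6.0)],
    [(0.5, 1.0), (0.5, 3.2)],
]
GRID = [i * 0.25 for i in range(21)]  # gamma from 0.0 to 5.0

cautious = [(SAFE, risky, False) for risky in QUESTIONS]  # always takes the sure payoff
bold = [(SAFE, risky, True) for risky in QUESTIONS]       # always takes the lottery

gamma_cautious = fit_gamma(cautious, GRID)
gamma_bold = fit_gamma(bold, GRID)
```

Because the fit sees only questionnaire answers, the estimated parameter is independent of the later choice trajectories by construction, which is the separation the referee asks the authors to demonstrate.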
Circularity Check
No circularity: benchmark construction is externally grounded
The paper presents Conv-FinRe as a new dataset and evaluation protocol built directly from real market data and human decision trajectories. Multi-view references are defined from these external sources to separate normative utility (tied to investor risk preferences) from observed behavior, without any derivation, fitted parameters, or equations that reduce the normative baseline to the descriptive observations by construction. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps in the provided text; the contribution is the benchmark itself rather than a closed mathematical chain. This is self-contained against external data sources.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Real market data combined with recorded human decision trajectories can be used to instantiate controlled advisory conversations that reflect investor risk preferences.
invented entities (1)
- Conv-FinRe benchmark with multi-view references (no independent evidence)
Forward citations
Cited by 2 Pith papers
- Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models. RouteHead trains a lightweight router to dynamically select optimal LLM attention heads per query for improved attention-based document re-ranking.
- Sell Me This Stock: Unsafe Recommendation Drift in LLM Agents. LLM agents exhibit evaluation blindness in multi-turn financial advice, with stronger models showing up to 99.1% suitability violations when tool data is manipulated, as internal detection fails to produce safer outputs.