ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering
5 Pith papers cite this work.
citing papers explorer
- Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain
  LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
- FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
  FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
- Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering
  DualGraph combines semantic textual KGs with symbolic KGs for semi-structured QA and introduces the SpecsQA benchmark, outperforming baselines on both open and specification questions.
- OCC-RAG: Optimal Cognitive Core for Faithful Question Answering
  OCC-RAG develops task-specialized SLMs (0.6B and 1.7B) via a new synthetic data pipeline for multi-hop reasoning and context faithfulness, claiming to match or exceed 2-6x larger general models on HotpotQA, MuSiQue, TAT-QA, ConFiQA, and MuSiQue-Un.