PRISM: Prompt-Refined In-Context System Modelling for Financial Retrieval

Chun Chet Ng; Jia Yu Lim; Wei Zeng Low

arXiv: 2511.14130 · v2 · submitted 2025-11-18 · 💻 cs.AI · cs.CE · cs.CL · cs.IR

Pith reviewed 2026-05-17 21:29 UTC · model grok-4.3

classification 💻 cs.AI · cs.CE · cs.CL · cs.IR

keywords financial information retrieval · prompt engineering · in-context learning · multi-agent coordination · training-free methods · document ranking · LLM applications · ablation studies

The pith

A training-free framework combining refined prompts and selective in-context learning ranks third on a financial retrieval benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PRISM to handle document and chunk ranking in lengthy financial filings using only refined system prompts, selective in-context learning, and lightweight multi-agent coordination, all without any model training. Through systematic ablations on three benchmarks, it maps out when each piece adds value: basic prompt work gives steady gains at low cost, in-context examples help mainly on tricky queries, and agent coordination matters most with bigger models. The results show that simpler setups frequently beat elaborate multi-agent pipelines, which matters for finance teams that need reliable extraction from reports without heavy compute or fine-tuning expenses. The strongest result reaches an NDCG@5 of 0.71818 on FinAgentBench as the only training-free entry in the top three, backed by latency and cost checks to aid real deployment.

Core claim

PRISM is a training-free framework that integrates refined system prompting, in-context learning, and lightweight multi-agent coordination for document and chunk ranking tasks in financial information retrieval. Extensive ablation studies across FinAgentBench, FiQA-2018, and FinanceBench reveal that prompt engineering delivers consistent performance with minimal overhead, ICL enhances reasoning for complex queries when applied selectively, and multi-agent systems show potential primarily with larger models and careful architectural design. Simpler configurations often outperform complex multi-agent pipelines, and the best configuration achieves an NDCG@5 of 0.71818 on FinAgentBench, ranking third while being the only training-free approach in the top three.
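
For readers unfamiliar with the headline metric, the following is a minimal sketch of how NDCG@5 is commonly computed for a single query. It assumes linear gain and a log2 rank discount, and it assumes the ranked list contains every judged candidate for the query; the exact gain formulation used by the FinAgentBench scorer is not stated here, so treat the function names and conventions as illustrative rather than as the benchmark's implementation.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 5) -> float:
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ordering.

    Assumption: ranked_relevances holds the relevance label of every judged
    candidate, in the order the system ranked them.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical relevance labels for one query, in system-ranked order.
print(ndcg_at_k([0, 1, 0, 1, 0], k=5))
```

A benchmark-level score such as the one reported above would typically be the mean of this per-query value over all queries.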

What carries the argument

The ablation studies that isolate the separate contributions of refined system prompting, selective in-context learning, and lightweight multi-agent coordination to ranking performance.

If this is right

Prompt engineering supplies consistent performance gains with very low added overhead.
Selective in-context learning improves handling of complex financial queries but is not always required.
Multi-agent coordination provides benefits mainly when used with larger models and deliberate architecture choices.
Simpler configurations frequently deliver higher effectiveness than full multi-agent pipelines.
Latency, token, and cost measurements support concrete decisions about when to deploy the approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Finance teams might achieve better returns by focusing effort on prompt design rather than building elaborate agent systems for similar retrieval needs.
The pattern favoring simplicity could extend to other fields that analyze long technical or regulatory documents.
Re-running the same ablations with different base models would test whether the preference for simpler setups remains stable.

Load-bearing premise

The three benchmarks and the ablation design isolate the value of each component without confounding effects from model size, prompt wording details, or dataset-specific artifacts.

What would settle it

A new experiment on FinAgentBench or a comparable financial benchmark in which a complex multi-agent pipeline clearly outperforms the simpler prompt-plus-selective-ICL configuration would undermine the claim that simpler setups are often preferable.

Figures

Figures reproduced from arXiv: 2511.14130 by Chun Chet Ng, Jia Yu Lim, Wei Zeng Low.

**Figure 2.** Relevant chunks are generally longer and more information-dense than irrelevant ones. The wider interquartile ranges and higher maximum values, along with the long-tail distribution, indicate substantial variability in chunk length. This suggests that a dynamic retrieval and ranking method is needed to handle both typical and unusually long chunks to avoid truncation and processing inefficiencies. Mor…

**Figure 4.** Token distribution bar charts. … prompts and larger models. All runs exhibit low variability, with standard deviation s < 0.011 and coefficient of variation CV < 1.6%, indicating stable and reproducible outcomes. The narrow 95% confidence intervals (CI) confirm the statistical reliability of the mean performance estimates. Run 12 recorded the lowest mean score, while Runs 15–19 achieved consistently hig…

**Figure 5.** Word count distribution of chunks. We conducted a frequency analysis on chunk word counts to complement the token count analysis and validate our observations. As shown in …

**Figure 6.** Top 10 teams ranked by private subset scores.

Original abstract

With the rapid progress of large language models (LLMs), financial information retrieval has become a critical industrial application. Extracting task-relevant information from lengthy financial filings is essential for both operational and analytical decision-making. We present PRISM, a training-free framework that integrates refined system prompting, in-context learning (ICL), and lightweight multi-agent coordination for document and chunk ranking tasks. Our primary contribution is a systematic empirical study of when each component provides value: prompt engineering delivers consistent performance with minimal overhead, ICL enhances reasoning for complex queries when applied selectively, and multi-agent systems show potential primarily with larger models and careful architectural design. Extensive ablation studies across FinAgentBench, FiQA-2018, and FinanceBench reveal that simpler configurations often outperform complex multi-agent pipelines, providing practical guidance for practitioners. Our best configuration achieves an NDCG@5 of 0.71818 on FinAgentBench, ranking third while being the only training-free approach in the top three. We provide comprehensive feasibility analyses covering latency, token usage, and cost trade-offs to support deployment decisions. The source code is released at https://bit.ly/prism-ailens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PRISM gives practical ablation results on prompting for financial retrieval but the component gains may not be cleanly isolated.

This paper gives a clear empirical look at applying prompt engineering, in-context learning, and multi-agent coordination to financial document retrieval. The key takeaway is that their best setup hits 0.718 NDCG@5 on FinAgentBench without training, landing third overall but first among training-free methods.

What stands out is the systematic ablations across FinAgentBench, FiQA-2018, and FinanceBench. They show prompt engineering gives steady gains with low cost, ICL helps on harder queries, and agents add value mainly with larger models. The latency, token, and cost analyses are practical and directly useful for deployment decisions. Releasing the code is good practice and lets others check the details.

The soft spots center on the ablation controls. The reported improvements could be sensitive to the specific prompt templates and in-context example selections. On financial benchmarks with specialized terminology, small wording changes often shift rankings by similar margins to the ones shown. The paper does not test multiple prompt variants or randomized example sets when isolating each component. This leaves open the possibility that the value attributed to ICL or agents is partly an artifact of the fixed choices used in the experiments.

The work is aimed at practitioners building retrieval systems in finance who need training-free options and guidance on adding complexity. It supplies actionable comparisons rather than new theory. A serious referee would be appropriate because the metrics are reported, code is available, and the questions are relevant to an industrial setting. I recommend sending it for review, with feedback focused on adding prompt sensitivity checks to strengthen the claims.
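
The prompt sensitivity check recommended in the letter could look roughly like the sketch below. It is not taken from the paper or its released code: `sensitivity_sweep`, `toy_evaluate`, and the prompt and example strings are hypothetical names, and `evaluate` stands in for whatever routine scores a (prompt, ICL examples) pair on the benchmark. The idea is to redraw the in-context examples several times per prompt variant and report the spread alongside the mean.

```python
import random
import statistics
from typing import Callable, Dict, Sequence, Tuple


def sensitivity_sweep(
    prompts: Sequence[str],
    example_pool: Sequence[str],
    evaluate: Callable[[str, Sequence[str]], float],
    n_examples: int = 3,
    n_draws: int = 5,
    seed: int = 0,
) -> Dict[str, Tuple[float, float]]:
    """Return (mean, population std) of the score for each prompt variant
    over n_draws random ICL example sets drawn without replacement."""
    rng = random.Random(seed)  # fixed seed so the sweep is repeatable
    results: Dict[str, Tuple[float, float]] = {}
    for prompt in prompts:
        scores = []
        for _ in range(n_draws):
            examples = rng.sample(list(example_pool), n_examples)
            scores.append(evaluate(prompt, examples))
        results[prompt] = (statistics.fmean(scores), statistics.pstdev(scores))
    return results


def toy_evaluate(prompt: str, examples: Sequence[str]) -> float:
    """Hypothetical stand-in scorer; a real one would run the ranking pipeline."""
    return 0.5 + 0.01 * sum(len(e) for e in examples) / max(len(prompt), 1)


pool = ["ex-a", "ex-bb", "ex-ccc", "ex-dddd", "ex-eeeee"]
for name, (mean, std) in sensitivity_sweep(["prompt v1", "prompt v2 longer"], pool, toy_evaluate).items():
    print(f"{name}: mean={mean:.4f} std={std:.4f}")
```

If the spread across draws is comparable to the gap between ablation arms, the attribution of gains to ICL or agents would be hard to defend.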

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PRISM, a training-free framework for financial information retrieval that integrates refined system prompting, in-context learning (ICL), and lightweight multi-agent coordination. Through systematic ablations on FinAgentBench, FiQA-2018, and FinanceBench, it examines the value of each component, concluding that prompt engineering delivers consistent gains with minimal overhead, ICL enhances reasoning for complex queries when applied selectively, and multi-agent coordination shows potential primarily with larger models. The best configuration achieves an NDCG@5 of 0.71818 on FinAgentBench, ranking third overall while being the only training-free method in the top three. The paper also provides feasibility analyses on latency, token usage, and costs, and releases the source code.

Significance. If the empirical results hold under more rigorous isolation of components, this work supplies practical guidance for practitioners on configuring LLMs for financial document and chunk ranking without training. The observation that simpler configurations often outperform complex multi-agent pipelines is a useful takeaway. Public code release and explicit cost/latency trade-off analyses strengthen the contribution by supporting reproducibility and deployment decisions. The significance is limited by the need to confirm that ablation deltas are attributable to the intended factors rather than prompt or example artifacts.

major comments (2)

§4.2 (Ablation Studies): The design claims to isolate the contributions of prompt refinement, ICL, and agents, yet the description does not indicate whether the refined system prompt and ICL example set were held fixed when toggling ICL or agent configurations. Because the framework depends on prompt wording and example selection, unexamined interactions could produce deltas comparable to the reported gains, directly affecting the central claim that the study reveals when each component provides value on financial queries.
Table 1 (Main Results): The ranking claim for the best configuration (NDCG@5 = 0.71818, third place) relies on single-point estimates across methods. Without reported standard deviations, multiple random seeds, or statistical tests, it is difficult to establish that the training-free result reliably outperforms or matches other approaches on FinAgentBench.
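
As an illustration of the reporting the second comment asks for, the sketch below summarizes repeated runs of one configuration with a mean, sample standard deviation, coefficient of variation, and an approximate 95% confidence interval. The scores are hypothetical placeholders, not values from the paper, and the interval uses a normal approximation; with only a handful of seeds a Student-t interval would be wider and more appropriate.

```python
import math
import statistics
from typing import Dict, Sequence


def summarize_runs(scores: Sequence[float], confidence: float = 0.95) -> Dict[str, float]:
    """Mean, sample std, CV (%), and a normal-approximation confidence interval."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * std / math.sqrt(len(scores))
    return {
        "mean": mean,
        "std": std,
        "cv_percent": 100 * std / mean if mean else math.nan,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
    }


# Hypothetical NDCG@5 scores from three seeds of the same configuration.
print(summarize_runs([0.70, 0.71, 0.72]))
```

Two configurations whose intervals overlap heavily cannot be ranked with confidence from single runs, which is the concern raised about the third-place claim.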

minor comments (2)

Abstract: The NDCG@5 value is given to five decimal places; a short note on evaluation variance or rounding convention would clarify whether this precision is meaningful.
§5 (Feasibility Analyses): The latency and cost discussions are helpful, but adding scaling behavior with respect to average financial filing length would aid practitioners in extrapolating to their own document collections.
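
The scaling analysis suggested in the last comment can be approximated with a simple token-based cost model, sketched below. Every number here is a hypothetical placeholder: the per-million-token prices, the prompt overhead, and the call counts are assumptions for illustration, not figures from the paper or from any provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD."""
    usd_per_million_input: float
    usd_per_million_output: float


def estimate_query_cost(
    filing_tokens: int,
    prompt_overhead_tokens: int,
    output_tokens: int,
    pricing: Pricing,
    calls_per_query: int = 1,
) -> float:
    """Estimated USD cost of one query, linear in filing length and call count."""
    input_tokens = (filing_tokens + prompt_overhead_tokens) * calls_per_query
    total_output = output_tokens * calls_per_query
    return (
        input_tokens / 1e6 * pricing.usd_per_million_input
        + total_output / 1e6 * pricing.usd_per_million_output
    )


placeholder = Pricing(usd_per_million_input=1.0, usd_per_million_output=4.0)
for filing_tokens in (20_000, 80_000, 320_000):
    single = estimate_query_cost(filing_tokens, 1_500, 300, placeholder)
    multi = estimate_query_cost(filing_tokens, 1_500, 300, placeholder, calls_per_query=3)
    print(f"{filing_tokens:>7} tokens: single-call ${single:.4f}, three-call ${multi:.4f}")
```

Under this model a multi-agent pipeline that re-reads the filing on every call multiplies cost by the call count, which is one way to frame the trade-off against simpler configurations.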

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each of the major comments in detail below, indicating where we will revise the manuscript to incorporate the feedback.

Referee: §4.2 (Ablation Studies): The design claims to isolate the contributions of prompt refinement, ICL, and agents, yet the description does not indicate whether the refined system prompt and ICL example set were held fixed when toggling ICL or agent configurations. Because the framework depends on prompt wording and example selection, unexamined interactions could produce deltas comparable to the reported gains, directly affecting the central claim that the study reveals when each component provides value on financial queries.

Authors: We thank the referee for highlighting this potential ambiguity in our ablation studies. We will revise the description in §4.2 to explicitly state that the refined system prompt and the ICL example set were held fixed when varying the ICL and agent configurations. This ensures that the reported deltas can be attributed to the toggled components rather than changes in prompts or examples. We believe this clarification will strengthen the central claim regarding the value of each component.
Revision: yes

Referee: Table 1 (Main Results): The ranking claim for the best configuration (NDCG@5 = 0.71818, third place) relies on single-point estimates across methods. Without reported standard deviations, multiple random seeds, or statistical tests, it is difficult to establish that the training-free result reliably outperforms or matches other approaches on FinAgentBench.

Authors: We acknowledge that multiple runs with reported variance would strengthen the reliability of the ranking claims. However, the high computational and financial costs of running LLMs on these benchmarks limited us to single-point estimates for each configuration. We will add a discussion of this limitation in the revised manuscript, including a note on the potential variability and the practical constraints. If space permits, we may include results from a limited number of additional seeds for the top configurations.
Revision: partial
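
The control described in the first response, holding the system prompt and ICL example set fixed while toggling components, can be made explicit in the run configuration. The sketch below is one hypothetical way to encode that; `RunConfig`, `ablation_grid`, and the identifier strings are illustrative names and do not come from the paper's code.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class RunConfig:
    system_prompt_id: str     # fixed across the whole grid
    icl_example_set_id: str   # fixed across the whole grid
    use_icl: bool             # toggled
    use_agents: bool          # toggled


def ablation_grid(system_prompt_id: str, icl_example_set_id: str) -> List[RunConfig]:
    """All four on/off combinations of ICL and agents with everything else pinned."""
    return [
        RunConfig(system_prompt_id, icl_example_set_id, use_icl, use_agents)
        for use_icl, use_agents in product([False, True], repeat=2)
    ]


for cfg in ablation_grid("refined-prompt-v1", "icl-set-A"):
    print(cfg)
```

Logging these identifiers with each result makes it possible to verify after the fact that any two compared runs differ only in the toggled component.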

Circularity Check

0 steps flagged

No significant circularity in empirical reporting

The paper presents a training-free empirical framework evaluated via direct benchmark measurements (NDCG@5 on FinAgentBench, FiQA-2018, FinanceBench) and ablation comparisons of prompt, ICL, and agent configurations. No mathematical derivations, equations, fitted parameters, or predictions appear that could reduce to inputs by construction. Claims rest on observed performance deltas rather than self-referential definitions or self-citation chains. This is self-contained empirical reporting against external benchmarks, consistent with a normal non-circular outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that current LLMs possess sufficient zero-shot and few-shot reasoning capacity for retrieval when given well-engineered prompts; no free parameters or new entities are introduced.

axioms (1)

domain assumption: Current LLMs can perform document and chunk ranking on financial text with appropriate prompting and selective in-context examples without any parameter updates.
Stated in the description of the training-free framework and the decision to avoid fine-tuning.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We propose PRISM, a training-free framework that integrates refined system prompting, in-context learning (ICL), and lightweight multi-agent coordination for document and chunk ranking tasks.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Systematic prompt engineering is then applied to construct reasoning-oriented prompts... ICL augmentation... multi-agent system modelling.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

[1] Data distributional properties drive emergent in-context learning in transformers. Preprint, arXiv:2205.05055. Yiqun Chen, Qi Liu, Yi Zhang, Weiwei Sun, Xinyu Ma, Wei Yang, Daiting Shi, Jiaxin Mao, and Dawei Yin

[2] Tourrank: Utilizing large language models for documents ranking with a tournament-inspired strategy. Preprint, arXiv:2406.11678. Chanyeol Choi, Jihoon Kwon, Jaeseon Ha, Hojun Choi, Chaewoon Kim, Yongjae Lee, Jy-yong Sohn, and Alejandro Lopez-Lira. 2025a. Finder: Financial dataset for question answering and evaluating retrieval-augmented generation. Chany...

[3] Meta-in-context learning in large language models. Preprint, arXiv:2305.12907. Gordon V. Cormack, Charles L A Clarke, and Stefan Buettcher. 2009. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In SIGIR ’09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval,...

[4] GPT-4o-mini: gpt-4o-mini-2024-07-18

[5] GPT-4.1: gpt-4.1-2025-04-14

[6] GPT-5-mini: gpt-5-mini-2025-08-07

[7] GPT-5: gpt-5-2025-08-07. The retrieval pipeline was implemented using a FAISS vector store with two OpenAI embedding backbones: text-embedding-3-small v1 (TE3-S) and text-embedding-3-large (TE3-L). Multi-agent workflows were constructed with LangGraph (v1.0.3), and all models were accessed through the OpenAI Python SDK (v2.3.0). A.3.1 Model Provider...
