Meta-Benchmarks for Financial-Services LLM Evaluation

Blair Hudson

arxiv: 2607.01740 · v1 · pith:4JYHGMDGnew · submitted 2026-07-02 · 💻 cs.AI


Pith reviewed 2026-07-03 14:05 UTC · model grok-4.3

classification 💻 cs.AI

keywords meta-benchmarking · LLM evaluation · financial services · Elo ratings · work activities · banking domains · model ranking


The pith

A multiplicative weighting scheme on benchmarks scales Elo K-factors to produce comparable financial-services work-activity scores without raw-score normalisation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a meta-benchmarking framework that maps 452 public benchmarks onto 41 O*NET Generalized Work Activities and then aggregates those into 38 BIAN banking business domains. A weighting scheme multiplies discrimination, coverage, and recency values computed over a rolling model window; these weights adjust the K-factor inside a pairwise Elo tournament. The resulting work-activity scores are directly comparable across benchmarks, and business-domain scores are formed as weighted averages of the activity-level Elos. Standard global leaderboards average across all tasks and therefore fail to reflect the distinct demands of compliance reasoning, multi-turn customer handling, or risk assessment. If the framework works as described, financial institutions obtain task-specific model rankings that automatically down-weight saturated tests and remain reproducible from public data.
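The page does not say how the three factors are computed, so the following is only a minimal sketch under assumed definitions: discrimination as the score spread among the top-k models in the window, coverage as the share of window models reporting the benchmark, and recency as an exponential decay since the last reported result. The function name and the parameters `top_k`, `spread_ref`, and `half_life_days` are hypothetical and are not taken from the paper.

```python
def benchmark_weight(scores, window_size, days_since_last_result,
                     top_k=5, spread_ref=0.10, half_life_days=180.0):
    """Illustrative discrimination x coverage x recency weight.

    scores: scores in [0, 1] reported by models in the rolling window.
    All three factor definitions below are assumptions, not the paper's.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not scores:
        return 0.0
    top = sorted(scores, reverse=True)[:top_k]
    # Assumed discrimination: how far apart the best models still are.
    discrimination = min(1.0, (top[0] - top[-1]) / spread_ref)
    # Assumed coverage: share of window models that report this benchmark.
    coverage = len(scores) / window_size
    # Assumed recency: halves every `half_life_days` without a new result.
    recency = 0.5 ** (days_since_last_result / half_life_days)
    return discrimination * coverage * recency


# A saturated benchmark (top models bunched together) versus a live one.
saturated = benchmark_weight([0.99, 0.99, 0.98, 0.99, 0.99], 10, 0)
live = benchmark_weight([0.62, 0.55, 0.48, 0.40, 0.35, 0.30, 0.22, 0.20], 10, 30)
```

Because the factors multiply, a benchmark that scores near zero on any one of them is suppressed regardless of the other two, which is the behaviour the summary above attributes to the scheme.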

Core claim

The meta-benchmarking framework organises benchmarks into O*NET work activities and BIAN domains, applies a multiplicative discrimination-coverage-recency weight computed on a rolling window, and uses those weights to scale the K-factor of a pairwise Elo tournament, thereby generating cross-benchmark-comparable work-activity scores and derived business-domain scores without any raw-score normalisation step.

What carries the argument

The multiplicative weighting scheme (discrimination × coverage × recency) computed over a rolling model window that scales the K-factor inside the pairwise Elo tournament.
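A weight that scales the K-factor can be illustrated with the conventional Elo update. This is a sketch, not the paper's implementation: the base K of 32 and the 400-point logistic scale are the customary chess defaults and are assumptions here, as are the function names.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Conventional Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a, rating_b, outcome_a, weight, base_k=32.0):
    """One pairwise update with the K-factor scaled by a benchmark weight.

    outcome_a: 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    weight:    benchmark weight in [0, 1]; base_k and scale are assumed defaults.
    """
    k = base_k * weight
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


full = elo_update(1500.0, 1500.0, 1.0, weight=1.0)   # (1516.0, 1484.0)
half = elo_update(1500.0, 1500.0, 1.0, weight=0.5)   # (1508.0, 1492.0)
muted = elo_update(1500.0, 1500.0, 1.0, weight=0.0)  # (1500.0, 1500.0)
```

With a weight of zero the comparison leaves both ratings unchanged, so a benchmark that has lost all weight contributes nothing to the tournament.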

If this is right

Business-domain scores emerge directly as weighted averages of the constituent work-activity Elo ratings.
Saturated or obsolete benchmarks receive near-zero weight and drop out of the ranking automatically.
The same public snapshot of 288 models yields 41 activity-level and 38 domain-level scores that can be recomputed as new benchmark results appear.
Institutions can reproduce the full taxonomy and weighting procedure from the released methodology without access to private data.
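The first point above, domain scores as weighted averages of activity Elos, can be sketched as follows. The activity names and mapping weights are placeholders invented for illustration; the actual activity-to-domain weights are in the paper, not on this page.

```python
def domain_score(activity_elos, mapping_weights):
    """Weighted average of activity-level Elo ratings for one domain.

    activity_elos:   {activity: Elo rating} for a single model.
    mapping_weights: {activity: weight} for a single business domain.
    Activities without a rating are skipped (an assumption of this sketch).
    """
    pairs = [(mapping_weights[a], activity_elos[a])
             for a in mapping_weights if a in activity_elos]
    total = sum(w for w, _ in pairs)
    if total == 0:
        raise ValueError("no weighted activities with an Elo rating")
    return sum(w * r for w, r in pairs) / total


# Hypothetical model with two rated activities mapped 3:1 into one domain.
score = domain_score(
    {"activity_a": 1600.0, "activity_b": 1500.0},
    {"activity_a": 3.0, "activity_b": 1.0, "activity_c": 2.0},
)
```

Skipping unrated activities and renormalising is one design choice among several; treating a missing rating as a default starting Elo would give different domain scores for sparsely evaluated models.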

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same structure could be applied to other regulated industries by swapping the BIAN taxonomy for an equivalent domain map.
Over time the rolling window may naturally surface new benchmarks that better separate frontier models in compliance or customer-service tasks.
If the Elo scores prove stable under different K-scaling choices, the framework could serve as a governance tool for model procurement decisions.
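One way to check the stability mentioned in the last point is a Spearman rank correlation between the Elo scores produced under two K-scaling choices. The sketch below uses only the standard library; the ratings in the example are made-up numbers, not results from the paper.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with ties given the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Made-up Elo scores for four models under two hypothetical K-scaling schemes.
scheme_a = [1500.0, 1520.0, 1490.0, 1600.0]
scheme_b = [1510.0, 1505.0, 1480.0, 1590.0]
rho = spearman_rho(scheme_a, scheme_b)  # 0.8: one adjacent pair swaps places
```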

Load-bearing premise

The O*NET Generalized Work Activities and BIAN banking domains correctly capture the cognitive demands of financial-services work, and the chosen weighting scheme ranks benchmarks without introducing selection bias or circularity into the Elo scores.

What would settle it

A controlled comparison showing that models ranked highest by the framework perform no better than lower-ranked models when tested on real, blinded financial-services tasks drawn from the same domains.

Figures

Figures reproduced from arXiv: 2607.01740 by Blair Hudson.

**Figure 1.** The evaluation pyramid. Reading bottom-up, 288+ models are scored on 452 public benchmarks, which are mapped to 41 O*NET Generalized Work Activities, aggregated into 38 BIAN business domains, and grouped under five BIAN Business Areas.

**Figure 2.** The four-stage pipeline: benchmarks are col…

**Figure 4.** Number of benchmark identifiers assigned…

**Figure 3.** Model releases per quarter (2022–2026), split…

**Figure 5.** Discrimination heat map for selected coding…

**Figure 8.** Distribution of model Elo scores per task (top…

**Figure 7.** Best-observed Elo score progression for the…

**Figure 9.** Work-activity to business-domain mapping.

**Figure 10.** Left: Spearman ρ between four K-factor weighting schemes, averaged across four BIAN business domains. All pairs exceed ρ = 0.90. Right: IT Management Elo scores for the top-12 models under three representative schemes. Rankings are broadly consistent; the full formula makes modest adjustments for recent evaluation coverage.

**Figure 11.** Global composite rank vs per-domain rank…

**Figure 12.** Model evidence density per BIAN business…

**Figure 13.** Top-12 models ranked by business-domain Elo across four BIAN business domains. Amber bars indicate proprietary models; green bars indicate open-weight models. Rankings vary substantially across domains, and open-weight models are competitive or leading on several domains, motivating domain-specific rather than global candidate screening.

Original abstract

Public LLM leaderboards optimise for global average performance and do not capture the specific cognitive demands of financial-services work: a model that leads on MMLU-Pro may underperform on document-grounded compliance reasoning, and a coding leader may handle multi-turn customer interactions poorly. We present a meta-benchmarking framework that organises 452 publicly reported benchmarks into 41 O*NET Generalized Work Activities and aggregates those into 38 BIAN banking business domains spanning sales, operations, risk, and support work. A multiplicative weighting scheme (discrimination x coverage x recency), computed over a rolling model window, rewards benchmarks that still separate the best models, are widely reported, and remain in active use, suppressing saturated legacy tests automatically. These weights scale the K-factor in a pairwise Elo tournament, producing cross-benchmark-comparable work-activity scores without raw score normalisation; business-domain scores are weighted averages of the constituent work-activity Elos. We demonstrate the framework on a point-in-time public snapshot covering 288 models across 25 organisations as of June 2026, and describe the methodology, full taxonomy, design decisions, and limitations with the aim of making the approach reproducible for institutions facing similar selection and governance challenges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a workable taxonomy-driven meta-benchmark for financial LLMs, but the no-normalization Elo claim rests on unstated assumptions about how pairwise outcomes are derived.


The main contribution here is a framework that maps 452 public benchmarks to 41 O*NET work activities, rolls those up into 38 BIAN banking domains, and applies a multiplicative weight (discrimination times coverage times recency) over a rolling window to scale the K-factor in an Elo tournament. That produces per-activity scores that are meant to be comparable without touching the raw benchmark numbers.

The taxonomy mapping and the weighting rule are the genuinely new pieces. They give a concrete way to down-weight saturated tests and keep the rating focused on benchmarks that still separate top models. The description of design choices and the point-in-time snapshot on 288 models make the method look reproducible on paper.

The soft spot is the cross-benchmark comparability claim. The abstract never states how a raw accuracy or F1 from one benchmark is turned into a win, loss, or expected score against another benchmark that uses a completely different metric and scale. Without an explicit outcome function, the claim that K-scaling alone buys cross-benchmark comparability is not yet supported. The abstract mentions a demonstration but reports no error analysis, no correlation with real financial tasks, and no comparison against simpler aggregation methods.
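To make the missing piece concrete, here is one possible outcome function, offered as a sketch rather than as the paper's method. It compares two models on the same benchmark and returns a win, draw, or loss, so metrics on different scales are never compared with each other. The simulated rebuttal further down describes a logistic link and a rule treating scores within 1% as draws; this sketch swaps in a simpler win/draw/loss rule and reads "within 1%" as relative to the larger score, which is an assumption.

```python
def pairwise_outcome(score_a, score_b, higher_is_better=True,
                     draw_tolerance=0.01):
    """Outcome for model A against model B on a single benchmark.

    Returns 1.0 if A wins, 0.0 if B wins, 0.5 for a draw.
    draw_tolerance is relative to the larger absolute score (assumed reading).
    """
    if not higher_is_better:
        # Flip signs so that "larger is better" holds for error-style metrics.
        score_a, score_b = -score_a, -score_b
    diff = score_a - score_b
    reference = max(abs(score_a), abs(score_b))
    if reference == 0 or abs(diff) <= draw_tolerance * reference:
        return 0.5
    return 1.0 if diff > 0 else 0.0


clear_win = pairwise_outcome(0.80, 0.70)                           # 1.0
near_tie = pairwise_outcome(0.800, 0.795)                          # 0.5
loss_pct = pairwise_outcome(72.0, 85.0)                            # 0.0
error_rate = pairwise_outcome(0.10, 0.20, higher_is_better=False)  # 1.0
```

A rule like this discards the size of the margin, which is part of what a referee would want the authors to justify or replace with a margin-aware model.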

This is aimed at teams inside banks or regulators who need something more targeted than MMLU averages. A practitioner looking for a starting template for domain-specific eval will find the taxonomy and weighting logic useful even if they end up changing the details.

It deserves peer review. The gaps are clear and fixable with an outcome model and some validation checks; referees can push on those without the paper being incoherent on its own terms.

Referee Report

3 major / 2 minor

Summary. The paper introduces a meta-benchmarking framework that maps 452 public benchmarks onto 41 O*NET Generalized Work Activities, which are then aggregated into 38 BIAN banking business domains. A multiplicative weighting scheme (discrimination × coverage × recency) computed over a rolling model window is used to scale the K-factor in a pairwise Elo tournament; the resulting work-activity Elo ratings are asserted to be cross-benchmark comparable without any raw-score normalization, and business-domain scores are obtained as weighted averages of the constituent activity Elos. The framework is demonstrated on a June 2026 snapshot of 288 models from 25 organizations.

Significance. If the core technical claim holds, the work would supply a reproducible, domain-targeted alternative to generic LLM leaderboards for financial-services institutions. The use of established taxonomies (O*NET, BIAN) and the explicit reproducibility goal are constructive. However, the absence of any validation against downstream financial-task performance or comparison to normalized baselines substantially reduces the immediate significance of the reported demonstration.

major comments (3)

[Abstract / Method] Abstract and method description: the central claim that scaling the K-factor by (discrimination × coverage × recency) produces cross-benchmark-comparable Elo scores without raw-score normalization presupposes an explicit outcome model that converts heterogeneous benchmark metrics into pairwise win/loss or expected-score values. No such model (e.g., Bradley-Terry, logistic on accuracy, margin-based, or tie-handling rule) is stated, so it is impossible to verify that the resulting ratings remain on a common scale when each work activity aggregates a different subset of the 452 benchmarks.
[Demonstration] Demonstration / Results: the point-in-time evaluation on 288 models supplies no error analysis, sensitivity checks on the weighting parameters, or correlation with any external measure of financial-services task performance. Without such evidence the assertion that the weighted Elo scores “better capture the cognitive demands of financial-services work” remains unsupported and is load-bearing for the paper’s applied claim.
[Taxonomy] Taxonomy construction: the mapping of benchmarks to O*NET activities and BIAN domains is foundational to the aggregation step, yet no inter-rater agreement statistics, coverage statistics per activity, or validation against expert financial-services judgments are reported. This directly affects whether the final domain scores can be interpreted as reflecting the intended work activities.

minor comments (2)

[Abstract] The date “June 2026” in the abstract appears to be a typographical error or forward reference; clarify the actual snapshot date.
[Method] Notation for the rolling-window computation of the three weighting factors and the precise formula for the scaled K-factor should be given explicitly (ideally as numbered equations) rather than described only in prose.
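For illustration only, numbered equations of the kind requested might look like the following. This assumes the conventional Elo logistic expectation with a 400-point scale; the symbols $d_b$, $c_b$, $r_b$ (discrimination, coverage, recency of benchmark $b$), $K_0$ (base K-factor), $S_{ij}^{(b)}$ (outcome of model $i$ against model $j$ on benchmark $b$), and $\omega_{ma}$ (weight of activity $a$ in domain $m$) are notation introduced here, not the paper's.

```latex
\begin{align}
w_b &= d_b \, c_b \, r_b, \\
K_b &= K_0 \, w_b, \\
E_{ij} &= \frac{1}{1 + 10^{(R_j - R_i)/400}}, \\
R_i' &= R_i + K_b \bigl( S_{ij}^{(b)} - E_{ij} \bigr), \\
D_m &= \frac{\sum_{a} \omega_{ma} \, R^{(a)}}{\sum_{a} \omega_{ma}}.
\end{align}
```

Here $R_i$ is model $i$'s current rating within one work activity, $R^{(a)}$ is a model's final rating for activity $a$, and $D_m$ is its score for domain $m$.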

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback on our meta-benchmarking framework. The comments identify key areas where additional methodological detail, quantitative checks, and limitation statements will improve the manuscript. We address each major comment below and indicate the planned revisions.


Referee: [Abstract / Method] Abstract and method description: the central claim that scaling the K-factor by (discrimination × coverage × recency) produces cross-benchmark-comparable Elo scores without raw-score normalization presupposes an explicit outcome model that converts heterogeneous benchmark metrics into pairwise win/loss or expected-score values. No such model (e.g., Bradley-Terry, logistic on accuracy, margin-based, or tie-handling rule) is stated, so it is impossible to verify that the resulting ratings remain on a common scale when each work activity aggregates a different subset of the 452 benchmarks.

Authors: We agree that the outcome model requires explicit statement. The full manuscript applies a logistic Bradley-Terry model in which each benchmark's reported metric is converted to an expected win probability for the Elo update; the scaled K-factor is then applied to the resulting pairwise comparison. However, this conversion step and the tie-handling rule (scores within 1% treated as draws) were described only at a high level. We will add a dedicated paragraph in the Methods section formalizing the logistic link function, the per-benchmark expected-score calculation, and the aggregation logic that preserves a common scale across heterogeneous metrics.

revision: yes

Referee: [Demonstration] Demonstration / Results: the point-in-time evaluation on 288 models supplies no error analysis, sensitivity checks on the weighting parameters, or correlation with any external measure of financial-services task performance. Without such evidence the assertion that the weighted Elo scores “better capture the cognitive demands of financial-services work” remains unsupported and is load-bearing for the paper’s applied claim.

Authors: We accept that the demonstration section lacks supporting quantitative checks. The June 2026 snapshot is intended to illustrate the framework rather than to validate downstream utility. We will insert bootstrap-derived standard errors on the activity-level Elo ratings and a sensitivity table showing score changes when each weighting component is varied by ±20%. Because no public benchmarks directly measure proprietary financial-services task performance, we will revise the claim language from “better capture” to “designed to reflect” and move external validation to the Limitations and Future Work section.

revision: partial

Referee: [Taxonomy] Taxonomy construction: the mapping of benchmarks to O*NET activities and BIAN domains is foundational to the aggregation step, yet no inter-rater agreement statistics, coverage statistics per activity, or validation against expert financial-services judgments are reported. This directly affects whether the final domain scores can be interpreted as reflecting the intended work activities.

Authors: Coverage counts (benchmarks per O*NET activity and BIAN domain) are tabulated in the supplementary materials but were not summarized in the main text. We will add a concise table and accompanying text reporting these statistics. The mapping was performed by the author team following the published O*NET and BIAN definitions; no multi-rater agreement statistic was computed. Validation against external financial-services experts was not performed. We will explicitly note both points as limitations and will not claim expert-validated mappings.

revision: partial
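The bootstrap-derived standard errors promised in the second response could be obtained along the following lines. This is a sketch under stated assumptions: comparisons are stored as `(model_a, model_b, outcome_a, weight)` tuples, ratings start at 1500 with a base K of 32, and whole comparisons are resampled with replacement. None of these choices is confirmed by the page, and because sequential Elo depends on match order, resampling also perturbs that order.

```python
import random
import statistics


def run_elo(matches, base_k=32.0, start=1500.0):
    """Sequential Elo over (model_a, model_b, outcome_a, weight) tuples."""
    ratings = {}
    for a, b, outcome_a, weight in matches:
        ra = ratings.setdefault(a, start)
        rb = ratings.setdefault(b, start)
        expected_a = 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))
        delta = base_k * weight * (outcome_a - expected_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta
    return ratings


def bootstrap_standard_errors(matches, n_resamples=200, seed=0):
    """Standard deviation of each model's rating across resampled match lists."""
    rng = random.Random(seed)
    matches = list(matches)
    samples = {}
    for _ in range(n_resamples):
        resample = rng.choices(matches, k=len(matches))
        for model, rating in run_elo(resample).items():
            samples.setdefault(model, []).append(rating)
    # A model needs at least two resampled ratings for a standard deviation.
    return {m: statistics.stdev(v) for m, v in samples.items() if len(v) > 1}


# Made-up comparisons between three hypothetical models.
example_matches = [
    ("m1", "m2", 1.0, 1.0),
    ("m2", "m3", 1.0, 0.5),
    ("m1", "m3", 1.0, 1.0),
    ("m3", "m1", 0.5, 0.2),
]
standard_errors = bootstrap_standard_errors(example_matches, n_resamples=100, seed=1)
```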

standing simulated objections not resolved

Direct validation of the O*NET/BIAN taxonomy mappings against judgments from practicing financial-services experts, which was outside the scope of the original study.

Circularity Check

0 steps flagged

No circularity: weighting from external benchmark properties applied to standard Elo


The abstract defines the weighting scheme (discrimination × coverage × recency) from observable benchmark properties computed over a rolling model window, then applies those weights to scale the K-factor of a standard pairwise Elo system. Work-activity scores are produced by the Elo process and aggregated as weighted averages into BIAN domains. No equations, self-citations, or derivations are shown that reduce the final scores to the inputs by construction; the outcome model for pairwise comparisons is left implicit but the weighting itself is not tautological. This matches the reader's assessment of only minor non-circular elements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond reliance on external taxonomies (O*NET, BIAN) and the standard Elo rating system.




[1] [1]

Australian Prudential Regulation Authority. 2026. `` APRA Letter to Industry on Artificial Intelligence ( AI ).'' APRA. https://www.apra.gov.au/apra-letter-to-industry-on-artificial-intelligence-ai

2026

[2] [2]

Australian Securities and Investments Commission. 2024. `` REP 798 Beware the Gap: Governance Arrangements in the Face of AI Innovation.'' ASIC. https://asic.gov.au/regulatory-resources/find-a-document/reports/rep-798-beware-the-gap-governance-arrangements-in-the-face-of-ai-innovation/

2024

[3] [3]

Bank for International Settlements Financial Stability Institute. 2024. ``Regulating AI in the Financial Sector: Recent Developments and Main Challenges.'' FSI Insights on Policy Implementation 63. Bank for International Settlements. https://www.bis.org/fsi/publ/insights63.htm

2024

[4] [4]

Banking Industry Architecture Network. 2024. `` BIAN Service Landscape 14.0.0.'' https://bian.org/servicelandscape-14-0-0/

2024

[5] [5]

Chen, Simin, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, et al. 2025. ``Recent Advances in Large Language Model Benchmarks Against Data Contamination: From Static to Dynamic Evaluation.'' arXiv Preprint arXiv:2502.17521. https://arxiv.org/abs/2502.17521

work page arXiv 2025

[6] [6]

Chiang, Wei-Lin, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, et al. 2024. ``Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference.'' In Proceedings of the 41st International Conference on Machine Learning. https://arxiv.org/abs/2403.04132

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Fourrier, Clémentine, Nathan Habib, Alina Lozada, Kuba Szafer, Thomas Wolf, Julien Launay, and Edward Beeching. 2024. ``Open LLM Leaderboard V2.'' https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard

2024

[8] [8]

Gao, Leo, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, et al. 2024. ``A Framework for Few-Shot Language Model Evaluation.'' https://github.com/EleutherAI/lm-evaluation-harness

2024

[9] [9]

Guldimann, Philipp, Alexander Spiridonov, Robin Staab, Nikola Jovanović, Mark Vero, Velko Vechev, Anna-Maria Gueorguieva, et al. 2024. `` COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act.'' arXiv Preprint arXiv:2410.07959. https://arxiv.org/abs/2410.07959

work page arXiv 2024

[10] [10]

Hendrycks, Dan, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. ``Measuring Massive Multitask Language Understanding.'' https://arxiv.org/abs/2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2021

[11] [11]

Islam, Pranab, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. 2023. `` FinanceBench : A New Benchmark for Financial Question Answering.'' arXiv Preprint arXiv:2311.11944. https://arxiv.org/abs/2311.11944

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

Kiela, Douwe, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, et al. 2021. ``Dynabench: Rethinking Benchmarking in NLP ,'' 4110--24. https://arxiv.org/abs/2104.14337

work page arXiv 2021

[13] [13]

Liang, Percy, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, et al. 2023. ``Holistic Evaluation of Language Models.'' Transactions on Machine Learning Research. https://arxiv.org/abs/2211.09110

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

https://llm-stats.com

`` LLM Stats : A ggregated LLM Benchmark Results.'' 2024. https://llm-stats.com

2024

[15] [15]

National Center for O*NET Development. 2024. `` O*NET Database: Generalized Work Activities.'' U.S. Department of Labor, Employment and Training Administration. https://www.onetcenter.org/database.html

2024

[16] [16]

National Institute of Standards and Technology. 2024. ``Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile ( NIST AI 600-1 ).'' NIST. https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence

2024

[17] [17]

OpenAI. 2024. ``Introducing SWE -Bench Verified.'' https://openai.com/index/introducing-swe-bench-verified/

2024

[18] [18]

Patil, Shishir G, Tianjun Zhang, Xingyao Wang, and Joseph E Gonzalez. 2023. ``Berkeley Function Calling Leaderboard ( BFCL ).'' https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html

2023

[19]

Phan, Long, Alice Gatti, Ziwen Han, Fan Li, Tianyu Hu, Jeffrey Zhang, Aliaksei Doroshenko, et al. 2025. ``Humanity's Last Exam.'' arXiv Preprint arXiv:2501.14249. https://arxiv.org/abs/2501.14249

[20]

Rein, David, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. 2024. ``GPQA: A Graduate-Level Google-Proof Q&A Benchmark.'' https://arxiv.org/abs/2311.12022

[21]

Srivastava, Aarohi, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, et al. 2023. ``Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.'' Transactions on Machine Learning Research. https://arxiv.org/abs/2206.04615

[22]

Stanford CRFM. 2024. ``HELM Finance: Holistic Evaluation of Language Models on Financial Tasks.'' https://crfm.stanford.edu/helm/finance/latest/

[23]

Suzgun, Mirac, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, et al. 2023. ``Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them.'' https://arxiv.org/abs/2210.09261

[24]

Wang, Alex, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. ``SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems.'' 32. https://arxiv.org/abs/1905.00537

[25]

Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. ``GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.'' https://arxiv.org/abs/1804.07461

[26]

Wang, Yubo, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, et al. 2024. ``MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark.'' In Advances in Neural Information Processing Systems. Vol. 37. https://arxiv.org/abs/2406.01574

[27]

White, Colin, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, et al. 2025. ``LiveBench: A Challenging, Contamination-Limited LLM Benchmark.'' In Proceedings of the Thirteenth International Conference on Learning Representations. https://arxiv.org/abs/2406.19314

[28]

Wu, Shijie, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023. ``BloombergGPT: A Large Language Model for Finance.'' arXiv Preprint arXiv:2303.17564. https://arxiv.org/abs/2303.17564

[29]

Xie, Qianqian, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, et al. 2024. ``FinBen: A Holistic Financial Benchmark for Large Language Models.'' In Advances in Neural Information Processing Systems. Vol. 37. https://arxiv.org/abs/2402.12659

[30]

Xie, Tianbao, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shi, et al. 2024. ``OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments.'' arXiv Preprint arXiv:2404.07972. https://arxiv.org/abs/2404.07972

[31]

Xu, Ruijie, Zengzhi Wang, Run-Ze Fan, and Pengfei Liu. 2024. ``Benchmarking Benchmark Leakage in Large Language Models.'' arXiv Preprint arXiv:2404.18824. https://arxiv.org/abs/2404.18824

[32]

Yao, Shunyu, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2025. ``\(\tau\)-Bench: A Benchmark for Tool--Agent--User Interaction in Real-World Domains.'' In Proceedings of the Thirteenth International Conference on Learning Representations. https://arxiv.org/abs/2406.12045