pith. sign in

arxiv: 2606.23032 · v3 · pith:IMJ5N5QFnew · submitted 2026-06-22 · 💻 cs.AI · q-fin.GN

IPO Finance Agent: Benchmark of LLM Financial Analysts Beyond Finance Agent v2, with Automated Rubric Generation, on the SpaceX (SPCX) IPO

Pith reviewed 2026-07-01 06:46 UTC · model grok-4.3

classification 💻 cs.AI q-fin.GN
keywords IPO due diligenceLLM financial analysisSpaceX S-1 filingautomated rubric generationcontextual retrievalfinancial agent benchmarkevaluator-optimizer pipeline
0
0 comments X

The pith

Zhipu GLM-5.2 reaches 79.8% accuracy on IPO diligence for SpaceX S-1 filings, exceeding the Finance Agent v2 ceiling of 57.9%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends Finance Agent v2 from periodic SEC filings to the distinct challenges of IPO due diligence on S-1 documents, which contain longer narratives on governance, pro forma accounting, and risk disclosures. It replaces naive chunk retrieval with contextual retrieval to handle document length and adds an evaluator-optimizer pipeline that extracts candidate facts, drafts rubrics, audits them for omissions and hallucinations, and iterates repairs until human review. A dataset of 1,000 questions is created, with 70 public questions released on the SpaceX S-1 filing. On this benchmark the leading model scores 79.8% while a cost-efficient alternative reaches 77.2% at 0.05 USD per query, both above prior leaderboard entries.

Core claim

By improving the agentic harness with contextual retrieval and introducing an automated evaluator-optimizer pipeline for rubric generation, the IPO Finance Agent benchmark evaluates LLMs on 70 SpaceX S-1 diligence questions; Zhipu GLM-5.2 achieves 79.8% accuracy and Xiaomi MiMo-2.5 Pro reaches 77.2% at 0.05 USD per query, both surpassing the Finance Agent v2 ceiling of 57.9% at higher cost.

What carries the argument

The evaluator-optimizer pipeline that extracts facts from model answers, consolidates them into draft criteria, audits for omissions, hallucinations, and redundancy, then iterates with LLM feedback to produce final rubrics.

If this is right

  • Contextual retrieval enables agentic systems to process lengthy S-1 filings where the prior harness produced no output.
  • Automated rubric generation allows scalable evaluation of financial tasks with reduced manual effort.
  • Models can now be ranked on a Pareto frontier of accuracy versus cost for IPO-specific diligence.
  • The public 70-question set supports reproducible testing while the private remainder reduces contamination risk.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rubric pipeline and retrieval upgrades could be applied to due diligence on other private companies preparing S-1 filings.
  • Cost-efficient models at 0.05 USD per query open the possibility of routine automated screening for smaller investors or advisors.
  • If the benchmark generalizes, it would shift evaluation focus from periodic reports to capital-formation and governance analysis.

Load-bearing premise

The 70 public SpaceX questions plus the automated rubric pipeline produce a faithful and unbiased measure of model capability on IPO due diligence.

What would settle it

A human review finding that the generated rubrics systematically omit key S-1 elements such as underwriting risk disclosures or capital-formation details would show the accuracy scores do not reflect true diligence performance.

Figures

Figures reproduced from arXiv: 2606.23032 by Mostapha Benhenda.

Figure 1
Figure 1. Figure 1: The rubric-construction pipeline. After Stage 4 quality evaluation, the rubric is routed to [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
read the original abstract

Finance Agent v2 (by Vals AI) has emerged as the reference benchmark for evaluating both Anthropic Claude and OpenAI ChatGPT frontier language models on financial tasks. However, it narrowly deals with periodic reporting from publicly traded companies (SEC 10-K and 10-Q filings), and its agentic harness relies on naive, unenriched chunk retrieval. Neither the task design nor the retrieval approach addresses the distinct challenges of IPO due diligence. SEC S-1 filings combine historical financial statements, governance structures, pro forma and common-control accounting treatments, capital-formation narratives, and underwriting-sensitive risk disclosures within substantially longer documents than typical periodic filings. That is why we introduce IPO Finance Agent, which extends the Finance Agent v2 framework along two directions: task domain and retrieval architecture. During our experiments, the original Finance Agent v2 harness basically failed to deliver any output related to the SpaceX S-1 filing, due to document length. We therefore had to improve the agentic harness with contextual retrieval, a more realistic and industry-standard approach for long documents. We also built a dataset of 1,000 IPO-diligence questions, and publicly release 70 questions on the SpaceX (SPCX) S-1 filing to support reproducibility, while the remainder are held private to guard against benchmark contamination. In addition, we introduce an evaluator-optimizer pipeline to automatically generate evaluation rubrics for the benchmark: candidate facts are extracted from model answers, consolidated into draft criteria, then automatically audited for omissions, hallucinations, mistiered items, and redundancy, with LLM feedback driving iterative repair, targeted enrichment, and deduplication. Human experts only review final rubrics before deployment. Results show that the best-performing evaluated model, Zhipu GLM-5.2, reaches 79.8% accuracy, and the most cost-efficient model on the resulting Pareto frontier, Xiaomi MiMo-2.5 Pro, reaches slightly lower accuracy (77.2%) at 0.05 USD per query, while exceeding the current Finance Agent v2 leaderboard ceiling, Google Gemini 3.5 Flash at 57.9% for 2.51 USD per query, and undercutting even FABv2's cheapest entry (MiniMax M3: 48.3% at 0.32 USD) on cost-efficiency. Code and data are released on GitHub https://github.com/benstaf/ipoagent

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces IPO Finance Agent as an extension of Finance Agent v2 for evaluating LLMs on IPO due diligence, using the SpaceX S-1 filing. It adds contextual retrieval for long documents, a 1,000-question dataset (70 public on SpaceX), and an automated evaluator-optimizer pipeline for rubric generation via fact extraction, consolidation, LLM audit, and human review. Reported results show Zhipu GLM-5.2 at 79.8% accuracy and Xiaomi MiMo-2.5 Pro at 77.2% (0.05 USD/query), exceeding the Finance Agent v2 ceiling of 57.9%; code and data are released on GitHub.

Significance. If the benchmark holds, the work supplies a new evaluation domain for financial LLMs that targets S-1-specific elements (governance, pro-forma accounting, underwriting risks) absent from periodic-filing benchmarks. The public code/data release and automated rubric pipeline are concrete strengths that enable reproducibility and extension by others.

major comments (3)
  1. [Rubric Generation] § on automated rubric generation (evaluator-optimizer pipeline): no inter-rater agreement, coverage statistics for key IPO topics (governance, pro-forma treatments, risk disclosures), or ablation results are reported. Without these, the headline accuracies (79.8%, 77.2%) cannot be shown to measure genuine diligence capability rather than pipeline artifacts.
  2. [Dataset Construction] Dataset section: the 70 public SpaceX questions are sampled from a 1,000-question pool, yet no sampling protocol, representativeness metrics, or public-vs-private coverage comparison is supplied. This directly affects whether the cost-accuracy Pareto claims generalize beyond the released subset.
  3. [Results] Results and comparison paragraphs: the claim that new results exceed the Finance Agent v2 ceiling of 57.9% treats the two benchmarks as comparable, but the manuscript itself notes that S-1 filings differ in length, accounting treatments, and risk focus from 10-K/10-Q tasks; no alignment or difficulty calibration is provided to support the cross-benchmark ranking.
minor comments (2)
  1. [Agentic Harness] The description of the contextual retrieval harness would be clearer with a short pseudocode block or flowchart.
  2. [Reproducibility] Ensure the GitHub repository link remains stable and includes the exact 70-question set and rubric examples used for the reported numbers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments identify clear opportunities to strengthen the presentation of the automated rubric pipeline, dataset details, and cross-benchmark comparisons. We address each point below and indicate the revisions we will make in the next version of the manuscript.

read point-by-point responses
  1. Referee: [Rubric Generation] § on automated rubric generation (evaluator-optimizer pipeline): no inter-rater agreement, coverage statistics for key IPO topics (governance, pro-forma treatments, risk disclosures), or ablation results are reported. Without these, the headline accuracies (79.8%, 77.2%) cannot be shown to measure genuine diligence capability rather than pipeline artifacts.

    Authors: We agree that quantitative validation of the rubric pipeline is necessary. The current pipeline already includes an LLM audit step for omissions, hallucinations, and redundancy followed by human expert review of final rubrics. In the revised manuscript we will add: (1) coverage statistics showing the distribution of rubric criteria across core IPO topics (governance, pro-forma accounting, risk disclosures, etc.); (2) ablation results comparing accuracy with and without the consolidation and audit stages; and (3) inter-rater agreement statistics from the human review phase (Cohen’s kappa or equivalent). These additions will be placed in a new subsection under the rubric-generation description and will directly address whether the reported accuracies reflect pipeline artifacts. revision: partial

  2. Referee: [Dataset Construction] Dataset section: the 70 public SpaceX questions are sampled from a 1,000-question pool, yet no sampling protocol, representativeness metrics, or public-vs-private coverage comparison is supplied. This directly affects whether the cost-accuracy Pareto claims generalize beyond the released subset.

    Authors: We accept this criticism. The 1,000-question pool was constructed by systematically extracting factual statements from each major section of the SpaceX S-1 and converting them into questions; the 70 public questions were chosen to span the same topic distribution while remaining self-contained. In revision we will: (a) explicitly describe the sampling protocol and stratification criteria; (b) report topic-level representativeness metrics (e.g., percentage of questions per S-1 section); and (c) provide a high-level comparison of topic coverage between the public 70 and the private remainder without disclosing private content. These details will be added to the Dataset Construction section. revision: yes

  3. Referee: [Results] Results and comparison paragraphs: the claim that new results exceed the Finance Agent v2 ceiling of 57.9% treats the two benchmarks as comparable, but the manuscript itself notes that S-1 filings differ in length, accounting treatments, and risk focus from 10-K/10-Q tasks; no alignment or difficulty calibration is provided to support the cross-benchmark ranking.

    Authors: We acknowledge the overstatement. The manuscript already notes the structural differences between S-1 and periodic filings, yet the comparison language implies direct equivalence. In the revised version we will: (1) remove or qualify the phrase “exceeding the Finance Agent v2 ceiling” and instead state that IPO Finance Agent achieves higher accuracy on a distinct, longer-document task with contextual retrieval; (2) add an explicit limitations paragraph clarifying the absence of difficulty calibration and the non-comparability of the two benchmarks; and (3) frame the cost-efficiency results strictly within the IPO setting. These changes will appear in the Results and Discussion sections. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical benchmark results independent of inputs

full rationale

The paper reports direct empirical accuracies (e.g., GLM-5.2 at 79.8%) obtained by running models on a fixed set of 70 public questions drawn from a SpaceX S-1 filing, scored via an automated rubric pipeline. No equations, fitted parameters, predictions, or derivations are present. The methodology describes question construction and rubric generation but does not reduce any reported result to its own inputs by construction. Comparisons to Finance Agent v2 are external benchmarks, not self-referential. This matches the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, invented entities, or non-standard axioms; the work rests on standard assumptions that accuracy on a fixed question set measures capability and that LLM-based rubric auditing is reliable after human review.

pith-pipeline@v0.9.1-grok · 6002 in / 1108 out tokens · 38078 ms · 2026-07-01T06:46:58.672170+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 12 canonical work pages · 8 internal anchors

  1. [1]

    Building Effective Agents

    Anthropic. Building Effective Agents. Dec 19, 2024. https://www.anthropic.com/ engineering/building-effective-agents

  2. [2]

    Introducing Contextual Retrieval

    Anthropic. Introducing Contextual Retrieval. Sep 19, 2024. https://www.anthropic.com/ news/contextual-retrieval

  3. [3]

    Claude for Financial Services

    Anthropic. Claude for Financial Services. Jul 15, 2025. https://www.anthropic.com/ news/claude-for-financial-services

  4. [4]

    Advancing Claude for Financial Services

    Anthropic. Advancing Claude for Financial Services. Oct 27, 2025.https://www.anthropic. com/news/advancing-claude-for-financial-services

  5. [5]

    Introducing Claude Opus 4.8

    Anthropic. Introducing Claude Opus 4.8. May 28, 2026. https://www.anthropic.com/ news/claude-opus-4-8

  6. [6]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073, 2022

  7. [7]

    FinQA: A Dataset of Numerical Reasoning over Financial Data

    Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. FinQA: A Dataset of Numerical Reasoning over Financial Data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

  8. [8]

    ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering

    Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022. 11

  9. [9]

    Securities and Exchange Commission

    U.S. Securities and Exchange Commission. EDGAR — Electronic Data Gathering, Analysis, and Retrieval System. https://www.sec.gov/about/about-edgar

  10. [10]

    Finance agent benchmark: Benchmarking llms on real-world financial research tasks.arXiv preprint arXiv:2508.00828, 2025

    Antoine Bigeard, Langston Nashold, Rayan Krishnan, and Shirley Wu. Finance Agent Benchmark: Benchmarking LLMs on Real-world Financial Research Tasks. arXiv preprint arXiv:2508.00828, 2025

  11. [11]

    Finance Agent v2 Benchmark

    Vals AI. Finance Agent v2 Benchmark. https://www.vals.ai/benchmarks/fabv2

  12. [12]

    FinanceBench: A New Benchmark for Financial Question Answering

    Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. FinanceBench: A New Benchmark for Financial Question Answering. arXiv preprint arXiv:2311.11944, 2023

  13. [13]

    FinBen: A Holistic Financial Benchmark for Large Language Models

    Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, et al. FinBen: A Holistic Financial Benchmark for Large Language Models. Advances in Neural Information Processing Systems 37 (NeurIPS Datasets and Benchmarks Track), 2024

  14. [14]

    FinRetrieval: A benchmark for financial data retrieval by AI agents

    Eric Y . Kim and Jie Huang. FinRetrieval: A Benchmark for Financial Data Retrieval by AI Agents. arXiv preprint arXiv:2603.04403, 2026

  15. [15]

    Time Travel in LLMs: Tracing Data Contamination in Large Language Models

    Shahriar Golchin and Mihai Surdeanu. Time Travel in LLMs: Tracing Data Contamination in Large Language Models. International Conference on Learning Representations (ICLR), 2024

  16. [16]

    Jimenez, John Yang, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

    Carlos E. Jimenez, John Yang, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? International Conference on Learning Representations (ICLR), 2024

  17. [17]

    Dense Passage Retrieval for Open-Domain Question Answering

    Vladimir Karpukhin, Barlas O ˘guz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

  18. [18]

    ColBERT: Efficient and Effective Passage Search via Con- textualized Late Interaction over BERT

    Omar Khattab and Matei Zaharia. ColBERT: Efficient and Effective Passage Search via Con- textualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2020

  19. [19]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems (NeurIPS), 2020

  20. [20]

    Holistic Evaluation of Language Models

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic Evaluation of Language Models. Transactions on Machine Learning Research (TMLR), 2023

  21. [21]

    IPO Underpricing

    Alexander Ljungqvist. IPO Underpricing. In B. Espen Eckbo (ed.), Handbook of Corporate Finance: Empirical Corporate Finance, V olume 1, Chapter 7, Elsevier, 2007

  22. [22]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as Agents. arXiv preprint arXiv:2308.03688, 2023

  23. [23]

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

  24. [24]

    Initial Public Offerings: A Synthesis of the Literature and Directions for Future Research

    Michelle Lowry, Roni Michaely, and Ekaterina V olkova. Initial Public Offerings: A Synthesis of the Literature and Directions for Future Research. Foundations and Trends in Finance, 11(3-4):154–320, 2017

  25. [25]

    Self-Refine: Iterative Refinement with Self-Feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-Refine: Iterative Refinement with Self-Feedback. Advances in Neural Information Processing Systems 36 (NeurIPS), 2023. 12

  26. [26]

    GAIA: A Benchmark for General AI Assistants

    Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A Benchmark for General AI Assistants. International Conference on Learning Representations (ICLR), 2024

  27. [27]

    Kimi K2.6 is now open-weight #1 on Finance Agent Bench- mark V2

    Moonshot AI (Kimi). “Kimi K2.6 is now open-weight #1 on Finance Agent Bench- mark V2.” Post on X-Twitter, May 14, 2026. https://x.com/Kimi_Moonshot/status/ 2054803169994272819

  28. [28]

    MultiFinBen: A Multilingual, Multimodal, and Difficulty- Aware Benchmark for Financial LLM Evaluation

    Xueqing Peng, Lingfei Qian, Yan Wang, Ruoyu Xiang, Yueru He, Yang Ren, Mingyang Jiang, Jeff Zhao, Huan He, Yi Han, et al. MultiFinBen: A Multilingual, Multimodal, and Difficulty- Aware Benchmark for Financial LLM Evaluation. arXiv preprint arXiv:2506.14028, 2025

  29. [29]

    Introducing GPT-5.5

    OpenAI. Introducing GPT-5.5. April 23, 2026. https://openai.com/index/ introducing-gpt-5-5/

  30. [30]

    Jay R. Ritter. The Long-Run Performance of Initial Public Offerings. Journal of Finance, 46(1):3–27, 1991

  31. [31]

    Ritter and Ivo Welch

    Jay R. Ritter and Ivo Welch. A Review of IPO Activity, Pricing, and Allocations. Journal of Finance, 57(4):1795–1828, 2002

  32. [32]

    Beyond the Imitation Game: Quantify- ing and Extrapolating the Capabilities of Language Models

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, et al. Beyond the Imitation Game: Quantify- ing and Extrapolating the Capabilities of Language Models. Transactions on Machine Learning Research (TMLR), 2023

  33. [33]

    Form S-1 Registration Statement

    Space Exploration Technologies Corp. Form S-1 Registration Statement. U.S. Securities and Exchange Commission, May 20, 2026. https://www.sec.gov/Archives/edgar/data/ 1181412/000162828026036936/spaceexplorationtechnologi.htm

  34. [34]

    VCBench: Benchmarking LLMs in Venture Capital

    Rick Chen, Joseph Ternasky, Afriyie Samuel Kwesi, Ben Griffin, Aaron Ontoyin Yin, Zakari Salifu, Kelvin Amoaba, Xianling Mu, Fuat Alican, and Yigit Ihlamur. VCBench: Benchmarking LLMs in Venture Capital. arXiv preprint arXiv:2509.14448, 2025

  35. [35]

    MBABench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

    Thomson Yen et al. WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance. arXiv preprint arXiv:2605.22664, 2026

  36. [36]

    $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045, 2024

  37. [37]

    YC Bench: a Live Benchmark for Forecasting Startup Outperformance in Y Combinator Batches

    Mostapha Benhenda. YC Bench: A Live Benchmark for Forecasting Startup Outperformance in Y Combinator Batches. arXiv preprint arXiv:2604.02378, 2026

  38. [38]

    TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance

    Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021

  39. [39]

    Judgelm: Fine-tuned large language models are scalable judges.arXiv preprint arXiv:2310.17631, 2023

    Lianghui Zhu, Xinggang Wang, and Xinlong Wang. JudgeLM: Fine-tuned Large Language Models are Scalable Judges. arXiv preprint arXiv:2310.17631, 2023

  40. [40]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023. 13