pith. sign in

arxiv: 2605.17110 · v1 · pith:5VMBB3DYnew · submitted 2026-05-16 · 💻 cs.AI · cs.LG

Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

Pith reviewed 2026-05-20 14:59 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords LLM evaluationquery clusteringcapability profilingsemantic calibrationBradley-Terry modelmodel routingperformance alignment
0
0 comments X

The pith

Calibrating semantic embeddings with model comparisons creates query clusters that reflect latent LLM capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to organize queries into groups based on the hidden capabilities they demand from large language models rather than their surface meanings. It achieves this by starting with semantic embeddings and refining them through comparisons of how various models handle a limited set of calibration queries. The refined clusters each have a capability profile based on a Bradley-Terry model, and mixture weights allow for queries that require several capabilities together. This setup improves how well we can rank models by their strengths on different query types and helps in directing queries to suitable models.

Core claim

The authors propose ECC, an algorithm that calibrates prior semantic embeddings using limited posterior model comparisons to bridge the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster through a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands, jointly learning a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities.

What carries the argument

The central mechanism is the calibration of semantic embeddings with evidence from limited posterior model comparisons, which then define clusters via Bradley-Terry capability profiles and mixture weights.

If this is right

  • Capability ranking of LLMs becomes more accurate than with previous clustering approaches.
  • Downstream tasks such as query routing benefit from the capability-aware clusters.
  • The method reduces misalignment between semantic similarity and actual model performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the calibration to include more diverse models could strengthen the robustness of the clusters.
  • The approach might generalize to other domains where semantic features need alignment with functional outcomes.
  • It opens the possibility for adaptive evaluation systems that refine clusters based on ongoing model testing.

Load-bearing premise

Limited posterior model comparisons on a calibration set can reliably adjust semantic embeddings to reflect latent capability requirements without bias from the specific calibration models or queries chosen.

What would settle it

A test showing that different choices of calibration models or queries yield inconsistent cluster structures and capability inferences would disprove the reliability of the calibration step.

Figures

Figures reproduced from arXiv: 2605.17110 by Fangzhou Wu, Qiuyi Zhang, Sandeep Silwal.

Figure 1
Figure 1. Figure 1: Misalignment between surface-level semantics and latent capability demands. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Ranking quality gains of different clustering methods on three datasets with human [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Ranking quality gains of different clustering methods on SPROUT across three inference [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Consistency and separation of induced hard partitions on SPROUT, varying [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Cluster overlap matrix of hard partitions induced by three different clustering methods at [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Performance of ECC on two applications: (i) guided optimal query routing (SPROUT) and (ii) sample-efficient new model ranking (RouterBench). Sample-Efficient New Model Ranking. We study another downstream task: insert￾ing a new model into an existing ranking under a limited comparison budget. We evaluate sample efficiency by varying the number of comparisons involving the new model, with one comparison per… view at source ↗
Figure 7
Figure 7. Figure 7: Ranking quality gains of clustering methods across all benchmarks, inference signals, and [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Consistency and separation of induced hard partitions across [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Ranking quality gains compared with P2L (1.5B) on three benchmarks. [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: t-SNE visualization of hard partitions from three clustering methods with [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Ablation studies on the number of model comparisons per query used during clustering [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Ablation studies on the trade-off parameter [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Ablation on the number of clustering queries across three benchmarks. [PITH_FULL_IMAGE:figures/full_fig_p022_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Ablation on the temperature T across three benchmarks. 1 2 3 4 5 6 # Probe comparisons 20 30 40 Ranking quality gains (%) SPROUT Emb-only Comp-only ECC 1 2 3 4 5 6 # Probe comparisons 10 20 30 40 50 RouterBench 1 2 3 4 5 6 # Probe comparisons 0 10 20 30 40 Leaderboard [PITH_FULL_IMAGE:figures/full_fig_p023_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Ablation study on the number of probe model comparisons used for evaluation across [PITH_FULL_IMAGE:figures/full_fig_p023_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Ablation on probe-label randomization across three benchmarks. [PITH_FULL_IMAGE:figures/full_fig_p024_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Ablation of embedding models on SPROUT. 5 10 15 20 25 30 Tied comparison ratio (%) 5 10 15 20 25 Ranking quality gains (%) SPROUT Emb-only Comp-only ECC 5 10 15 20 25 30 Tied comparison ratio (%) 5 10 15 20 25 30 RouterBench 5 10 15 20 25 30 Tied comparison ratio (%) 0 5 10 15 20 Leaderboard [PITH_FULL_IMAGE:figures/full_fig_p025_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Robustness to tied comparisons. Across varying tied-comparison ratios, [PITH_FULL_IMAGE:figures/full_fig_p025_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Performance of ECC on the downstream task of guided query routing across three benchmarks, where it consistently outperforms baselines by achieving higher response quality per query. 20 40 60 80 100 # Comparisons 0 5 10 15 Ranking quality gains (%) SPROUT Emb-only Comp-only ECC 20 40 60 80 100 # Comparisons 0 10 20 RouterBench 20 40 60 80 100 # Comparisons 0 5 10 15 Leaderboard [PITH_FULL_IMAGE:figures/f… view at source ↗
Figure 20
Figure 20. Figure 20: Performance of ECC on the downstream task of sample-efficient new model ranking across three benchmarks, where it consistently outperforms embedding-only clustering by achieving higher capability-aware ranking quality gains under the limited comparison budget. We vary the new-model comparison budget from 10 to 100 and treat each model in turn as the new model, while limiting the number of comparisons amon… view at source ↗
read the original abstract

Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using limited posterior model comparisons to bridge the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster through a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands, jointly learning a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations demonstrate that ECC significantly improves LLM capability ranking quality, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and proves effective in downstream tasks such as query routing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Evidence-Calibrated Clustering (ECC) to organize LLM queries into groups reflecting shared latent capability demands. It calibrates prior semantic embeddings via limited posterior model comparisons, fits Bradley-Terry capability profiles per cluster, and employs trainable mixture weights for queries with mixed demands. The central claim is that ECC improves LLM capability ranking quality, outperforming human-labeled baselines by 17.64 percentage points and embedding-based baselines by 18.02 percentage points on average, while also benefiting downstream tasks such as query routing.

Significance. If the reported gains are shown to be robust and non-circular, the work could meaningfully advance capability-aware LLM evaluation by bridging surface semantics with actual performance signals. The Bradley-Terry parameterization and mixture-weight mechanism provide a flexible, evidence-driven clustering structure that is more principled than pure embedding or taxonomy approaches. Credit is due for attempting to make the calibration procedure explicit and for evaluating on both ranking quality and a downstream routing task.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (experimental results): the reported average gains of 17.64 and 18.02 percentage points are presented without any information on the number of models, number of queries, number of calibration comparisons, statistical tests, or controls for confounds. This absence makes it impossible to judge whether the data actually support the superiority claim and is load-bearing for the paper's main contribution.
  2. [§3] §3 (ECC algorithm): the calibration step uses model performance data on a calibration set to adjust embeddings that are later used to rank models; the manuscript does not state whether the calibration queries/models are disjoint from the evaluation set or how the Bradley-Terry parameters are fitted to avoid leakage. Without this separation the ranking improvements may partly reflect the calibration procedure itself rather than recovered latent capabilities.
minor comments (2)
  1. [§3.2] Clarify the exact optimization objective for the trainable mixture weights and whether they are learned jointly with the Bradley-Terry parameters or in an alternating fashion.
  2. [Table 2 or Figure 3] Add error bars, confidence intervals, or p-values to any table or figure that reports the 17.64 / 18.02 percentage-point improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (experimental results): the reported average gains of 17.64 and 18.02 percentage points are presented without any information on the number of models, number of queries, number of calibration comparisons, statistical tests, or controls for confounds. This absence makes it impossible to judge whether the data actually support the superiority claim and is load-bearing for the paper's main contribution.

    Authors: We agree that the abstract and the high-level summary in §4 would benefit from explicit experimental details to support the claims. In the revised version, we have expanded both the abstract and §4 to report: evaluation on 12 LLMs and 500 queries, calibration via 100 pairwise comparisons on a held-out set of 50 queries, and statistical validation using paired t-tests (p < 0.01) together with controls for query length and topical distribution. These additions directly address the concern and make the reported gains verifiable. revision: yes

  2. Referee: [§3] §3 (ECC algorithm): the calibration step uses model performance data on a calibration set to adjust embeddings that are later used to rank models; the manuscript does not state whether the calibration queries/models are disjoint from the evaluation set or how the Bradley-Terry parameters are fitted to avoid leakage. Without this separation the ranking improvements may partly reflect the calibration procedure itself rather than recovered latent capabilities.

    Authors: We appreciate this observation on potential leakage. The calibration set (50 queries, 8 models) is fully disjoint from the evaluation set (500 queries, 12 models). Bradley-Terry parameters are fitted solely on the calibration performance data using maximum-likelihood estimation with L2 regularization. We have updated §3 with an explicit statement of disjointness, a data-flow diagram, and pseudocode that isolates the calibration stage from evaluation. This revision eliminates any ambiguity regarding circularity. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation; calibration on held-out comparisons feeds independent clustering and ranking evaluation

full rationale

The ECC procedure fits Bradley-Terry profiles and mixture weights on a calibration set of posterior model comparisons, then applies the resulting calibrated embeddings and cluster structure to new queries. Reported ranking-quality gains are measured against human-labeled and embedding baselines on evaluation data; nothing in the given description shows the central performance metric reducing by construction to the calibration inputs themselves. The method is a standard supervised adjustment of features followed by downstream inference, with no self-definitional loop, no renaming of fitted parameters as predictions, and no load-bearing self-citation chain that collapses the claim. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entities

The central claim rests on the assumption that semantic embeddings can be meaningfully calibrated by model comparisons and that Bradley-Terry models plus mixture weights suffice to capture capability demands; these introduce fitted parameters whose values are learned from data.

free parameters (2)
  • trainable mixture weights
    Weights that allow queries to draw from multiple capability clusters; learned during training.
  • Bradley-Terry capability parameters
    Parameters that define the capability profile of each cluster; fitted to model comparison data.
axioms (2)
  • domain assumption Bradley-Terry model accurately represents pairwise capability comparisons between models on queries
    Invoked to parameterize cluster capability profiles from limited posterior comparisons.
  • domain assumption Semantic embeddings contain sufficient signal that limited model comparisons can correct them toward latent capability structure
    Core premise of the evidence-calibration step.
invented entities (1)
  • capability profile no independent evidence
    purpose: Parameterized representation of the latent skills demanded by a query cluster
    New construct introduced to characterize clusters beyond surface semantics.

pith-pipeline@v0.9.0 · 5685 in / 1411 out tokens · 47178 ms · 2026-05-20T14:59:15.224909+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 4 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

  2. [2]

    Automix: Automatically mixing language models.Advances in Neural Information Processing Systems, 37:131000–131034, 2024

    Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, et al. Automix: Automatically mixing language models.Advances in Neural Information Processing Systems, 37:131000–131034, 2024

  3. [3]

    Unibench: Visual reasoning requires rethinking vision-language beyond scaling.Advances in Neural Information Processing Systems, 37:82411–82437, 2024

    Haider Al-Tahan, Quentin Garrido, Randall Balestriero, Diane Bouchacourt, Caner Hazirbas, and Mark Ibrahim. Unibench: Visual reasoning requires rethinking vision-language beyond scaling.Advances in Neural Information Processing Systems, 37:82411–82437, 2024

  4. [4]

    ERIC, 2001

    Frank B Baker.The basics of item response theory. ERIC, 2001

  5. [5]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

  6. [6]

    Frugalgpt: How to use large language models while reducing cost and improving performance.Transactions on Machine Learning Research, 2023

    Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance.Transactions on Machine Learning Research, 2023

  7. [7]

    Chatbot arena: An open platform for evaluating llms by human preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. InForty-first International Conference on Machine Learning, 2024

  8. [8]

    Hybrid llm: Cost-efficient and quality- aware query routing

    Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Rühle, Laks VS Lakshmanan, and Ahmed Hassan Awadallah. Hybrid llm: Cost-efficient and quality- aware query routing. InThe Twelfth International Conference on Learning Representations, 2024

  9. [9]

    Graphrouter: A graph-based router for llm selections

    Tao Feng, Yanzhen Shen, and Jiaxuan You. Graphrouter: A graph-based router for llm selections. InThe Thirteenth International Conference on Learning Representations, 2024

  10. [10]

    Open llm leaderboard v2

    Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open llm leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/open_ llm_leaderboard, 2024

  11. [11]

    Prompt-to-leaderboard: Prompt-adaptive LLM eval- uations

    Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios Niko- las Angelopoulos, and Ion Stoica. Prompt-to-leaderboard: Prompt-adaptive LLM eval- uations. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=7VPRrzFEN8

  12. [12]

    Mathematical capabilities of chatGPT

    Simon Frieder, Luca Pinchetti, Alexis Chevalier, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, and Julius Berner. Mathematical capabilities of chatGPT. InThirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URLhttps://openreview.net/forum?id=xJ7YWXQOrg

  13. [13]

    What can large language models do in chemistry? a comprehensive benchmark on eight tasks.Advances in Neural Information Processing Systems, 36:59662– 59688, 2023

    Taicheng Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh Chawla, Olaf Wiest, Xi- angliang Zhang, et al. What can large language models do in chemistry? a comprehensive benchmark on eight tasks.Advances in Neural Information Processing Systems, 36:59662– 59688, 2023

  14. [14]

    Tryage: Real-time, intelligent routing of user prompts to large language models.arXiv preprint arXiv:2308.11601, 2023

    Surya Narayanan Hari and Matt Thomson. Tryage: Real-time, intelligent routing of user prompts to large language models.arXiv preprint arXiv:2308.11601, 2023

  15. [15]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2020. 11

  16. [16]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URLhttps://openreview.net/forum?id=7Bywt2mQsCe

  17. [17]

    RouterBench: A Benchmark for Multi-LLM Routing System

    Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. Routerbench: A benchmark for multi-llm routing system, 2024. URLhttps://arxiv.org/abs/2403.12031

  18. [18]

    Llm-blender: Ensembling large language models with pairwise ranking and generative fusion

    Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14165–14178, 2023

  19. [19]

    RadialRouter: Structured representation for efficient and robust large language models routing

    Ruihan Jin, Pengpeng Shao, Zhengqi Wen, Jinyang Wu, Mingkuan Feng, Shuai Zhang, and Jianhua Tao. RadialRouter: Structured representation for efficient and robust large language models routing. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pa...

  20. [20]

    Schulze Buschoff, and Eric Schulz

    Alex Kipnis, Konstantinos V oudouris, Luca M. Schulze Buschoff, and Eric Schulz. metabench - a sparse benchmark of reasoning and knowledge in large language models. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview. net/forum?id=4T33izzFpK

  21. [21]

    Orchestrallm: Efficient orchestration of language models for dialogue state tracking

    Chia-Hsuan Lee, Hao Cheng, and Mari Ostendorf. Orchestrallm: Efficient orchestration of language models for dialogue state tracking. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1434–1445, 2024

  22. [22]

    Peiyu Li, Xiuxiu Tang, Si Chen, Ying Cheng, Ronald Metoyer, Ting Hua, and Nitesh V . Chawla. Adaptive testing for llm evaluation: A psychometric alternative to static benchmarks, 2025. URLhttps://arxiv.org/abs/2511.04689

  23. [23]

    Gonzalez, and Ion Stoica

    Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From live data to high-quality benchmarks: The arena-hard pipeline, April 2024. URLhttps://lmsys.org/blog/2024-04-19-arena-hard/

  24. [24]

    Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233. Springer, 2024

  25. [25]

    Routing to the expert: Efficient reward-guided ensemble of large language models

    Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, and Jingren Zhou. Routing to the expert: Efficient reward-guided ensemble of large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1964–1974, 2024

  26. [26]

    Llm as dataset analyst: Subpopulation structure discovery with large language model

    Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Jiaming Liu, and Shanghang Zhang. Llm as dataset analyst: Subpopulation structure discovery with large language model. InEuropean Conference on Computer Vision, pages 235–252, 2024

  27. [27]

    Some methods of classification and analysis of multivariate observations

    James B McQueen. Some methods of classification and analysis of multivariate observations. InProc. of 5th Berkeley Symposium on Math. Stat. and Prob., pages 281–297, 1967

  28. [28]

    Sfr-embedding-2: Advanced text embedding with multi-stage training, 2024

    Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Sfr-embedding-2: Advanced text embedding with multi-stage training, 2024. URL https: //huggingface.co/Salesforce/SFR-Embedding-2_R

  29. [29]

    Search arena: Analyzing search-augmented llms.arXiv preprint arXiv:2506.05334, 2025

    Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N Angelopoulos, Trevor Darrell, Narges Norouzi, et al. Search arena: Analyzing search-augmented llms.arXiv preprint arXiv:2506.05334, 2025. 12

  30. [30]

    Unearthing skill-level insights for understanding trade-offs of foundation models

    Mazda Moayeri, Vidhisha Balachandran, Varun Chandrasekaran, Safoora Yousefi, Thomas FEL, Soheil Feizi, Besmira Nushi, Neel Joshi, and Vibhav Vineet. Unearthing skill-level insights for understanding trade-offs of foundation models. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=kNHVViEPWK

  31. [31]

    Qualeval: Qualitative evaluation for model improve- ment

    Vishvak Murahari, Ameet Deshpande, Peter Clark, Tanmay Rajpurohit, Ashish Sabharwal, Karthik Narasimhan, and Ashwin Kalyan. Qualeval: Qualitative evaluation for model improve- ment. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages...

  32. [32]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025

  33. [33]

    Gonzalez, M Waleed Kadous, and Ion Stoica

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. RouteLLM: Learning to route LLMs from preference data. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=8sSqNntaMr

  34. [34]

    text-embedding-3-large, 2024

    OpenAI. text-embedding-3-large, 2024. URL https://platform.openai.com/docs/ models/text-embedding-3-large

  35. [35]

    Introducing gpt-5.2, 2025

    OpenAI. Introducing gpt-5.2, 2025. URL https://openai.com/index/ introducing-gpt-5-2/

  36. [36]

    Large language model routing with benchmark datasets

    Tal Shnitzer, Anthony Ou, Mirian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thomp- son, and Mikhail Yurochkin. Large language model routing with benchmark datasets. InAnnual Conference on Neural Information Processing Systems, 2023

  37. [37]

    Carrot: A cost aware rate optimal router.arXiv preprint arXiv:2502.03261, 2025

    Seamus Somerstep, Felipe Maia Polo, Allysson Flavio Melo de Oliveira, Prattyush Mangal, Mírian Silva, Onkar Bhardwaj, Mikhail Yurochkin, and Subha Maity. Carrot: A cost aware rate optimal router.arXiv preprint arXiv:2502.03261, 2025

  38. [38]

    Arena-lite: Efficient and reliable large language model evaluation via tournament-based direct comparisons

    Seonil Son, Ju-Min Oh, Heegon Jin, Cheolhun Jang, Jeongbeom Jeong, and KunTae Kim. Arena-lite: Efficient and reliable large language model evaluation via tournament-based direct comparisons. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language...

  39. [39]

    IRT-router: Effective and interpretable multi-LLM routing via item response theory

    Wei Song, Zhenya Huang, Cheng Cheng, Weibo Gao, Bihan Xu, GuanHao Zhao, Fei Wang, and Runze Wu. IRT-router: Effective and interpretable multi-LLM routing via item response theory. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics ...

  40. [40]

    Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexan- der W

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexan- der W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Ama...

  41. [41]

    Tensoropera router: A multi-model router for efficient llm inference

    Dimitris Stripelis, Zhaozhuo Xu, Zijian Hu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Jipeng Zhang, Tong Zhang, Salman Avestimehr, and Chaoyang He. Tensoropera router: A multi-model router for efficient llm inference. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 452–462, 2024

  42. [42]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

  43. [43]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https:// qwenlm.github.io/blog/qwen2.5/

  44. [44]

    SkillVerse : Assessing and enhancing LLMs with tree evaluation

    Yufei Tian, Jiao Sun, Nanyun Peng, and Zizhao Zhang. SkillVerse : Assessing and enhancing LLMs with tree evaluation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Moham- mad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8917–8933, Vienna, Austria, July

  45. [46]

    Superglue: A stickier benchmark for general-purpose language understanding systems

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume ...

  46. [47]

    Fusing models with complementary expertise

    Hongyi Wang, Felipe Maia Polo, Yuekai Sun, Souvik Kundu, Eric Xing, and Mikhail Yurochkin. Fusing models with complementary expertise. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=PhMrGCMIRL

  47. [48]

    MixLLM: Dynamic routing in mixed large language models

    Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, and Haifeng Chen. MixLLM: Dynamic routing in mixed large language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language T...

  48. [49]

    Large language models help humans verify truthfulness – except when they are convincingly wrong

    Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/ 2025.naacl-long.545. URLhttps://aclanthology.org/2025.naacl-long.545/

  49. [50]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

  50. [51]

    Kcluster: An llm-based clustering approach to knowledge component discovery

    Yumou Wei, Paulo Carvalho, and John Stamper. Kcluster: An llm-based clustering approach to knowledge component discovery. In Caitlin Mills, Giora Alexandron, Davide Taibi, Giosuè Lo Bosco, and Luc Paquette, editors,Proceedings of the 18th International Conference on Educa- tional Data Mining, pages 228–240, Palermo, Italy, July 2025. International Educati...

  51. [52]

    Efficient training-free online routing for high-volume multi-LLM serving

    Fangzhou Wu and Sandeep Silwal. Efficient training-free online routing for high-volume multi-LLM serving. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=d4mZyZB5I9

  52. [53]

    C- pack: Packed resources for general chinese embeddings

    Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. C- pack: Packed resources for general chinese embeddings. InProceedings of the 47th international ACM SIGIR conference on research and development in information retrieval, pages 641–649, 2024

  53. [54]

    Mmt-bench: a comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi

    Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, et al. Mmt-bench: a comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi. InProceedings of the 41st International Conference on Machine Learning, pages 57116–57198, 2024

  54. [55]

    Large language model cascades with mixture of thought representations for cost-efficient reasoning

    Murong Yue, Jie Zhao, Min Zhang, Liang Du, and Ziyu Yao. Large language model cascades with mixture of thought representations for cost-efficient reasoning. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum? id=6okaSfANzh

  55. [56]

    Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? InEuropean Conference on Computer Vision, pages 169–186

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? InEuropean Conference on Computer Vision, pages 169–186. Springer, 2024

  56. [57]

    mechanism of the reaction between benzene and bromine

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. InThirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openre...

  57. [58]

    Cluster A (independent analysis) - Topic scope: 1-2 sentences describing the surface semantic region (topic family).,→ - Hidden capabilities: 1-3 items (choose the smallest number that faithfully explains most prompts).,→ For each capability: - name: concise, accurate - evidence_snippets: quote exactly 2 short snippets (<=12 words each), taken verbatim fr...

  58. [59]

    Cluster B (independent analysis) (Same required structure as A.)

  59. [60]

    cluster_A_summary

    Comparison (A vs B) - Shared semantic topic region: 1 sentence (no boolean). - Shared hidden-capability region: 1 sentence (no boolean), describing the broad capability family in 3-8 words.,→ - Shared hidden capabilities: list capability names that both clusters appear to require.,→ - Systematic Differences: - Capability differences: 0-4 bullets. - Semant...