pith. machine review for the scientific record.

arxiv: 2605.08518 · v1 · submitted 2026-05-08 · 💻 cs.AI

Recognition: no theorem link

Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:43 UTC · model grok-4.3

classification 💻 cs.AI
keywords competition retrospective · multi-agent orchestration · leaderboard analysis · hidden evaluation · guardrail improvements · execution methods · participation patterns

The pith

In this 2025 multi-agent orchestration challenge, execution success came mostly from guardrail improvements like response selection and fallback handling rather than new agent architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper revisits the full set of results from a privacy-aware competition on industrial multi-agent systems. It merges public and private leaderboards, server logs, registration records, and source code from top entries to determine what the evaluation actually measured. Public planning scores reached a clear ceiling with no further gain from richer prompts, while execution scores on hidden tests often diverged sharply from public ones and sometimes improved. The analysis also shows that one scoring component had almost no effect on rankings and that active participation came from a small core of teams. Readers would care because the findings clarify which practical behaviors competitions reward and how evaluations can be adjusted to better track real progress.

Core claim

The central claim is that successful execution methods improved guardrails such as response selection, contamination cleanup, fallback procedures, and context control instead of introducing novel agent architectures, while hidden evaluation produced different outcomes from public leaderboards and the composite score gave negligible weight to one of its terms.

What carries the argument

A multi-source retrospective combining final rank sheets, a 300-submission server log, 149-team registrations, best-submission exports, and verified source trees to measure score correlations and classify rewarded behaviors.
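
A minimal sketch of the correlation step, assuming hypothetical CSV exports of the public and hidden leaderboards; the file names and columns below are placeholders, not the paper's released artifacts:

```python
# Sketch only: join public and hidden leaderboards on team id, then compute
# Pearson r. File names and the team/score schema are assumptions.
import csv
from statistics import mean

def load_scores(path):
    with open(path, newline="") as f:
        return {row["team"]: float(row["score"]) for row in csv.DictReader(f)}

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

public = load_scores("public_scores.csv")   # hypothetical export
hidden = load_scores("hidden_scores.csv")   # hypothetical export
teams = sorted(public.keys() & hidden.keys())  # fully ranked teams only
r = pearson_r([public[t] for t in teams], [hidden[t] for t in teams])
print(f"public-vs-hidden r = {r:.2f}")  # paper reports 0.69 (planning), -0.13 (execution)
```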

If this is right

  • Public planning performance saturated at 72.73 percent and gained nothing from richer prompts.
  • Public and private execution scores correlated negatively (r = -0.13), so some systems that scored 45.45 percent publicly reached 63.64 percent on the hidden set.
  • The t-match term, combined on a 0-1 scale with 0-100 percentage scores, contributed at most 0.05 points per track to the composite, and rescaling it could swap the top two teams (see the sketch after this list).
  • Only 11 teams reached full rankings out of 149 registrations, with 52.3 percent of deduplicated registrations listing multiple usernames.
  • Top execution methods succeeded through better response selection, contamination cleanup, fallback, and context control rather than architectural novelty.
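
To make the scale mismatch concrete, here is a toy composite in Python. It is not the official scoring code, and all team values are invented; it only shows how a 0-1 term sits nearly inert next to 0-100 percentage scores until it is rescaled:

```python
# Toy illustration of the scale mismatch; NOT the official scoring code.
# A t-match term on a 0-1 scale barely moves a sum of 0-100 percentage
# scores, but rescaling it (factor s, as in the Figure 17 sweep) can
# change who leads. All values below are invented.
def composite(exec_pct, plan_pct, t_match, s=1.0):
    """Sum of two percentage scores plus a rescaled 0-1 t-match term."""
    return exec_pct + plan_pct + s * t_match

team_x = dict(exec_pct=63.64, plan_pct=72.73, t_match=0.20)  # hypothetical
team_y = dict(exec_pct=63.64, plan_pct=63.64, t_match=0.90)  # hypothetical

for s in (1.0, 100.0):  # released 0-1 scale vs. rescaled to percentage points
    cx = composite(**team_x, s=s)
    cy = composite(**team_y, s=s)
    print(f"s={s:>5}: X={cx:.2f}  Y={cy:.2f}  leader={'X' if cx > cy else 'Y'}")
# s=  1.0: X leads (the 0.70 t-match gap is worth only 0.7 points)
# s=100.0: Y leads (the same gap is now worth 70 points)
```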

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Releasing versioned source trees and logs would let others replicate and extend the classification of what worked.
  • Adding skill-level diagnostics could separate planning strength from execution strength more clearly than overall scores do.
  • Making the composite scoring scale-aware would prevent small terms from becoming numerically inert (one possible construction is sketched after this list).
  • The negative correlation between public and private execution scores indicates that public test distributions may not match the hidden ones.
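
One way to realize a scale-aware composite, offered as an editorial sketch rather than anything the paper specifies, is to min-max normalize each term across teams before weighting, so that no term is inert regardless of its raw scale:

```python
# Editorial sketch of a scale-aware composite: min-max normalize each term
# across teams before weighting. Term names, weights, and values are
# invented placeholders.
def normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def scale_aware_composite(terms_by_name, weights):
    # terms_by_name maps a term name to one raw value per team
    normed = {name: normalize(vals) for name, vals in terms_by_name.items()}
    n_teams = len(next(iter(terms_by_name.values())))
    return [
        sum(weights[name] * normed[name][i] for name in terms_by_name)
        for i in range(n_teams)
    ]

# Hypothetical three-team example: raw scales differ by three orders of magnitude.
terms = {
    "planning": [72.73, 63.64, 45.45],
    "execution": [45.45, 63.64, 63.64],
    "t_match": [0.02, 0.05, 0.01],
}
print(scale_aware_composite(terms, {"planning": 0.4, "execution": 0.4, "t_match": 0.2}))
```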

Load-bearing premise

The combined data from rank sheets, logs, registrations, exports, and source trees provide a complete and unbiased picture of all participating methods and behaviors.

What would settle it

Re-running the hidden execution evaluation on the top public submissions and finding that the same teams rank highest, or inspecting the source trees of top execution entries and finding novel agent architectures instead of guardrail changes, would show the main claim is incorrect.

Figures

Figures reproduced from arXiv: 2605.08518 by Chathurangi Shyalika, Dhaval Patel, James Rayfield, Ling Yue, Nianjun Zhou, Shuxin Lin, Suryanarayana Reddy Yarrabothula.

Figure 1
Figure 1: CODS 2025 AssetOpsBench competition framework. Submissions are evaluated across Planning and Execution tracks against four domain agents on multimodal industrial data. The blue star marks the transition from the open Development phase to the hidden Evaluation phase. view at source ↗
Figure 2
Figure 2: Editable TODO regions per track, mirroring the released starter templates. By design, the editable surface in Track 1 concentrates the variation controlled by the participant in the prompt and planning code, while Track 2 concentrates on workflow execution and context handling. This separation is a central methodological feature of the competition, although residual variation can still arise from packaging… view at source ↗
Figure 3
Figure 3: CODS 2025 AssetOpsBench leaderboards, with full rankings in the linked figure. view at source ↗
Figure 4
Figure 4: Benchmark fingerprint. Normalised computational cost profile per agent domain across five axes; raw values and provenance are provided in the linked figure. view at source ↗
Figure 5
Figure 5: Track 1 (left) scatters with 32% noise; Track 2 (right) forms three clusters… view at source ↗
Figure 6
Figure 6: Archetype taxonomy at K=5. Bar length: cluster share (%). Filled dots: medoids stable across both encoders. Italic terms: class-TF-IDF top tokens… view at source ↗
Figure 7
Figure 7: Alignment with the Agent-Eval Checklist. Green = satisfied, amber = partial, red = gap. view at source ↗
Figure 8
Figure 8: Execution environment. Agent and evaluator run in isolated containers. A recent practitioner report proposes a minimum-bar checklist for trustworthy agent benchmarks [26], published after our competition ended… view at source ↗
Figure 9
Figure 9: Official Agentic-AI competition website. view at source ↗
Figure 10
Figure 10: Agentic AI Challenge advertisement webpage at CODS 2025. view at source ↗
Figure 11
Figure 11: Registration form for the Agentic AI Challenge. view at source ↗
Figure 12
Figure 12: Hierarchical cluster structure of observed failure modes. The tree is organized across three abstraction levels: a single root node (left), seven failure-mode clusters at the middle level (one colour per cluster), and eight title variants at the leaf level (right). Count badges above each cluster node show the total number of instances belonging to that cluster. Lack of Final Answer (cluster 3, centre) is… view at source ↗
Figure 13
Figure 13: Competition overview. (a) Score distributions per team split by track (blue = Task Planning, orange = Task Execution). White diamonds denote the per-team mean; medal icons identify the top-three finishers by final combined score. Task Planning exhibits wider variance (IQR up to 27 points for high-submission teams) than Task Execution. (b) Best planning score vs. best execution score for each team. Circles… view at source ↗
Figure 14
Figure 14: Score trajectories over the competition window. Each dot is one finished submission; the step line shows the running best score for each team. Light grey dots in the background represent all submissions aggregated across teams, providing a reference for the overall scoring landscape. (a) Task Planning track: most teams converge within the first two weeks; Team A is a notable exception with a sustained asc… view at source ↗
Figure 15
Figure 15: Weekly submission activity heatmap. Each cell records the number of Finished submissions by a given team (rows) during a given calendar week (columns). Colour intensity encodes volume: white = zero submissions, dark blue = highest activity. Cell values are annotated directly; values ≥5 are printed in white for contrast. Two distinct engagement patterns are visible: sustained high-activity teams (Team A, T… view at source ↗
Figure 16
Figure 16: Learning dynamics and submission reliability. (a) Cumulative best score as a function of submission number, ordered chronologically within each team. Each marker is one finished submission; lines connect consecutive submissions per team. Most teams plateau within five submissions; Team A is the notable exception with continued improvement past submission 20. Team H plateaus early despite 31 total submiss… view at source ↗
Figure 17
Figure 17: Top-rank stability under score reparameterization. Each cell shows the identity of the top-ranked team for a given combination of execution weight α and t-match scaling factor s. The officially top-ranked team is stable only within a narrow band around the released configuration; the top-ranked team changes across large regions of the parameter space. Mean Kendall τ̄ = 0.61 (SD = 0.19) between the officia… view at source ↗
Figure 18
Figure 18: Distribution of public leaderboard scores. Scores cluster at a small set of discrete values arising from binary per-scenario evaluation (∼9.09 points per scenario with 11 scenarios total). Four teams share the maximum planning-track score of 72.73, illustrating that the public evaluation signal cannot discriminate between these teams on the planning track; a second cluster appears at 63.64 points. This sa… view at source ↗
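The discrete clusters in Figure 18 follow directly from binary per-scenario scoring: with 11 scenarios worth 100/11 points each, the 72.73 and 63.64 clusters correspond to passing exactly 8 or 7 scenarios.

```latex
% Score quantum for 11 binary-scored scenarios, and the two observed clusters
\[
  \Delta = \frac{100}{11} \approx 9.09, \qquad
  \frac{8}{11} \times 100 \approx 72.73, \qquad
  \frac{7}{11} \times 100 \approx 63.64.
\]
```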
Figure 19
Figure 19: Distribution of rank shifts for both tracks. In the planning track, the distribution is approximately centred at zero with Spearman ρ = 0.62 (p = 0.04), indicating moderate and statistically significant public-to-hidden alignment: public planning scores provide a meaningful but imperfect signal of generalization to hidden planning scenarios. Maximum absolute shifts reach 6 positions in the p… view at source ↗
Figure 20
Figure 20: Public vs. hidden score comparison per track. Planning (left panel) shows a recognisable positive relationship between public and hidden scores. Execution (right panel) shows scatter consistent with near-zero correlation, confirming that public execution scores do not provide a reliable proxy for hidden performance. Shaded regions indicate the 95% confidence band around the regression line where fitted. T… view at source ↗
Figure 21
Figure 21: Distribution of computational cost metrics across Phase 1 executions. All three metrics are right-skewed. Token consumption: median 54K, mean 110K, CV = 1.47, indicating that a small number of high-token executions dominate aggregate cost. API call depth and wall-clock duration are similarly right-skewed but with lower dispersion, reflecting structural constraints of the agent architecture rather than per-toke… view at source ↗
Figure 22
Figure 22: Cost comparison between Phase 1 (Development) and Phase 2 (Evaluation). Error bars denote one standard deviation. Token consumption (p = 0.82) and wall-clock duration (p = 0.27) are statistically indistinguishable, confirming comparable computational demands across scenario sets. API call depth differs significantly (p = 0.004): Phase 1 averages 11.5 calls vs. 10.0 in Phase 2, consistent with more explora… view at source ↗
Figure 23
Figure 23: Mean execution cost per agent domain across all three metrics. Error bars denote one standard deviation. WO is the most expensive domain across token consumption and API calls; E2E has the highest wall-clock duration despite moderate token consumption, exposing an orchestration latency cost invisible to token-based analysis. TSFM is consistently the least expensive domain. view at source ↗
Figure 24
Figure 24: Execution cost comparison between single-agent and multi-agent (E2E) partitions. Single-agent executions consume nearly twice the tokens of multi-agent executions (121K vs. 63K, t = 7.18, p < 0.001), driven by the dominance of token-intensive WO Decision Support scenarios in the single-agent pool. Multi-agent executions have higher wall-clock duration despite lower token consumption, reflecting orchestrat… view at source ↗
Figure 25
Figure 25: Mean token consumption per scenario ordered by difficulty. Left panel: Phase 1 (Development); right panel: Phase 2 (Evaluation). Error bars denote one standard deviation across runs. Colours indicate agent domain. The 18-fold difficulty range between the hardest scenario (Q424, WO, 373K tokens) and the easiest (Q201, TSFM, 20K tokens) confirms extreme difficulty heterogeneity. Spearman ρ = 0.89 (p < 0.001… view at source ↗
Figure 26
Figure 26: Feature correlation heatmap showing relationships between trajectory features and task… view at source ↗
Figure 27
Figure 27: Tool entropy vs. task success. Successful executions exhibit lower entropy. view at source ↗
Figure 28
Figure 28: Number of steps vs. task success. Successful executions are shorter. view at source ↗
Figure 29
Figure 29: Token usage vs. task success. Successful executions use fewer tokens. view at source ↗
Figure 30
Figure 30: K-means cluster-quality sweeps. Solid lines:… view at source ↗
Figure 31
Figure 31: Encoder-stability heatmap: shared top-5 medoids per matched cluster pair between MiniLM (rows) and BGE (columns) at K=5. Chance overlap under an independent random assignment is ≈1 medoid per pair. Track 2 (right) has nearly twice the planning-side stability (11 vs. 6 shared medoids). view at source ↗
Figure 32
Figure 32: Distribution of cluster sizes across failure-mode labels. Most clusters are small (size 1–3)… view at source ↗
Original abstract

Competition retrospectives are useful when they explain what a leaderboard measured, how hidden evaluation changed conclusions, and which design patterns were rewarded. We revisit the CODS 2025 AssetOpsBench-Live challenge, a privacy-aware Codabench competition on industrial multi-agent orchestration built on AssetOpsBench. We combine final rank sheets, a 300-submission server log, 149-team registrations, best-submission exports, the organizer winners report, the companion AssetOpsBench-Live system paper, and verified planning-track source trees. Five results stand out. First, the public planning leaderboard saturates at 72.73%, and richer prompts do not improve that peak. Second, hidden evaluation changes the story: public and private scores correlate moderately in planning (r = 0.69) but negatively in execution (r = -0.13), with several 45.45% public execution systems reaching 63.64% on the hidden set. Third, the t-match term is numerically almost inert in the official composite: combined on a 0-1 scale with 0-100 percentage scores, it contributes at most 0.05 points per track, and rescaling would swap the top two teams. Fourth, the competition is operationally account-based but substantively team-based: 149 registered teams reduce to 24 with non-zero public scores and 11 fully ranked, while 52.3% of deduplicated registrations list multiple usernames. Fifth, successful execution methods mostly improve guardrails (response selection, contamination cleanup, fallback, and context control) rather than novel agent architectures. These findings identify which behaviors the evaluation rewarded, and they motivate scale-aware composites, skill-level diagnostics, and versioned artifact release.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents a retrospective analysis of the CODS 2025 AssetOpsBench Challenge, integrating final rank sheets, a 300-submission server log, 149-team registrations, best-submission exports, the organizer winners report, the companion system paper, and verified planning-track source trees. It reports five observational results: saturation of the public planning leaderboard at 72.73% with no gains from richer prompts; moderate positive correlation (r = 0.69) between public and private planning scores but negative correlation (r = -0.13) in execution, with some public 45.45% systems reaching 63.64% privately; negligible contribution of the t-match term to the composite score; a competition that is substantively team-based despite account-level registration (52.3% of deduplicated registrations list multiple usernames, and 149 registrations reduce to 24 teams with non-zero public scores and 11 fully ranked); and successful execution methods primarily improving guardrails (response selection, contamination cleanup, fallback, context control) rather than novel agent architectures.

Significance. If the empirical observations hold, the paper usefully documents what the AssetOpsBench evaluation actually rewarded, including the limited informativeness of public leaderboards and the outsized role of execution guardrails. This can inform the design of future multi-agent orchestration benchmarks and highlight the need for scale-aware scoring and versioned artifact releases. The multi-source data integration is a positive feature for transparency in competition retrospectives.

major comments (1)
  1. [Fifth result paragraph] Fifth result: The claim that successful execution methods 'mostly improve guardrails—response selection, contamination cleanup, fallback, and context control—rather than novel agent architectures' is based on inspection of best-submission exports and verified source trees from the 11 fully ranked teams. No coding scheme, inter-rater protocol, or quantitative breakdown (e.g., how many of the 11 were guardrail-only vs. architecture-plus-guardrail) is supplied, so the 'mostly' qualifier cannot be independently verified from the released artifacts. The pipeline from 149 registrations to 11 ranked teams (with 52.3% multi-username deduplication) introduces a plausible selection bias if teams with more architectural novelty were disproportionately filtered out by hidden evaluation.
minor comments (2)
  1. [Abstract and results] The abstract and results sections supply limited detail on the exact statistical procedures (e.g., how correlations were computed, handling of ties, or robustness checks) and any bias diagnostics applied to the filtered datasets.
  2. [Throughout] Challenge-specific terminology such as the t-match term, AssetOpsBench-Live, and AssetOpsBench should be defined on first use, as readers may not be familiar with it.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thorough review and positive assessment of the manuscript's transparency and potential utility for future benchmark design. We address the single major comment below and will incorporate revisions to improve verifiability.

Point-by-point responses
  1. Referee: [Fifth result paragraph] Fifth result: The claim that successful execution methods 'mostly improve guardrails—response selection, contamination cleanup, fallback, and context control—rather than novel agent architectures' is based on inspection of best-submission exports and verified source trees from the 11 fully ranked teams. No coding scheme, inter-rater protocol, or quantitative breakdown (e.g., how many of the 11 were guardrail-only vs. architecture-plus-guardrail) is supplied, so the 'mostly' qualifier cannot be independently verified from the released artifacts. The pipeline from 149 registrations to 11 ranked teams (with 52.3% multi-username deduplication) introduces a plausible selection bias if teams with more architectural novelty were disproportionately filtered out by hidden evaluation.

    Authors: We agree that the current presentation relies on a qualitative inspection of the 11 source trees and best-submission exports without an explicit coding scheme or inter-rater protocol, which limits independent verification of the 'mostly' qualifier. We will revise the manuscript to add a dedicated subsection (or appendix table) that categorizes each of the 11 ranked teams' approaches according to whether they primarily modified guardrails (response selection, contamination cleanup, fallback, context control), introduced novel agent architectures, or combined both. This will include counts and brief descriptions drawn directly from the verified artifacts. Regarding selection bias, the 52.3% multi-username deduplication and reduction from 149 registrations to 11 fully ranked teams is already reported in the manuscript; we acknowledge that teams with greater architectural novelty could have been filtered by the hidden evaluation or by not submitting valid entries. We will expand the limitations paragraph to explicitly discuss this conditioning on successful ranked submissions and note that the analysis cannot rule out bias against more novel (but perhaps less robust) architectures. revision: yes
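
As a sketch of what that coding protocol could look like (the labels and counts below are invented placeholders, not the authors' actual categorization), two raters independently label each of the 11 ranked source trees and Cohen's kappa quantifies their agreement:

```python
# Sketch of an inter-rater protocol for the promised categorization table.
# Labels are invented placeholders; the real labeling would come from the
# verified source trees and best-submission exports.
from collections import Counter

LABELS = ("guardrail-only", "architecture-plus-guardrail", "novel-architecture")

# Hypothetical labels for the 11 fully ranked teams, one per rater.
rater_a = ["guardrail-only"] * 7 + ["architecture-plus-guardrail"] * 3 + ["novel-architecture"]
rater_b = ["guardrail-only"] * 6 + ["architecture-plus-guardrail"] * 4 + ["novel-architecture"]

def cohens_kappa(a, b):
    """Agreement corrected for chance, given two equal-length label lists."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in LABELS) / (n * n)
    return (observed - expected) / (1 - expected)

print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f} over {len(rater_a)} teams")
```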

Circularity Check

0 steps flagged

No circularity: direct empirical summarization of external competition data

Full rationale

The paper reports five results derived from external artifacts (rank sheets, server logs, registrations, best-submission exports, and verified source trees) without any derivations, equations, parameter fitting, or predictions. The fifth result classifies execution methods via code inspection of the 11 ranked trees; this is an empirical categorization, not a self-definitional reduction or fitted input renamed as prediction. No self-citation chains or uniqueness theorems are load-bearing for the central claims. The analysis is self-contained against the released competition data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on empirical observations from competition data rather than new theoretical constructs or fitted parameters.

axioms (1)
  • domain assumption: The provided competition artifacts (rank sheets, server logs, team registrations, source trees) are representative and complete for analyzing all submissions.
    The retrospective relies on these sources to draw conclusions about what the evaluation rewarded.

pith-pipeline@v0.9.0 · 5656 in / 1190 out tokens · 74520 ms · 2026-05-12T01:43:36.351344+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 1 internal anchor

  1. [1]

    Proceedings of the 13th International Conference on Data Science (CODS 2025)

    ACM India. Proceedings of the 13th International Conference on Data Science (CODS 2025). In ACM India Joint International Conference on Data Science and Management of Data, Pune, India, 2025. Association for Computing Machinery. URL https://ikdd.acm.org/cods-2025/. Formerly known as CODS-COMAD.

  2. [2]

    Generative large model security challenge (Tianchi platform)

    Alibaba Tianchi. Generative large model security challenge (Tianchi platform). https://tianchi.aliyun.com/competition/entrance/532362, 2025. Accessed: 2026-04-12.

  3. [3]

    CODS 2025 Competition Release

    AssetOpsBench. CODS 2025 Competition Release. https://github.com/IBM/AssetOpsBench/tree/neurips_2026_codabench, 2025. GitHub repository, neurips_2026_codabench branch.

  4. [4]

    A scenario-driven benchmark for industrial asset operations and maintenance

    AssetOpsBench. A scenario-driven benchmark for industrial asset operations and maintenance. https://huggingface.co/datasets/ibm-research/AssetOpsBench, 2026. Version 1.0.

  5. [5]

    AssetOpsBench Docker images, 2025

    AssetOpsBench Team. AssetOpsBench Docker images, 2025. URL https://quay.io/assetopsbench. Available at quay.io/assetopsbench/assetopsbench-basic and quay.io/assetopsbench/assetopsbench-extra.

  6. [6]

    MathArena: Evaluating LLMs on uncontaminated math competitions

    Mislav Balunovic, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. MathArena: Evaluating LLMs on uncontaminated math competitions. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.

  7. [7]

    General agent evaluation

    Elron Bandel, Asaf Yehudai, Lilach Eden, Yehoshua Sagron, Yotam Perlitz, Elad Venezian, Natalia Razinkov, Natan Ergas, Shlomit Shachor Ifergan, Segev Shlomov, et al. General agent evaluation. ICLR 2026 Workshop Agents in the Wild: Safety, Security, and Beyond (AIWILD), 2026.

  8. [8]

    Fair universe higgsml uncertainty dataset and competition

    Wahid Bhimji, Ragansu Chakkappai, Po-Wen Chang, Yuan-Tang Chou, Sascha Diefenbacher, Jordan Dudley, Ibrahim Elsharkawy, Steven Farrell, Aishik Ghosh, Cristina Giordano, et al. Fair universe higgsml uncertainty dataset and competition. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.

  9. [9]

    Why do multi-agent LLM systems fail?

    Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi-agent LLM systems fail? In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.

  10. [10]

    CodaLab and Codabench newsletter: What happened in 2025?

    CodaBench. Codalab and codabench newsletter: What happened in 2025?, 2025. URL https://docs.codabench.org/dev/Newsletters_Archive/CodaLab-in-2025/

  11. [11]

    Multi-Agent AI Competition on Industry 4.0 Tasks

    CODS 2025 AssetOps. Multi-Agent AI Competition on Industry 4.0 Tasks. https://www.codabench.org/competitions/10206/, 2025. Codabench competition page.

  12. [12]

    MedAI: Evaluating TxAgent's therapeutic agentic reasoning in the NeurIPS CURE-Bench competition

    Tim Cofala, Christian Kalfar, Jingge Xiao, Johanna Schrader, Michelle Tang, and Wolfgang Nejdl. MedAI: Evaluating TxAgent's therapeutic agentic reasoning in the NeurIPS CURE-Bench competition. arXiv preprint arXiv:2512.11682, 2025.

  13. [13]

    Dataset and lessons learned from the 2024 SaTML LLM capture-the-flag competition

    Edoardo Debenedetti, Javier Rando, Daniel Paleka, Fineas Silaghi, Dragos Albastroiu, Niv Cohen, Yuval Lemberg, Reshmi Ghosh, Rui Wen, Ahmed Salem, et al. Dataset and lessons learned from the 2024 SaTML LLM capture-the-flag competition. Advances in Neural Information Processing Systems, 37:36914–36937, 2024.

  14. [14]

    A technical report on “Erasing the Invisible”: the 2024 NeurIPS competition on stress testing image watermarks

    Mucong Ding, Bang An, Tahseen Rabbani, Chenghao Deng, Anirudh Satheesh, Souradip Chakraborty, Mehrdad Saberi, Yuxin Wen, Kyle Rui Sang, Aakriti Agrawal, et al. A technical report on “Erasing the Invisible”: the 2024 NeurIPS competition on stress testing image watermarks. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datas...

  15. [15]

    CURE-Bench: Benchmarking AI reasoning for therapeutic decision-making at scale

    Shanghua Gao, Richard Yuxuan Zhu, Zhenglun Kong, Xiaorui Su, Curtis Ginder, Sufian Aldogom, Ishita Das, Taylor Evans, Theodoros Tsiligkaridis, and Marinka Zitnik. CURE-Bench: Benchmarking AI reasoning for therapeutic decision-making at scale. https://curebench.ai,

  16. [16]

    Accessed: 2026

    NeurIPS 2025 Competition and Benchmark. Accessed: 2026

  17. [17]

    BERTopic: Neural topic modeling with a class-based TF-IDF procedure

    Maarten Grootendorst. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794, 2022.

  18. [18]

    Workshop on deepfake detection, localization, and interpretability (IJCAI 2025)

    IJCAI 2025 Workshop Organizers. Workshop on deepfake detection, localization, and interpretability (IJCAI 2025). https://deepfake-workshop-ijcai2025.github.io/main/index.html, 2025. Accessed: 2026-04-12.

  19. [19]

    Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J.J. Allaire, Rishi Bommasani, Magda Dubois, Gillian Hadfield, Andy Hall, Sara Hooker, Seth Lazar, Steve Newman, Dimitris Papailiopoulos, Shoshannah Tekofsky, Helen Toner, Cozmin Ududec, and Arvind Narayanan. Open-world evaluations for measuring frontier AI capabilities. https://cruxevals.c...

  20. [20]

    Learning to run a power network challenge: a retrospective analysis

    Antoine Marot, Benjamin Donnot, Gabriel Dulac-Arnold, Adrian Kelly, Aidan O’Sullivan, Jan Viebahn, Mariette Awad, Isabelle M Guyon, Patrick Panciatici, and Camilo Romero. Learning to run a power network challenge: a retrospective analysis. In Neural Information Processing Systems, 2021. URL https://api.semanticscholar.org/CorpusID:232110622.

  21. [21]

    Meta crag-mm challenge: Comprehensive rag benchmark for multi-modal multi-turn question answering

    Meta Reality Labs and Meta GenAI. Meta CRAG-MM challenge: Comprehensive RAG benchmark for multi-modal multi-turn question answering. https://www.aicrowd.com/challenges/meta-crag-mm-challenge-2025, 2025. KDD Cup 2025 Challenge. Accessed: 2026.

  22. [22]

    Assetopsbench: Benchmarking ai agents for task automation in industrial asset operations and maintenance, 2025

    Dhaval Patel, Shuxin Lin, James Rayfield, Nianjun Zhou, Chathurangi Shyalika Jayakody, Suryanarayana R Yarrabothula, Roman Vaculin, Natalia Martinez, Fearghal O’Donncha, and Jayant Kalagnanam. AssetOpsBench: A real-world evaluation benchmark for AI-driven task automation in industrial asset management. arXiv preprint arXiv:2506.03828, 2025.

  23. [23]

    AssetOpsBench-Live: Privacy-aware online evaluation of multi-agent performance in industrial operations

    Dhaval Patel, Nianjun Zhou, Shuxin Lin, James Rayfield, Chathurangi Shyalika, and Suryanarayana Reddy Yarrabothula. AssetOpsBench-Live: Privacy-aware online evaluation of multi-agent performance in industrial operations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 41658–41660, 2026.

  24. [24]

    Results of the Big ANN: NeurIPS’23 competition

    Harsha Vardhan Simhadri, Martin Aumüller, Amir Ingber, Matthijs Douze, George Williams, Magdalen Dobson Manohar, Dmitry Baranchuk, Edo Liberty, Frank Liu, Ben Landrum, et al. Results of the Big ANN: NeurIPS’23 competition. 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Track on Datasets and Benchmarks, 2024.

  25. [25]

    PutnamBench: Evaluating neural theorem-provers on the Putnam Mathematical Competition

    George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. PutnamBench: Evaluating neural theorem-provers on the Putnam Mathematical Competition. Advances in Neural Information Processing Systems, 37:11545–11569, 2024.

  26. [26]

    Polina Turishcheva, Paul G Fahey, Michaela Vystrčilová, Laura Hansel, Rachel Froebe, Kayla Ponder, Yongrong Qiu, Konstantin F Willeke, Mohammad Bashiri, Ruslan Baikulov, et al. Retrospective for the dynamic sensorium competition for predicting large-scale mouse primary visual cortex activity from videos. Advances in Neural Information Processing Systems, ...

  27. [27]

    How we broke top ai agent benchmarks: And what comes next

    Hao Wang, Qiuyang Mang, Alvin Cheung, Koushik Sen, and Dawn Song. How we broke top AI agent benchmarks: And what comes next. https://moogician.github.io/blog/2026/trustworthy-benchmarks-cont/, 2026. Accessed: 2026-04-12.

  28. [28]

    Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform

    Zhen Xu, Sergio Escalera, Adrien Pavão, Magali Richard, Wei-Wei Tu, Quanming Yao, Huan Zhao, and Isabelle Guyon. Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform. Patterns, 3(7), 2022.

  29. [29]

    ML4CFD competition: Results and retrospective analysis

    Mouadh Yagoubi, David Danan, Milad Leyli-abadi, Jocelyn Ahmed Mazari, Jean-Patrick Brunet, Abbas Kabalan, Fabien Casenave, Yuxin Ma, Giovanni Catalani, Jean Fesquet, et al. ML4CFD competition: Results and retrospective analysis. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.

  30. [30]

    Crag - comprehensive rag benchmark

    Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze Daniel Gui, Ziran Will Jiang, Ziyu Jiang, Lingkun Kong, Brian Moran, Jiaqi Wang, Yifan Ethan Xu, An Yan, Chenyu Yang, Eting Yuan, Hanwen Zha, Nan Tang, Lei Chen, Nicolas Scheffer, Yue Liu, Nirav Shah, Rakesh Wanga, Anuj Kumar, Wen-tau Yih, and Xin Luna Dong. Crag...

  31. [31]

    How reliable is language model micro-benchmarking?

    Gregory Yauney, Shahzaib Saqib Warraich, and Swabha Swayamdipta. How reliable is language model micro-benchmarking? In International Conference on Learning Representations (ICLR), 2026.

  32. [32]

    Which agent causes task failures and when? On automated failure attribution of LLM multi-agent systems

    Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, et al. Which agent causes task failures and when? On automated failure attribution of LLM multi-agent systems. arXiv preprint arXiv:2505.00212, 2025.