Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation

Donghwi Hwang; Hyun Oh Song; Jinuk Kim; Junsoo Byun; Seong-Jin Park

arxiv: 2605.15669 · v1 · pith:22V7R4NR · submitted 2026-05-15 · 💻 cs.LG

Pith reviewed 2026-05-20 20:43 UTC · model grok-4.3

keywords: benchmark, LLM agents, DRC script synthesis, design rule checking, execution feedback, program selection, chip layouts

The pith

A benchmark with 1,000 tasks and 13,921 layouts lets execution outcomes score LLM-generated DRC scripts and select the correct ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes Rule2DRC as a large-scale benchmark to test how well LLM agents translate natural language design rules into executable DRC scripts that enforce geometry constraints on chip layouts. Existing evaluations rely on small sets or code similarity rather than running the scripts on actual layouts. The work supplies an evaluation pipeline that scores functional correctness through DRC execution without giving the agent any evaluation layouts as input. It also presents SplitTester, a tester agent that generates discriminative test cases from execution feedback to separate candidate scripts that look similar but behave differently.

Core claim

Rule2DRC supplies 1,000 rule-to-script tasks together with 13,921 held-out chip layouts for execution-based scoring of synthesized DRC scripts. SplitTester uses execution feedback from running candidate scripts on test layouts it generates itself, not on the held-out layouts, to create test cases that distinguish previously indistinguishable candidate scripts and thereby raise Best-of-N selection accuracy.
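As a rough illustration of execution-based scoring, the sketch below treats a DRC script as a function that reports whether a layout violates the rule, and counts a candidate as correct only if it reproduces the ground-truth label on every evaluation layout. The `Layout` and `Script` types, the toy spacing rule, and the all-layouts-must-match criterion are assumptions made for illustration; the page does not state the benchmark's exact pass criterion.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-ins: a layout is a dict of measurements, and a "script"
# is a function that returns True when it reports a violation on a layout.
Layout = Dict[str, float]
Script = Callable[[Layout], bool]


def passes_task(candidate: Script, layouts: Sequence[Layout],
                labels: Sequence[bool]) -> bool:
    """Assumed criterion: the candidate matches the label on every layout."""
    return all(candidate(x) == y for x, y in zip(layouts, labels))


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of candidates (or tasks) that passed."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Toy rule in arbitrary units: spacing below 3 is a violation.
ground_truth: Script = lambda x: x["spacing"] < 3
layouts: List[Layout] = [{"spacing": 2.0}, {"spacing": 3.0}, {"spacing": 5.0}]
labels = [ground_truth(x) for x in layouts]  # violation labels from the reference

# Two candidates that look similar but differ at the boundary value.
equivalent: Script = lambda x: not (x["spacing"] >= 3)
off_by_one: Script = lambda x: x["spacing"] <= 3

outcomes = [passes_task(c, layouts, labels) for c in (equivalent, off_by_one)]
print(outcomes, success_rate(outcomes))
```

The boundary layout with spacing exactly 3 is what separates the two candidates, which is why a code-similarity metric would rate them as near-identical while execution does not.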

What carries the argument

SplitTester, a tester agent that converts execution feedback from DRC runs into discriminative test cases for ranking candidate scripts.
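The grouping step described in the Figure 2 caption (execute every candidate on the current tests, then cluster candidates by their output pattern) can be sketched as follows. This is a minimal sketch under stated assumptions: scripts are plain Python predicates over a single spacing value, the names `cluster_by_outputs` and `splits_cluster` are hypothetical, and neither the way SplitTester proposes new test layouts nor the way it scores clusters is modeled.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-in: a script maps a spacing value to "violation reported".
Script = Callable[[float], bool]


def cluster_by_outputs(scripts: Dict[str, Script],
                       tests: Sequence[float]) -> Dict[Tuple[bool, ...], List[str]]:
    """Group scripts whose outputs agree on every current test."""
    clusters: Dict[Tuple[bool, ...], List[str]] = defaultdict(list)
    for name, script in scripts.items():
        signature = tuple(script(t) for t in tests)
        clusters[signature].append(name)
    return dict(clusters)


def splits_cluster(test: float, members: Sequence[str],
                   scripts: Dict[str, Script]) -> bool:
    """A test is discriminative for a cluster if its members disagree on it."""
    return len({scripts[name](test) for name in members}) > 1


scripts: Dict[str, Script] = {
    "a": lambda s: s < 3,
    "b": lambda s: s <= 3,        # differs from "a" only at s == 3
    "c": lambda s: not s >= 3,    # same behavior as "a", written differently
}

tests = [2.0, 5.0]
before = cluster_by_outputs(scripts, tests)   # all three look identical so far

largest = max(before.values(), key=len)
candidate_test = 3.0                          # boundary case as a proposed new test
if splits_cluster(candidate_test, largest, scripts):
    tests.append(candidate_test)

after = cluster_by_outputs(scripts, tests)
print(before)
print(after)
```

Only tests that split an existing cluster add information for selection, so a filter like `splits_cluster` is one plausible way to decide whether a proposed test is worth keeping.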

Load-bearing premise

Execution feedback on generated test layouts yields test cases that reliably separate scripts with different functional behavior, even though the agent never sees the held-out evaluation layouts.

What would settle it

A collection of candidate scripts that all pass the SplitTester-generated tests yet produce inconsistent DRC violation reports on a fresh set of independent layouts would falsify the separation claim.

Figures

Figures reproduced from arXiv: 2605.15669 by Donghwi Hwang, Hyun Oh Song, Jinuk Kim, Junsoo Byun, Seong-Jin Park.

**Figure 1.** Overview of the Rule2DRC benchmark. Each task includes a natural language (NL) design rule r_i, a ground-truth design rule checking (DRC) script c_i, a set of evaluation chip layouts (x_ij), and design rule violation labels for each layout ϕ(x_ij, c_i). An LLM agent f(·) generates an executable DRC script from the NL description, using the DRC engine as a tool. We evaluate the generated script by running it o…

**Figure 2.** Illustration of the proposed SplitTester agent. SplitTester initially generates a set of test layouts (left blue boxes), then executes each candidate script and groups scripts into clusters C_i based on the output patterns across all current tests. s_i denotes the score of cluster C_i under the current tests. Gray boxes denote unevaluated scripts. Colored boxes denote evaluated scripts, where scripts share a colo…

**Figure 3.** Illustration of the same design rule implemented with different KLayout DRC grammar. Left: the natural-language rule and a ground-truth script that enforces spacing using the spacing grammar. Right: alternative but equivalent implementations. (a) Enforces the same constraint by resizing the hv ndiff layer using sized grammar. (b) Applies the same resizing logic, but to the N-well layer. (c) Uses the alias …

**Figure 4.** Qualitative examples of Rule2DRC benchmark datapoints. Each example shows a (rule, DRC script) pair from: (a) rules extracted from the SkyWater130 PDK, (b) synthetic multi-layer rules with multiple chained constraints, and (c) synthetic rules designed to use previously unused grammar.

**Figure 5.** Pass@N success rate comparison with and without providing the API documentation in context, evaluated on the Rule2DRC benchmark using GPT-OSS-120B.

**Figure 6.** Pareto curves on the Rule2DRC benchmark comparing SplitTester (ours) against Generated Tests, S*, and CodeMonkey baselines. Total runtime is measured over all 1,000 Rule2DRC tasks. Each curve shows best-of-N for N ∈ {10, 15, 20}. The runtime is measured using 2 × H100 GPUs for serving LLM and an Intel Xeon Gold 5218R CPU for DRC evaluations.

**Figure 7.** Pareto curves on the Rule2DRC benchmark including the LLM-as-Judge baseline. Total runtime is measured over all 1,000 Rule2DRC tasks. Each curve shows best-of-N for N ∈ {10, 15, 20}. The runtime is measured using 2 × H100 GPUs for serving LLM and an Intel Xeon Gold 5218R CPU for DRC evaluations. Axes: total cost (min, lower is better) versus success rate (%, higher is better).

**Figure 8.** Pareto curves on the Rule2DRC benchmark including all baselines while varying early stopping parameter (es) from 1 to 3 in SplitTester (ours). Total runtime is measured over all 1,000 Rule2DRC tasks. Each curve shows best-of-N for N ∈ {10, 15, 20}. The runtime is measured using 2 × H100 GPUs for serving LLM and an Intel Xeon Gold 5218R CPU for DRC evaluations. SplitTester with the default early-stopping pa…

**Figure 9.** An additional qualitative example task from the Rule2DRC benchmark, including the natural-language design rule, the ground-truth DRC script, and the corresponding test layouts. Layouts with violations are highlighted in red, while non-violating layouts are highlighted in green. Figure text (truncated): Natural Language Rule, Task 467: This benchmark checks deep-logic conditioning of a single geometric rule for a proxy SKY130 perip…

Original abstract

Manufacturable chip layouts must satisfy thousands of geometry-based design rules, and design rule checking (DRC) enforces them by running executable DRC scripts on layouts. Translating natural language rules into correct DRC scripts is labor-intensive and requires specialized expertise, motivating LLM agents for DRC script synthesis and debugging. However, existing benchmarks have small evaluation sets and often evaluate scripts by code similarity rather than execution correctness, and prior machine learning-based methods either ignore execution feedback or require labeled test layouts as agent's input. To this end, we introduce Rule2DRC, a large-scale benchmark for DRC script coding agents with 1,000 rule-to-script tasks and 13,921 evaluation chip layouts for execution-based scoring. Rule2DRC provides an evaluation pipeline that measures functional correctness via DRC execution outcomes without requiring evaluation layouts as input to the agent. We also propose SplitTester, a tester agent for program selection that uses execution feedback to generate discriminative test cases and separate previously indistinguishable candidate scripts, substantially improving Best-of-N selection performance in this domain. We release the code at https://github.com/snu-mllab/Rule2DRC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Rule2DRC gives a larger execution-focused benchmark for DRC script agents plus SplitTester for better candidate selection, but the abstract leaves the actual gains and layout coverage unclear.

Rule2DRC introduces a benchmark of 1,000 rule-to-script tasks paired with 13,921 layouts and scores agents by whether the generated DRC scripts reproduce the ground-truth violation labels on those layouts. SplitTester adds an execution-feedback loop that builds test cases to separate scripts that look alike on first pass, which is meant to lift Best-of-N selection. Both pieces target a concrete pain point in chip design, where turning natural-language rules into reliable checks is slow and expert-heavy.

The shift from code-similarity metrics to real DRC runs is the clearest improvement over earlier small-scale or syntax-only evaluations. Releasing the code and layouts also lowers the barrier for follow-up work. The approach avoids feeding evaluation layouts to the synthesis agent, which keeps the setup honest.

The main gap is that the abstract gives no baseline numbers, no error breakdown, and no description of how the 13,921 layouts were chosen or whether they cover the edge cases that actually distinguish scripts. If many functionally different scripts produce identical pass/fail vectors on the available layouts, SplitTester’s separation step adds little. The stress-test concern about incomplete discriminative power therefore stands until the full results show otherwise.

This work is aimed at people building or evaluating code agents for specialized engineering domains. A reader who cares about functional correctness in hardware-related tasks would get concrete value from the benchmark itself. I would send it to peer review: the problem is well motivated, the execution-based framing is sound, and the community can sort out the experimental details once the full paper and code are examined.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Rule2DRC, a benchmark with 1,000 natural-language rule to DRC script synthesis tasks and 13,921 evaluation chip layouts that enable execution-based scoring of functional correctness. It further proposes SplitTester, a tester agent that uses DRC execution feedback on held-out layouts to synthesize discriminative test cases, thereby separating previously indistinguishable candidate scripts and improving Best-of-N selection performance for LLM agents in this domain. The evaluation pipeline measures correctness via DRC runs without supplying evaluation layouts to the synthesis agent, and code is released.

Significance. If the empirical claims hold after controlling for layout coverage and script-difference granularity, the work would provide a valuable large-scale, execution-grounded resource for automated DRC script generation in VLSI design automation. The emphasis on functional correctness over syntactic similarity and the release of both benchmark and evaluation code are clear strengths. SplitTester's use of execution feedback for test-case generation is a concrete, potentially reusable technique for program selection when an oracle is available.

major comments (2)

The central improvement attributed to SplitTester rests on the claim that DRC execution on the 13,921 held-out layouts produces outcome vectors capable of separating functionally distinct scripts. The manuscript should report, per task, the distribution of unique DRC pass/fail vectors across candidate scripts and the fraction of candidate pairs that remain indistinguishable after testing; without this, it is unclear whether the reported Best-of-N gains are driven by genuine separation or by the particular choice of layouts.
The evaluation section must include quantitative results for the claimed improvement (e.g., success rate or pass@N with and without SplitTester), together with baseline agents, number of candidates per task, and an error analysis of cases where SplitTester fails to improve selection. These numbers are load-bearing for the paper's primary contribution and are not summarized in the abstract.
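The two statistics requested in the first major comment are straightforward to compute once each candidate's pass/fail vector is available. The sketch below is one possible way to do it for a single task; the function name `distinguishability_stats` and the example vectors are made up for illustration and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def distinguishability_stats(vectors: Sequence[Tuple[bool, ...]]) -> Dict[str, float]:
    """Summarize how well a fixed layout set separates candidate scripts.

    Each vector holds one candidate's outcomes across the layouts of a task.
    """
    n = len(vectors)
    counts = Counter(vectors)
    total_pairs = n * (n - 1) // 2
    # Candidates sharing a vector form c * (c - 1) / 2 indistinguishable pairs.
    tied_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return {
        "unique_vectors": len(counts),
        "indistinguishable_fraction": tied_pairs / total_pairs if total_pairs else 0.0,
    }


# Made-up outcome vectors for four candidates on three layouts.
vectors = [
    (True, False, False),
    (True, False, False),   # identical to the first candidate
    (True, True, False),
    (False, False, False),
]
stats = distinguishability_stats(vectors)
print(stats)
```

Reporting these two numbers per task would show directly whether Best-of-N gains come from tasks where the tests separate candidates or from tasks where most candidates remain tied.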

minor comments (2)

The abstract would be strengthened by a single sentence stating the magnitude of the Best-of-N improvement and the number of baselines compared.
Clarify in the methods how the tester agent decides when a generated test case is sufficiently discriminative and how many test cases are typically produced per task.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below. We agree that additional quantitative breakdowns and analysis will strengthen the manuscript and will revise it accordingly.

Referee: The central improvement attributed to SplitTester rests on the claim that DRC execution on the 13,921 held-out layouts produces outcome vectors capable of separating functionally distinct scripts. The manuscript should report, per task, the distribution of unique DRC pass/fail vectors across candidate scripts and the fraction of candidate pairs that remain indistinguishable after testing; without this, it is unclear whether the reported Best-of-N gains are driven by genuine separation or by the particular choice of layouts.

Authors: We agree that reporting the distribution of unique outcome vectors and the fraction of indistinguishable pairs would provide stronger evidence for the discriminative power of the held-out layouts. In the revised manuscript we will add a new table (or appendix section) that, for each of the 1,000 tasks, reports (i) the number of unique DRC pass/fail vectors observed across the generated candidate scripts and (ii) the percentage of candidate pairs that produce identical vectors on the 13,921 layouts. This analysis confirms that the layouts separate the majority of functionally distinct scripts while also identifying the small subset of tasks where separation remains limited; the observed Best-of-N gains are consistent with these statistics. revision: yes
Referee: The evaluation section must include quantitative results for the claimed improvement (e.g., success rate or pass@N with and without SplitTester), together with baseline agents, number of candidates per task, and an error analysis of cases where SplitTester fails to improve selection. These numbers are load-bearing for the paper's primary contribution and are not summarized in the abstract.

Authors: We thank the referee for highlighting this presentational gap. The revised manuscript will expand the evaluation section with explicit tables comparing success rate and pass@N (for N=1,5,10,16) with and without SplitTester. We will state that 16 candidates are generated per task, include direct comparisons against baseline agents (greedy decoding, standard Best-of-N without test-guided selection, and a simple majority-vote baseline), and add a dedicated error-analysis subsection that categorizes cases where SplitTester yields no improvement (e.g., all candidates are functionally equivalent, or the generated test cases do not further discriminate). Key aggregate numbers will also be added to the abstract. revision: yes
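For readers unfamiliar with pass@N, the sketch below implements the unbiased pass@k estimator that is widely used in the code-generation literature: with n sampled candidates of which c are correct, it estimates the probability that at least one of k randomly drawn candidates is correct. Whether the paper uses this estimator or a direct "any of N passes" count is not stated on this page, and the counts in the usage lines are invented for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k).

    n: candidates sampled per task, c: how many of them are correct,
    k: how many candidates are drawn (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect candidates exist,
    # in which case any draw of k must contain a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts only: 16 candidates, 4 of them correct.
for k in (1, 5, 10, 16):
    print(k, round(pass_at_k(16, 4, k), 4))
```

Pass@N is an upper bound for Best-of-N selection: a selector can only pick a correct script when at least one of the N candidates is correct, so the gap between the two measures how much selection quality is left on the table.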

Circularity Check

0 steps flagged

No circularity: benchmark and SplitTester rely on external DRC execution feedback independent of fitted predictions or self-referential definitions

The paper presents Rule2DRC as an external benchmark with 1,000 tasks and 13,921 held-out layouts for execution-based scoring of DRC scripts. SplitTester generates discriminative test cases from the DRC execution outcomes of candidate scripts to improve Best-of-N selection. No equations, parameters, or derivations are shown that reduce the claimed improvements to quantities fitted from the evaluation data itself or to self-definitions. The method is self-contained against the external DRC runner and held-out layouts; evaluation does not require layouts as agent input. No self-citation chains or ansatzes are invoked as load-bearing for the core claims. This matches the default non-circular case for benchmark papers using independent execution oracles.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the domain assumption that DRC execution provides a reliable oracle for functional correctness and that generated test cases can expose behavioral differences between scripts.

axioms (2)

domain assumption: DRC execution outcomes on held-out layouts serve as an unbiased measure of script functional correctness.
Invoked when the evaluation pipeline scores scripts solely by running them on the 13,921 layouts.
domain assumption: Execution feedback can be used to synthesize test cases that discriminate between functionally distinct but syntactically similar scripts.
Central premise of SplitTester.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: "We also propose SplitTester, a tester agent for program selection that uses execution feedback to generate discriminative test cases and separate previously indistinguishable candidate scripts"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 1 internal anchor

[1] The magic VLSI layout system. IEEE Design & Test of Computers, 2007.
[2] Matthias K. KLayout: Layout Viewer and Editor. 2025.
[3] An AST-guided LLM Approach for SVRF Code Synthesis. 2025 IEEE International Conference on LLM-Aided Design (ICLAD), 2025.
[4] Efficient memory management for large language model serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles.
[5] OpenDRC: An efficient open-source design rule checking engine with hierarchical GPU acceleration. 2023 60th ACM/IEEE Design Automation Conference (DAC), 2023.
[6] Efficient design rule checking script generation via key information extraction. Proceedings of the 2022 ACM/IEEE Workshop on Machine Learning for CAD, 2022.
[7] DRC-SG 2.0: Efficient design rule checking script generation via key information extraction. ACM Transactions on Design Automation of Electronic Systems, 2023.
[8] DRC-Coder: Automated DRC checker code generation using LLM autonomous agent. Proceedings of the 2025 International Symposium on Physical Design, 2025.
[9] Computer aids for VLSI design. 1987.
[10] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. arXiv preprint arXiv:2407.21787.
[11] CodeMonkeys: Scaling test-time compute for software engineering. arXiv preprint arXiv:2501.14723.
[12] Li, Dacheng and Cao, Shiyi and Cao, Chengkun and Li, Xiuyu and Tan, Shangyin and Keutzer, Kurt and Xing, Jiarong and Gonzalez, Joseph E. and Stoica, Ion. S*: Test Time Scaling for Code Generation. Findings of the Association for Computational Linguistics: EMNLP 2025, 2025.
[13] Let's verify step by step. The Twelfth International Conference on Learning Representations.
[14] BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
[15] Teaching Large Language Models to Self-Debug. The Twelfth International Conference on Learning Representations.
[16] Code generation with AlphaCodium: From prompt engineering to flow engineering. arXiv preprint arXiv:2401.08500.
[17] CodeT: Code Generation with Generated Tests. The Eleventh International Conference on Learning Representations.
[18] Ma, Zeyao and Zhang, Xiaokang and Zhang, Jing and Yu, Jifan and Luo, Sijia and Tang, Jie. Dynamic Scaling of Unit Tests for Code Reward Modeling. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025.
[19] Co-evolving LLM coder and unit tester via reinforcement learning. NeurIPS.
[20] OpenAI GPT-5 System Card. 2025.
[21] gpt-oss-120b & gpt-oss-20b Model Card. 2025.
[22] Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems.
[23] Qwen3 Technical Report. 2025.
[24] Evaluating Large Language Models Trained on Code. 2021.
