Bridging the Last Mile of Circuit Design: PostEDA-Bench, a Hierarchical Benchmark for PPA Convergence and DRC Fixing

Caiwen Ding; Jinwei Tang; Nuo Xu; Pengju Liu; Yu Cao

arXiv: 2605.06936 · v2 · pith:AQ7MR45Hnew · submitted 2026-05-07 · 💻 cs.AR · cs.AI · cs.MA

Pengju Liu, Nuo Xu, Jinwei Tang, Yu Cao, Caiwen Ding

Pith reviewed 2026-05-25 06:48 UTC · model grok-4.3

classification 💻 cs.AR · cs.AI · cs.MA

keywords LLM agents · Electronic Design Automation · DRC fixing · PPA convergence · benchmark · post-EDA · circuit design

The pith

PostEDA-Bench shows that the best LLM agents reach only 36.66% success on practical DRC reasoning and 20.00% on multi-objective PPA tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PostEDA-Bench as a hierarchical benchmark to test LLM-based agents on fixing residual design rule violations and meeting power-performance-area goals after initial EDA tool runs. It organizes 145 tasks into four tiers that range from synthetic single-issue fixes to realistic multi-objective reasoning problems, all paired with automatic verification through EDA toolchains. Tests on eight LLMs under several agent setups find acceptable results on the simpler DRC-Essential and PPA-Mono categories but clear drops on the more representative DRC-Reasoning and PPA-Multi categories. Vision inputs raise DRC performance while the main shortfall in multi-objective cases traces to difficulty with trade-off reasoning rather than missing knob knowledge.

Core claim

PostEDA-Bench is a hierarchical benchmark with 145 tasks across DRC-Essential, DRC-Reasoning, PPA-Mono, and PPA-Multi, supported by EDA toolchains and machine-checkable evaluation. Across eight commercial and open-source LLMs and multiple agent scaffolds, agents handle synthetic DRC-Essential and single-objective PPA-Mono reasonably well but reach only 36.66% success on DRC-Reasoning and 20.00% on PPA-Multi; vision augmentation consistently helps DRC tasks, and trade-off reasoning is the dominant bottleneck in PPA-Multi.

What carries the argument

PostEDA-Bench, the hierarchical benchmark of 145 tasks in four escalating categories with machine-checkable outcomes for post-EDA agent evaluation.

If this is right

Vision inputs raise success rates on DRC fixing tasks.
Trade-off reasoning forms the primary limitation for agents on multi-objective PPA convergence.
Agents remain limited on the realistic complexities that appear in actual post-EDA work.
Hierarchical task structures are required to surface capability gaps that flat or synthetic tests conceal.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Addressing the identified reasoning bottlenecks could allow agents to reduce the manual effort required during chip sign-off.
The benchmark structure could support creation of targeted training data for improving agent performance on EDA-specific trade-offs.
Results on this benchmark may correlate with an agent's ability to contribute to larger automated portions of the overall design flow.

Load-bearing premise

The 145 tasks and chosen agent scaffolds are representative of real post-EDA workflows and machine-checkable evaluation accurately captures practical success without missing important failure modes.

What would settle it

A result in which the strongest agent scaffold exceeds 50% success on both the DRC-Reasoning and PPA-Multi task sets under the same machine-checkable protocol would indicate the reported performance drop is not general.

Figures

Figures reproduced from arXiv: 2605.06936 by Caiwen Ding, Jinwei Tang, Nuo Xu, Pengju Liu, Yu Cao.

**Figure 1.** Overview of PostEDA-Bench composition.

… EDA artifacts rather than source-code recall. For reproducibility and release, both DRC-Bench and PPA-Bench ship with inputs, prompts, metadata, pinned tool setup, and evaluation drivers; final labels are machine-checkable through deterministic EDA tools and report parsers. Detailed release contents, tool versions, regeneration requirements, and limitations are docume…

**Figure 2.** Overview of PostEDA-Bench construction process.

… DRC-clean design we inject 5–15 L1-style violations across rule families, mainly width/via (e.g., M2.W.1, M3.W.1). Sites are co-located so edits interact; each violation is single-step in isolation, but agents must order edits and re-query DRC since fixes can trigger or eliminate nearby ones. 3.1.3 DRC-Reasoning. DRC-Reasoning targets practical residual …
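
The interaction noted in the Figure 2 excerpt, where one fix can trigger or eliminate a neighboring violation, can be illustrated with a toy fix-and-recheck loop. This is a minimal sketch under stated assumptions, not the benchmark's tooling: the one-dimensional wire model, the rule constants `MIN_WIDTH` and `MIN_SPACE`, and the helpers `run_drc`, `fix_one`, and `repair` are all hypothetical names introduced here for illustration.

```python
# Toy 1-D layout: each wire is [left, right] on a single track.
# MIN_WIDTH and MIN_SPACE are illustrative values, not real PDK rules.
MIN_WIDTH = 3
MIN_SPACE = 2


def run_drc(wires):
    """Return a list of (rule, index) violations for the toy layout."""
    violations = []
    for i, (left, right) in enumerate(wires):
        if right - left < MIN_WIDTH:
            violations.append(("width", i))
    for i in range(len(wires) - 1):
        if wires[i + 1][0] - wires[i][1] < MIN_SPACE:
            violations.append(("space", i))
    return violations


def fix_one(wires, violation):
    """Apply a single local edit that resolves one violation in isolation."""
    rule, i = violation
    if rule == "width":
        # Widen the wire to the right; this may eat into the next gap.
        wires[i][1] = wires[i][0] + MIN_WIDTH
    else:
        # Shift the right-hand neighbor until the gap is legal again.
        shift = MIN_SPACE - (wires[i + 1][0] - wires[i][1])
        wires[i + 1][0] += shift
        wires[i + 1][1] += shift


def repair(wires, max_iters=20):
    """Fix one violation at a time and re-query DRC after every edit."""
    history = []
    for _ in range(max_iters):
        violations = run_drc(wires)
        history.append(len(violations))
        if not violations:
            break
        fix_one(wires, violations[0])
    return history


wires = [[0, 2], [4, 6], [8, 11]]
history = repair(wires)
print(history)  # [2, 2, 2, 1, 0]
print(wires)    # [[0, 3], [5, 8], [10, 13]]
```

In this toy case the violation count stays at 2 across the first three DRC queries, because each width fix creates a new spacing violation. An agent that skipped the re-query after each edit would not see the violations its own edits introduced.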

**Figure 3.** Example prompts used to drive the agent in DRC-Bench and PPA-Bench.

**Figure 4.** Two-objective Pareto fronts for representative PPA-Multi designs. Gray dots denote non-Pareto solutions; blue stars indicate Pareto-optimal points. The combined panels summarize representative period–power and period–area trade-off fronts.

… let $A = \{i : M_i^{\mathrm{init}} > M_i^{\mathrm{tgt}}\}$ be the violated metrics to improve and $B = \{i : M_i^{\mathrm{init}} \le M_i^{\mathrm{tgt}}\}$ be constraints that must not regress; the per-metric score is $s_i =$ …
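
How Pareto-optimal points are separated from non-Pareto ones in a two-objective setting can be sketched as follows. This is an illustrative example only, assuming both objectives (for instance clock period and power) are minimized; the function name `pareto_front` and the sample points are made up here and are not data from the paper.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Both coordinates are minimized. After sorting by the first objective,
    a point is Pareto-optimal only if it strictly improves the best value
    of the second objective seen so far.
    """
    front = []
    best_second = float("inf")
    for point in sorted(set(points)):
        if point[1] < best_second:
            front.append(point)
            best_second = point[1]
    return front


# Hypothetical (period_ns, power_mw) results from several flow runs.
candidates = [
    (1.0, 9.0),
    (1.2, 7.0),
    (1.5, 7.5),  # dominated by (1.2, 7.0)
    (2.0, 4.0),
    (2.0, 5.0),  # dominated by (2.0, 4.0)
    (2.5, 4.0),  # dominated by (2.0, 4.0)
]
front = pareto_front(candidates)
print(front)  # [(1.0, 9.0), (1.2, 7.0), (2.0, 4.0)]
```

The sort-then-sweep form runs in O(n log n) and is enough for two objectives; with three or more objectives a pairwise dominance check would be needed instead.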

**Figure 5.** Effect of vision modality on DRC-Bench. SR and VRR are combined in each subfigure for four backbones under text-only and text+vision settings.

… overall PPA-Multi results. ORFS+Qwen-122B still leads PPA-Multi overall (20.00/67.80 SR/NIS), so structured exploration helps cover the Pareto frontier, but it does not remove the L2 trade-off gap. 4.3 Effect of Vision Modality on DRC-Bench. We pair each text-only ba…

**Figure 6.** Effect of iteration cap and Reflexion on Gemma-4-31B-it. Left: DRC combines DRC-Essential and DRC-Reasoning. Right: PPA combines PPA-Mono and PPA-Multi. Colors encode metrics; solid lines use the left y-axis and dashed lines use the right y-axis.

**Figure 7.** Full PPA-Bench per-level performance. SR (top) and NIS (bottom) per construction-time level. Row 1 decomposes PPA-Mono into Performance / Power / Area; row 2 summarizes PPA-Mono per sub-dimension and reports PPA-Multi.

… knowledge: when a structured sampler is grafted on top of a competent space, multi-objective SR jumps to first place; when it is grafted on top of a wrong space, it amplifies the error rathe…

Original abstract

LLM-based agents are increasingly applied to the "last mile" of Electronic Design Automation (EDA): repairing residual sign-off Design Rule Check (DRC) violations and converging Power-Performance-Area (PPA) targets after tool runs. Existing EDA-LLM benchmarks, however, omit DRC fixing entirely and rely on flat hierarchies tied to a single toolchain. We introduce PostEDA-Bench, a hierarchical benchmark with 145 tasks across DRC-Essential, DRC-Reasoning, PPA-Mono, and PPA-Multi, supported by EDA toolchains with machine-checkable evaluation. Across eight commercial and open-source LLMs under multiple agent scaffolds, we find that agents handle synthetic DRC-Essential and single-objective PPA-Mono reasonably well but degrade sharply on the more practical DRC-Reasoning, where the best success rate is 36.66%, and PPA-Multi, where the best success rate is 20.00%; vision augmentation consistently enhances DRC-Bench; and trade-off reasoning, rather than knob knowledge, is the dominant PPA-Multi bottleneck.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

A private letter to a colleague

PostEDA-Bench adds hierarchical DRC and multi-PPA tasks with machine-checkable metrics that prior EDA-LLM work skipped, but the 145 tasks lack external validation against real sign-off flows.

The paper's core contribution is PostEDA-Bench: 145 tasks in four categories (DRC-Essential, DRC-Reasoning, PPA-Mono, PPA-Multi) that test LLM agents on post-tool sign-off. It evaluates eight models under several scaffolds using both commercial and open-source EDA toolchains, with automatic PPA and DRC checks. Agents do fine on the simpler synthetic cases but drop to 36.66% on DRC-Reasoning and 20.00% on PPA-Multi, with vision helping DRC and trade-off reasoning emerging as the main PPA-Multi limiter. That performance pattern is the clearest new signal here.

Earlier benchmarks omitted DRC fixing and stayed flat and single-toolchain, so the added hierarchy and categories fill an actual gap in the measurement toolkit. The machine-checkable setup is also a practical plus for reproducibility.

The main soft spot is representativeness. The stress-test note flags that the tasks may over-emphasize isolated patterns or synthetic violations rather than production tape-out cases, and nothing in the abstract or described results shows expert review or comparison to real logs that would confirm the tasks track industry workflows. If that assumption does not hold, the reported gaps become benchmark-specific rather than general. Machine-checkable success is clean but can miss downstream timing or multi-tool issues.

This work is for groups building or evaluating agents for chip design automation. Readers who need concrete numbers on where current agents break on post-EDA problems will find usable data points. The empirical framing and new task coverage are solid enough to justify sending it to referees, with the expectation that a revision would add validation details on task construction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces PostEDA-Bench, a hierarchical benchmark comprising 145 tasks across DRC-Essential, DRC-Reasoning, PPA-Mono, and PPA-Multi categories, supported by EDA toolchains and machine-checkable PPA/DRC metrics. It evaluates eight commercial and open-source LLMs under multiple agent scaffolds and reports that agents handle synthetic DRC-Essential and single-objective PPA-Mono tasks reasonably well but degrade sharply on DRC-Reasoning (best success rate 36.66%) and PPA-Multi (best success rate 20.00%), with vision augmentation aiding DRC tasks and trade-off reasoning identified as the dominant bottleneck for multi-objective PPA.

Significance. If the tasks are representative, the benchmark would offer a reproducible framework for assessing LLM agents on post-EDA sign-off, highlighting concrete limitations in complex reasoning and multi-objective optimization. The use of machine-checkable evaluation metrics is a clear strength that supports direct, falsifiable measurements against external toolchains.

major comments (1)

[Task Construction and Evaluation (implied in sections describing the four categories)] The central claim that agents degrade on 'more practical' DRC-Reasoning and PPA-Multi tasks (Abstract) depends on the 145 tasks faithfully representing real post-EDA workflows. No external validation—such as expert review of netlists/violation patterns or comparison against production tape-out logs—is described to confirm that the chosen tasks, DRC patterns, and multi-objective trade-offs match industry cases rather than synthetic artifacts. This assumption is load-bearing for interpreting the reported gaps (36.66% and 20.00%) as general agent limitations.

minor comments (2)

[Abstract] The abstract states results across 'eight commercial and open-source LLMs' but does not enumerate the specific models or scaffolds; listing them explicitly would improve reproducibility.
[Evaluation Protocol] Clarify the exact definition of 'success rate' (e.g., whether it aggregates across multiple agent runs, temperature settings, or requires strict machine-checkable closure without downstream timing interactions).
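
The ambiguity raised in the Evaluation Protocol comment can be made concrete with a small example. The sketch below is purely illustrative and does not reproduce the paper's metric: the function name `success_rates`, the three aggregation labels, and the outcome data are assumptions introduced here to show how the same raw results yield different "success rates" depending on the aggregation rule.

```python
def success_rates(runs):
    """Aggregate per-task run outcomes under three candidate definitions.

    runs maps a task id to a list of booleans, one per independent run.
    """
    total_runs = sum(len(outcomes) for outcomes in runs.values())
    num_tasks = len(runs)
    return {
        # Fraction of all individual runs that succeeded.
        "per_run": sum(sum(outcomes) for outcomes in runs.values()) / total_runs,
        # Fraction of tasks solved in at least one run.
        "any_run": sum(any(outcomes) for outcomes in runs.values()) / num_tasks,
        # Fraction of tasks solved in every run.
        "all_runs": sum(all(outcomes) for outcomes in runs.values()) / num_tasks,
    }


# Hypothetical outcomes: four tasks, three runs each.
runs = {
    "task_1": [True, True, True],
    "task_2": [True, False, False],
    "task_3": [False, False, False],
    "task_4": [False, True, True],
}
rates = success_rates(runs)
print(rates)  # {'per_run': 0.5, 'any_run': 0.75, 'all_runs': 0.25}
```

With identical raw outcomes the reported figure ranges from 25% to 75%, which is why the aggregation rule needs to be stated alongside any headline number.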

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below.

Referee: The central claim that agents degrade on 'more practical' DRC-Reasoning and PPA-Multi tasks (Abstract) depends on the 145 tasks faithfully representing real post-EDA workflows. No external validation—such as expert review of netlists/violation patterns or comparison against production tape-out logs—is described to confirm that the chosen tasks, DRC patterns, and multi-objective trade-offs match industry cases rather than synthetic artifacts. This assumption is load-bearing for interpreting the reported gaps (36.66% and 20.00%) as general agent limitations.

Authors: We agree that the manuscript provides no external validation (expert review or production-log comparison) to establish that the 145 tasks match real post-EDA workflows. The tasks were constructed hierarchically around standard EDA toolchains and machine-checkable metrics, with DRC-Essential tasks explicitly synthetic; the label “more practical” for DRC-Reasoning and PPA-Multi rests on internal design choices rather than external grounding. We will revise the abstract and introduction and add a limitations section to (1) state the synthetic construction explicitly, (2) remove or qualify the “more practical” phrasing, and (3) note that the observed gaps (36.66% and 20.00%) should be interpreted relative to this benchmark rather than as general agent limitations. These changes will be made in the next revision.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark with direct measurements

This is an empirical benchmark paper that defines 145 tasks across DRC-Essential, DRC-Reasoning, PPA-Mono, and PPA-Multi categories, then measures LLM agent success rates using machine-checkable PPA/DRC metrics from external EDA toolchains. No derivations, equations, fitted parameters, or first-principles predictions exist that could reduce to inputs by construction. No self-citations support load-bearing claims in any derivation chain. The central results are direct observations against independent tool flows, making the paper self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces a new benchmark without fitted parameters, new physical entities, or non-standard axioms; it relies on standard assumptions about EDA tool behavior and task representativeness.

axioms (1)

Domain assumption: EDA tool outputs can be automatically checked for DRC and PPA metrics in a machine-verifiable way.
Invoked when defining machine-checkable evaluation for the 145 tasks.
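
One way to picture a machine-checkable PPA verdict is to reuse the two metric sets described in the Figure 4 excerpt: metrics whose initial value violates the target, and metrics that already meet it and must not regress. The sketch below is an assumption-laden illustration rather than the paper's evaluator. It assumes every metric is lower-is-better and reads "must not regress" as "must still meet its target"; the names `partition_metrics` and `meets_targets` and all numbers are hypothetical. The paper's per-metric score is truncated in the source and is not reproduced here.

```python
def partition_metrics(initial, target):
    """Split metric names into (violated, satisfied) sets.

    violated: initial value is worse (larger) than the target.
    satisfied: initial value already meets the target.
    """
    violated = {name for name in target if initial[name] > target[name]}
    satisfied = set(target) - violated
    return violated, satisfied


def meets_targets(final, target):
    """Pass only if every metric meets its target after the agent's changes."""
    return all(final[name] <= target[name] for name in target)


# Hypothetical values parsed from tool reports.
initial = {"period_ns": 2.4, "power_mw": 10.0, "area_um2": 900.0}
target = {"period_ns": 2.0, "power_mw": 12.0, "area_um2": 900.0}

violated, satisfied = partition_metrics(initial, target)
print(sorted(violated))   # ['period_ns']
print(sorted(satisfied))  # ['area_um2', 'power_mw']

# Timing is fixed, but power regresses past its target: a trade-off failure.
regressed = {"period_ns": 1.9, "power_mw": 12.5, "area_um2": 880.0}
# Timing is fixed and the other metrics stay within their targets.
converged = {"period_ns": 1.95, "power_mw": 11.8, "area_um2": 900.0}

print(meets_targets(regressed, target))  # False
print(meets_targets(converged, target))  # True
```

The `regressed` case shows why a multi-objective check is stricter than a single-objective one: improving the violated metric alone is not enough if a previously satisfied constraint is pushed out of bounds.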

pith-pipeline@v0.9.0 · 5736 in / 1210 out tokens · 19139 ms · 2026-05-25T06:48:07.305866+00:00 · methodology

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

[1] X. Chen, M. Lin, N. Schärli, and D. Zhou. Teaching large language models to self-debug. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=KuPixIqPiq
[2] L. T. Clark, V. Vashishtha, L. Shifren, A. Gujja, S. Sinha, B. Cline, C. Ramamurthy, and G. Yeric. ASAP7: A 7-nm finFET predictive process design kit. Microelectronics Journal, 53:105–115, 2016. doi: 10.1016/j.mejo.2016.04.006
[3] D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. PaLM-E: An embodied multimodal language model. In Proceedings of the 40th International Confere...
[4] Y. Fu, Y. Zhang, Z. Yu, S. Li, Z. Ye, C. Li, C. Wan, and Y. C. Lin. GPT4AIGChip: Towards next-generation AI accelerator design automation via large language models. In IEEE/ACM International Conference on Computer Aided Design, ICCAD 2023, San Francisco, CA, USA, October 28 - Nov. 2, 2023, pages 1–9. IEEE, 2023. doi: 10.1109/ICCAD57390.2023.10323953. UR...
[5] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig. PAL: Program-aided language models. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023
[6] A. Ghose, A. B. Kahng, S. Kundu, and Z. Wang. ORFS-agent: Tool-using agents for chip design optimization, 2025. URL https://arxiv.org/abs/2506.08332
[7] S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VtmBAGCN7o
[8] Z. Jiang, Q. Zhang, C. Liu, L. Cheng, H. Li, and X. Li. IICPilot: An intelligent integrated circuit backend design framework using open EDA, 2024. URL https://arxiv.org/abs/2407.12576
[9] M. Liu, N. Pinckney, B. Khailany, and H. Ren. VerilogEval: Evaluating large language models for Verilog code generation. In 2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2023
[10] M. Liu, T.-D. Ene, R. Kirby, C. Cheng, N. Pinckney, R. Liang, J. Alben, H. Anand, S. Banerjee, I. Bayraktaroglu, B. Bhaskaran, B. Catanzaro, A. Chaudhuri, S. Clay, B. Dally, L. Dang, P. Deshpande, S. Dhodhi, S. Halepete, E. Hill, J. Hu, S. Jain, A. Jindal, B. Khailany, G. Kokai, K. Kunal, X. Li, C. Lind, H. Liu, S. Oberman, S. Omar, G. Pasandi, S. Pratty,...
[11] Y. Lu, S. Liu, Q. Zhang, and Z. Xie. RTLLM: An open-source benchmark for design RTL generation with large language model. In 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 722–727. IEEE, 2024.
[12] Y. Lu, H. I. Au, J. Zhang, J. Pan, Y. Wang, A. Li, J. Zhang, and Y. Chen. AutoEDA: Enabling EDA flow automation through microservice-based LLM agents, 2025. URL https://arxiv.org/abs/2508.01012
[13] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark. Self-Refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, volume 36, 2023
[14] OpenCores.org. OpenCores: Home, 2026. URL https://opencores.org/
[15] T. Schick, J. Dwivedi-Yu, R. Dessí, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA, 2023. Curran Associates Inc.
[16] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, volume 36, 2023
[17] Y. Wang, W. Ye, Y. He, Y. Chen, G. Qu, and A. Li. MCP4EDA: LLM-powered model context protocol RTL-to-GDSII automation with backend aware synthesis optimization, 2025. URL https://arxiv.org/abs/2507.19570
[18] H. Wu, Z. He, X. Zhang, X. Yao, S. Zheng, H. Zheng, and B. Yu. ChatEDA: A Large Language Model Powered Autonomous Agent for EDA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024
[19] H. Wu, H. Zheng, Z. He, and B. Yu. Divergent Thoughts toward One Goal: LLM-based Multi-Agent Collaboration System for Electronic Design Automation. In Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025
[20] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36, 2023
[21] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023
[22] K. Zhang, J. Li, G. Li, X. Shi, and Z. Jin. CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13643–13658, Bangkok, Tha...
[23] The engineer loads the violation in KLayout and inspects the offending geometry alongside its surrounding layers.
[24] The engineer manually constructs the shortest fix sequence using only the editing operations exposed to the agent (add_shape, change_shape, move_cell); inspection-only tool calls are not counted.
[25] The integer count of editing tool calls is recorded as the case's step count.
[26] After a one-week delay the engineer re-labels the same case without access to the prior label; cases with disagreeing labels are re-evaluated under the protocol until convergence, and any case that still admits competing minimal fixes is retained at the smaller step count. The protocol does not produce inter-annotator agreement scores; it is a within-annotator self-consistency procedure...
[27] Limitations: over the same tool surface as the ReAct PPA agent. At each tree node the agent samples PARALLEL_NODE candidate next actions in parallel, an LLM judge scores them, the highest-scored candidate is executed, and the resulting state becomes a new node. Un-executed candidates from every previously-expanded node remain in a global priority queue, so when the cu...
[28] Institutional review board (IRB) approvals or equivalent for research with human subjects. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ... Guidelines: The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects.
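
The step-count rule in entries [24] and [25], which counts editing tool calls and ignores inspection-only calls, can be sketched as a short filter over a tool-call trace. The three editing operation names come from the text; everything else is an assumption made for illustration, including the trace format, the function name `step_count`, and the inspection tool names `query_drc` and `get_layer_shapes`.

```python
# Editing operations named in the annotation protocol.
EDITING_TOOLS = {"add_shape", "change_shape", "move_cell"}


def step_count(trace):
    """Count only editing tool calls; inspection-only calls are ignored."""
    return sum(1 for call in trace if call["tool"] in EDITING_TOOLS)


# Hypothetical trace of one fix sequence. The inspection tool names
# are placeholders, not identifiers from the benchmark.
trace = [
    {"tool": "query_drc"},
    {"tool": "get_layer_shapes"},
    {"tool": "change_shape"},
    {"tool": "query_drc"},
    {"tool": "move_cell"},
    {"tool": "add_shape"},
    {"tool": "query_drc"},
]
print(step_count(trace))  # 3
```

Seven tool calls are made in this example, but only the three edits contribute to the recorded step count.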

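
The tree search described in entry [27] keeps un-executed candidates in a global priority queue. A toy version of that idea is sketched below. It is a hedged illustration, not the paper's scaffold: the source sentence is truncated, so the fallback rule used here (return to the best queued candidate when the current node has no candidates) is an assumption. The graph, the scores standing in for an LLM judge, the value of `PARALLEL_NODE`, and the function name `best_first_search` are likewise hypothetical, and "executing" a candidate is simplified to moving to the named state.

```python
import heapq
import itertools

PARALLEL_NODE = 3  # candidates sampled per node; the value is illustrative


def best_first_search(root, propose, judge, is_goal, max_steps=50):
    """Greedy expansion with a global queue of un-executed candidates."""
    queue = []  # min-heap on negated score, so the best candidate pops first
    tiebreak = itertools.count()
    visited = [root]
    state = root
    for _ in range(max_steps):
        if is_goal(state):
            return visited
        candidates = sorted(propose(state)[:PARALLEL_NODE], key=judge, reverse=True)
        if candidates:
            chosen, rest = candidates[0], candidates[1:]
        elif queue:
            # Assumed fallback: dead end, so resume from the best queued candidate.
            _, _, chosen = heapq.heappop(queue)
            rest = []
        else:
            return visited
        for candidate in rest:
            heapq.heappush(queue, (-judge(candidate), next(tiebreak), candidate))
        state = chosen
        visited.append(state)
    return visited


# Toy search space: the highest-scored first move leads to a dead end.
graph = {"root": ["a", "b"], "a": ["a1"], "a1": [], "b": ["goal"], "goal": []}
scores = {"a": 0.9, "b": 0.6, "a1": 0.5, "goal": 1.0}

visited = best_first_search(
    "root",
    propose=lambda state: graph[state],
    judge=lambda candidate: scores[candidate],
    is_goal=lambda state: state == "goal",
)
print(visited)  # ['root', 'a', 'a1', 'b', 'goal']
```

The search first follows the higher-scored branch through `a`, hits a dead end at `a1`, and then recovers by pulling the queued candidate `b` that was sampled at the root.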