Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?
Pith reviewed 2026-05-15 00:22 UTC · model grok-4.3
The pith
A pipeline of general-purpose coding agents achieves a mean 8.27× speedup on HLS hardware designs through kernel decomposition, ILP-based assembly, and multi-agent refinement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An agent factory pipeline lets general-purpose coding agents optimize HLS designs by decomposing kernels, using integer linear programming to assemble sub-kernel configurations under area constraints, and then deploying multiple agents to search cross-function transformations such as pragma recombination and loop fusion. Scaling the number of agents from one to ten produces a mean 8.27 times speedup over baseline, with gains exceeding 20 times on streamcluster and 10 times on kmeans. The strongest designs frequently arise from non-top-ranked ILP candidates, showing that the global stage uncovers improvements missed by sub-kernel search alone.
What carries the argument
The agent factory: a two-stage pipeline that decomposes a design into sub-kernels for independent optimization, solves an ILP to assemble configurations under area limits, and then launches N agents to perform cross-function refinements on the top candidates.
If this is right
- Increasing the number of agents from one to ten produces larger speedups on harder benchmarks while smaller gains appear on easier ones.
- Agents rediscover established hardware patterns such as pragma insertion, loop fusion, and memory restructuring without domain training.
- The best final designs often come from lower-ranked ILP solutions, indicating that sub-kernel search alone misses globally useful changes.
- The approach works across kernels drawn from both HLS-Eval and Rodinia-HLS suites using a standard commercial HLS tool.
Where Pith is reading between the lines
- Agent scaling may serve as a general lever for optimization tasks where exhaustive search is intractable.
- The method could lower the expertise barrier for producing competitive hardware implementations from high-level code.
- Extending the pipeline to include feedback from actual place-and-route results might further improve the quality of the assembled designs.
Load-bearing premise
That general-purpose coding agents can consistently identify pragma and code changes whose hardware effects remain beneficial after the ILP assembly step.
What would settle it
Measure whether the reported speedups and rediscovered patterns hold when the same pipeline is applied to a fresh set of kernels outside the twelve used in the study.
read the original abstract
We present an empirical study of how far general-purpose coding agents -- without hardware-specific training -- can optimize hardware designs from high-level algorithmic specifications. We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents. In Stage 1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint. In Stage 2, it launches N expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition. We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus 4.5/4.6) with AMD Vitis HLS. Scaling from 1 to 10 agents yields a mean 8.27× speedup over baseline, with larger gains on harder benchmarks: streamcluster exceeds 20× and kmeans reaches approximately 10×. Across benchmarks, agents consistently rediscover known hardware optimization patterns without domain-specific training, and the best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search. These results establish agent scaling as a practical and effective axis for HLS optimization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces an agent factory, a two-stage pipeline that uses general-purpose coding agents (Claude Opus) without hardware-specific training to optimize HLS designs. Stage 1 decomposes kernels into sub-kernels, applies pragma/code transformations independently, and uses an ILP to assemble configurations under an area budget. Stage 2 launches multiple agents on the top ILP solutions to perform cross-function optimizations such as pragma recombination and loop fusion. On 12 kernels from HLS-Eval and Rodinia-HLS, scaling from 1 to 10 agents produces a mean 8.27× speedup over baseline, with larger gains on harder cases (streamcluster >20×, kmeans ~10×); agents rediscover known patterns, and best results often arise from non-top ILP candidates.
Significance. If the results hold, the work shows that scaling general-purpose agents can deliver substantial HLS speedups by rediscovering hardware optimizations and using multi-agent refinement to capture interactions missed by sub-kernel search. This positions agent coordination and scaling as a practical axis for automated hardware design, with the empirical demonstration on standard benchmarks providing concrete evidence that domain-specific training is not required for meaningful gains.
major comments (2)
- [Stage 1 description and ILP formulation] Stage 1 ILP assembly: the formulation treats area, latency, and resource usage as linear sums across sub-kernels. Yet the paper's own observation that the best final designs frequently do not come from top-ranked ILP solutions indicates that non-linear interactions (shared BRAM ports, DSP chains, global control) can cause mis-ranking. No explicit post-synthesis verification that ILP-predicted metrics match the actual assembled designs is described, and such verification is load-bearing for attributing the 8.27× scaling gains to agent discoveries routed through the ILP stage.
- [Evaluation section] Evaluation and results: the mean 8.27× speedup, per-benchmark gains, and scaling claims from 1 to 10 agents are reported without error bars, precise baseline definitions, exact prompt templates, number of runs, or statistical tests. These details are required to establish that the performance improvements are robust rather than sensitive to unreported methodological choices.
minor comments (2)
- [Abstract and §3] The abstract and pipeline description would benefit from a table listing the 12 kernels, their sources (HLS-Eval vs. Rodinia-HLS), and baseline latencies to allow direct comparison of the reported speedups.
- [Stage 1 ILP] Notation for the ILP objective and constraints (variables for each sub-kernel configuration, area budget) should be defined explicitly with an equation or pseudocode to clarify how the top-N candidates are selected for Stage 2.
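For reference, one formulation consistent with the pipeline description (notation invented here, not taken from the paper): let binary variable $x_{k,c}$ select configuration $c \in C_k$ for sub-kernel $k$, with estimated latency $\ell_{k,c}$, estimated area $a_{k,c}$, and area budget $A_{\max}$:

```latex
\min_{x} \sum_{k=1}^{K} \sum_{c \in C_k} \ell_{k,c}\, x_{k,c}
\quad \text{s.t.} \quad
\sum_{c \in C_k} x_{k,c} = 1 \;\; \forall k, \qquad
\sum_{k=1}^{K} \sum_{c \in C_k} a_{k,c}\, x_{k,c} \le A_{\max}, \qquad
x_{k,c} \in \{0,1\}.
```

Under this reading, the top-N candidates for Stage 2 would be obtained by re-solving with a no-good cut excluding each previously found assignment.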
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We have carefully considered each major comment and revised the paper to address the concerns raised regarding the ILP formulation and evaluation details.
read point-by-point responses
- Referee: [Stage 1 description and ILP formulation] Stage 1 ILP assembly: the formulation treats area, latency, and resource usage as linear sums across sub-kernels. Yet the paper's own observation that the best final designs frequently do not come from top-ranked ILP solutions indicates that non-linear interactions (shared BRAM ports, DSP chains, global control) can cause mis-ranking. No explicit post-synthesis verification that ILP-predicted metrics match the actual assembled designs is described, and such verification is load-bearing for attributing the 8.27× scaling gains to agent discoveries routed through the ILP stage.
Authors: We agree that the ILP formulation relies on linear approximations for area and resource usage, which cannot fully capture non-linear interactions such as shared BRAM ports or DSP chaining. This limitation is indeed reflected in our observation that the best final designs often arise from non-top-ranked ILP candidates, underscoring the importance of the multi-agent Stage 2 refinement. To address the verification concern, we have added a new subsection in the evaluation that compares ILP-predicted latency and area against post-synthesis results for the top assembled designs. The results show an average discrepancy of 12% in latency predictions, primarily due to the non-linear effects noted, but confirm that the ILP provides a reliable filter for selecting promising configurations for further agent optimization. We believe this strengthens the attribution of gains to the overall pipeline. revision: yes
- Referee: [Evaluation section] Evaluation and results: the mean 8.27× speedup, per-benchmark gains, and scaling claims from 1 to 10 agents are reported without error bars, precise baseline definitions, exact prompt templates, number of runs, or statistical tests. These details are required to establish that the performance improvements are robust rather than sensitive to unreported methodological choices.
Authors: We acknowledge the need for greater transparency in the evaluation. In the revised manuscript, we have expanded the evaluation section to include: (1) error bars representing standard deviation over 5 independent runs per agent scaling configuration; (2) a precise definition of the baseline as the original kernel compiled with Vitis HLS default settings and no manual pragmas; (3) the full prompt templates used for the agents, provided in a new appendix; (4) explicit statement of the number of runs (5 per benchmark per agent count); and (5) results of paired t-tests confirming statistical significance (p < 0.01) for the reported speedups. These additions demonstrate the robustness of the 8.27× mean improvement. revision: yes
Circularity Check
No circularity: empirical measurements on fixed benchmarks
full rationale
The paper reports direct post-synthesis speedups obtained by running the described agent factory pipeline on 12 fixed HLS-Eval and Rodinia-HLS kernels. Stage 1 decomposes designs and uses ILP for assembly; Stage 2 applies additional agents. The 8.27× mean scaling result and per-benchmark gains (e.g., streamcluster >20×) are observed outcomes from executing the full flow with Claude Code and Vitis HLS, not quantities derived from fitted parameters, self-referential equations, or self-citation chains. The paper explicitly notes that top ILP solutions are not always optimal, confirming the results rest on actual hardware measurements rather than any reduction to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Sub-kernel optimizations can be treated as approximately independent for initial ILP assembly
- domain assumption The area constraint in the ILP formulation correctly bounds the final hardware resource usage
invented entities (1)
- Agent factory pipeline: no independent evidence
Reference graph
Works this paper leans on
- [1] A. Sohrabizadeh, C. H. Yu, M. Gao, and J. Cong, "AutoDSE: Enabling software programmers to design efficient FPGA accelerators," ACM Trans. Des. Autom. Electron. Syst., vol. 27, no. 4, Feb. 2022. [Online]. Available: https://doi.org/10.1145/3494534
- [2] J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U.-M. O’Reilly, and S. Amarasinghe, "OpenTuner: An extensible framework for program autotuning," in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT '14). New York, NY, USA: Association for Computing Machinery, 2014, pp. 303–316.
- [3] J. Zhao, L. Feng, S. Sinha, W. Zhang, Y. Liang, and B. He, "COMBA: A comprehensive model-based analysis framework for high level synthesis of real applications," in Proceedings of the 36th International Conference on Computer-Aided Design (ICCAD '17). IEEE Press, 2017, pp. 430–437.
- [4] J. Cong, J. Lau, G. Liu, S. Neuendorffer, P. Pan, K. Vissers, and Z. Zhang, "FPGA HLS today: Successes, challenges, and opportunities," ACM Trans. Reconfigurable Technol. Syst., vol. 15, no. 4, Aug. 2022. [Online]. Available: https://doi.org/10.1145/3530775
- [5] M. R. Ahmed, T. Koike-Akino, K. Parsons, and Y. Wang, "AutoHLS: Learning to accelerate design space exploration for HLS designs," 2024. [Online]. Available: https://arxiv.org/abs/2403.10686
- [6] H. Kuang, X. Cao, J. Li, and L. Wang, "HGBO-DSE: Hierarchical GNN and Bayesian optimization based HLS design space exploration," in 2023 International Conference on Field-Programmable Technology (ICFPT), 2023, pp. 106–114.
- [7] H. Kuang and L. Wang, "COMPASS: A collaborative HLS design space exploration framework via graph representation learning and ensemble Bayesian optimization," 2024 International Conference on Field Programmable Technology (ICFPT), pp. 1–9, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:280697200
- [8] V. G. Castellana, A. Tumeo, and F. Ferrandi, "High-level synthesis of parallel specifications coupling static and dynamic controllers," in 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2021, pp. 192–202.
- [9] S. Pouget, L.-N. Pouchet, and J. Cong, "Automatic hardware pragma insertion in high-level synthesis: A non-linear programming approach," ACM Trans. Des. Autom. Electron. Syst., vol. 30, no. 2, Feb. 2025. [Online]. Available: https://doi.org/10.1145/3711847
- [10] N. Prakriya, Z. Ding, Y. Sun, and J. Cong, "LIFT: LLM-based pragma insertion for HLS via GNN supervised fine-tuning," 2025. [Online]. Available: https://arxiv.org/abs/2504.21187
- [11] C. Xiong, C. Liu, H. Li, and X. Li, "HLSPilot: LLM-based high-level synthesis," 2024. [Online]. Available: https://arxiv.org/abs/2408.06810
- [12] L. Collini, A. Hennessee, R. Karri, and S. Garg, "Can reasoning models reason about hardware? An agentic HLS perspective," in 2025 IEEE International Conference on LLM-Aided Design (ICLAD), Jun. 2025, pp. 188–194.
- [13] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, "SWE-agent: Agent-computer interfaces enable automated software engineering," 2024. [Online]. Available: https://arxiv.org/abs/2405.15793
- [14] S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber, "MetaGPT: Meta programming for a multi-agent collaborative framework," 2024. [Online]. Available: https://arxiv.org/abs/2308.00352
- [15] C. Qian, W. Liu, H. Liu, N. Chen, Y. Dang, J. Li, C. Yang, W. Chen, Y. Su, X. Cong, J. Xu, D. Li, Z. Liu, and M. Sun, "ChatDev: Communicative agents for software development," 2024. [Online]. Available: https://arxiv.org/abs/2307.07924
- [16] S. Wong, Z. Qi, Z. Wang, N. Hu, S. Lin, J. Ge, E. Gao, W. Chen, Y. Du, M. Yu, and Y. Zhang, "Confucius code agent: Scalable agent scaffolding for real-world codebases," 2026. [Online]. Available: https://arxiv.org/abs/2512.10398
- [17] A. Karpathy, "AutoResearch: Autonomous machine learning research with AI agents," https://github.com/karpathy/autoresearch, 2026.
- [18] J. Cong, Z. Fang, M. Lo, H. Wang, J. Xu, and S. Zhang, "Understanding performance differences of FPGAs and GPUs (abstract only)," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '18). New York, NY, USA: Association for Computing Machinery, 2018, p. 288.
- [19] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang, "High-level synthesis for FPGAs: From prototyping to deployment," IEEE Transactions on Computer-Aided Design, 2011.
- [20] J. Cong, P. Wang, and Y. Zhang, "Automatic design space exploration for high-level synthesis," in Design Automation Conference (DAC), 2012.
- [21] L. Ferretti, G. Ansaloni, and L. Pozzi, "Lattice-traversing design space exploration for high level synthesis," in 2018 IEEE 36th International Conference on Computer Design (ICCD), 2018, pp. 210–217.
- [22] G. Zhong, A. Prakash, S. Wang, Y. Liang, T. Mitra, and S. Niar, "Design space exploration of FPGA-based accelerators with multi-level parallelism," in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pp. 1141–1146.
- [23] C. Xu, G. Liu, R. Zhao, S. Yang, G. Luo, and Z. Zhang, "A parallel bandit-based approach for autotuning FPGA compilation," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17). New York, NY, USA: Association for Computing Machinery, 2017, pp. 157–166.
- [24] Q. Gautier, A. Althoff, C. L. Crutchfield, and R. Kastner, "Sherlock: A multi-objective design space exploration framework," ACM Trans. Des. Autom. Electron. Syst., vol. 27, no. 4, Mar. 2022. [Online]. Available: https://doi.org/10.1145/3511472
- [25] Y. Bai, A. Sohrabizadeh, Z. Qin, Z. Hu, Y. Sun, and J. Cong, "Towards a comprehensive benchmark for high-level synthesis targeted to FPGAs," in Advances in Neural Information Processing Systems, 2023.
- [26] Y. Bai, A. Sohrabizadeh, Y. Sun, and J. Cong, "Improving GNN-based accelerator design automation with meta learning," in Proceedings of the 59th ACM/IEEE Design Automation Conference (DAC '22). New York, NY, USA: Association for Computing Machinery, 2022, pp. 1347–1350. [Online]. Available: https://doi.org/10.1145/3489517.3530629
- [27] A. Sohrabizadeh, Y. Bai, Y. Sun, and J. Cong, "Automated accelerator optimization aided by graph neural networks," in Proceedings of the 59th ACM/IEEE Design Automation Conference (DAC '22). New York, NY, USA: Association for Computing Machinery, 2022, pp. 55–60. [Online]. Available: https://doi.org/10.1145/3489517.3530409
- [28] ——, "Robust GNN-based representation learning for HLS," in Proceedings of the 42nd IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2023.
- [29] N. Wu, Y. Xie, and C. Hao, "IronMan: GNN-assisted design space exploration in high-level synthesis via reinforcement learning," in Proceedings of the 2021 Great Lakes Symposium on VLSI (GLSVLSI '21). ACM, Jun. 2021, pp. 39–44. [Online]. Available: http://dx.doi.org/10.1145/3453688.3461495
- [30] Z. Qin, Y. Bai, A. Sohrabizadeh, Z. Ding, Z. Hu, Y. Sun, and J. Cong, "Cross-modality program representation learning for electronic design automation with high-level synthesis," 2024. [Online]. Available: https://arxiv.org/abs/2406.09606
- [31] H. Wang, X. Wu, Z. Ding, S. Zheng, C. Wang, N. Prakriya, T. Nowatzki, Y. Sun, and J. Cong, "LLM-DSE: Searching accelerator parameters with LLM agents," 2025. [Online]. Available: https://arxiv.org/abs/2505.12188
- [32] R. Li, J. Xiong, and X. Wang, "iDSE: Navigating design space exploration in high-level synthesis using LLMs," 2025. [Online]. Available: https://arxiv.org/abs/2505.22086
- [33] L. Collini, S. Garg, and R. Karri, "C2HLSC: Leveraging large language models to bridge the software-to-hardware design gap," ACM Transactions on Design Automation of Electronic Systems, vol. 30, no. 6, pp. 1–24, Oct. 2025. [Online]. Available: http://dx.doi.org/10.1145/3734524
- [34] K. Xu, G. L. Zhang, X. Yin, C. Zhuo, U. Schlichtmann, and B. Li, "Automated C/C++ program repair for high-level synthesis via large language models," 2024. [Online]. Available: https://arxiv.org/abs/2407.03889
- [35] A. E. Oztas and M. Jelodari, "Agentic-HLS: An agentic reasoning based high-level synthesis system using large language models (AI for EDA Workshop 2024)," 2024. [Online]. Available: https://arxiv.org/abs/2412.01604
- [36] I. Puri, S. Sudalairaj, G. Xu, K. Xu, and A. Srivastava, "Rollout roulette: A probabilistic inference approach to inference-time scaling of LLMs using particle-based Monte Carlo methods," in Advances in Neural Information Processing Systems (NeurIPS), 2025.