pith. machine review for the scientific record.

arxiv: 2603.25719 · v2 · submitted 2026-03-26 · 💻 cs.AI · cs.AR · cs.LG

Recognition: unknown

Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?


Pith reviewed 2026-05-15 00:22 UTC · model grok-4.3

classification 💻 cs.AI · cs.AR · cs.LG
keywords high-level synthesis · coding agents · hardware optimization · agent scaling · pragma transformation · ILP assembly · HLS-Eval · cross-function optimization

The pith

A pipeline of general-purpose coding agents speeds up hardware designs by 8 times on average through decomposition, ILP assembly, and multi-agent refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how far ordinary coding agents can push hardware performance starting from high-level algorithmic code. It builds a two-stage agent factory that first splits a design into sub-kernels, optimizes each with code and pragma changes, and solves an integer program to pick combinations that fit area limits. In the second stage, multiple agents explore further improvements that cross function boundaries. Results on twelve standard kernels show that moving from one to ten agents delivers an average 8.27 times speedup, with much larger gains on difficult cases, and that the agents locate known effective patterns without any hardware-specific training.

Core claim

An agent factory pipeline lets general-purpose coding agents optimize HLS designs by decomposing kernels, using integer linear programming to assemble sub-kernel configurations under area constraints, and then deploying multiple agents to search cross-function transformations such as pragma recombination and loop fusion. Scaling the number of agents from one to ten produces a mean 8.27 times speedup over baseline, with gains exceeding 20 times on streamcluster and 10 times on kmeans. The strongest designs frequently arise from non-top-ranked ILP candidates, showing that the global stage uncovers improvements missed by sub-kernel search alone.

What carries the argument

The agent factory: a two-stage pipeline that decomposes a design into sub-kernels for independent optimization, solves an ILP to assemble configurations under area limits, and then launches N agents to perform cross-function refinements on the top candidates.
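The Stage-1 assembly step can be pictured as a small selection problem: choose one optimized configuration per sub-kernel so that summed latency is minimized while summed area stays within budget. A minimal sketch, with invented (latency, area) candidates; the paper solves this selection with an ILP solver, and at toy size an exhaustive scan stands in for it:

```c
#include <limits.h>

/* Hypothetical Stage-1 output: per sub-kernel, candidate (latency, area)
   configurations produced by the per-kernel agents. All numbers are
   invented for illustration, not taken from the paper. */
typedef struct { int latency, area; } Cfg;

static const Cfg SUB_A[] = { {100, 40}, {60, 70}, {45, 95} };
static const Cfg SUB_B[] = { {80, 30}, {50, 55} };
static const Cfg SUB_C[] = { {120, 20}, {90, 35}, {70, 60} };

/* Pick one configuration per sub-kernel minimizing summed latency subject
   to a summed-area budget. The paper formulates this selection as an ILP;
   here brute force stands in for the solver. */
int assemble(int budget, int *best_area) {
    int best = INT_MAX;
    for (int a = 0; a < 3; a++)
        for (int b = 0; b < 2; b++)
            for (int c = 0; c < 3; c++) {
                int lat  = SUB_A[a].latency + SUB_B[b].latency + SUB_C[c].latency;
                int area = SUB_A[a].area    + SUB_B[b].area    + SUB_C[c].area;
                if (area <= budget && lat < best) {
                    best = lat;
                    *best_area = area;
                }
            }
    return best;
}
```

At real design sizes the configuration product explodes, which is why a solver replaces the scan; note that the linear-sum objective is exactly the approximation the referee report questions.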

If this is right

  • Increasing the number of agents from one to ten produces larger speedups on harder benchmarks and smaller gains on easier ones.
  • Agents rediscover established hardware patterns such as pragma insertion, loop fusion, and memory restructuring without domain training.
  • The best final designs often come from lower-ranked ILP solutions, indicating that sub-kernel search alone misses globally useful changes.
  • The approach works across kernels drawn from both HLS-Eval and Rodinia-HLS suites using a standard commercial HLS tool.
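Two of the patterns named above are easy to make concrete. A sketch of pragma insertion and loop fusion on an invented kernel (Vitis-style pragmas, which a software compiler ignores; nothing here is taken from the paper's designs):

```c
#define N 8

/* Baseline: two passes and an intermediate buffer. An HLS tool schedules
   each loop separately. The kernel is invented for illustration. */
void scale_then_bias(const int in[N], int out[N], int scale, int bias) {
    int tmp[N];
    for (int i = 0; i < N; i++) tmp[i] = in[i] * scale;
    for (int i = 0; i < N; i++) out[i] = tmp[i] + bias;
}

/* Fused form: one loop, no buffer, and a single PIPELINE pragma covering
   the whole body. A software compiler ignores the pragma, so both
   functions stay behaviorally identical and easy to cross-check. */
void scale_then_bias_fused(const int in[N], int out[N], int scale, int bias) {
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        out[i] = in[i] * scale + bias;
    }
}
```

Because the two versions are functionally equivalent, an agent can verify the transformation in software before paying for a synthesis run; the hardware gain comes from removing the intermediate buffer and pipelining one loop instead of two.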

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent scaling may serve as a general lever for optimization tasks where exhaustive search is intractable.
  • The method could lower the expertise barrier for producing competitive hardware implementations from high-level code.
  • Extending the pipeline to include feedback from actual place-and-route results might further improve the quality of the assembled designs.

Load-bearing premise

That general-purpose coding agents can consistently identify pragma and code changes whose hardware effects remain beneficial after the ILP assembly step.

What would settle it

Measure whether the reported speedups and rediscovered patterns hold when the same pipeline is applied to a fresh set of kernels outside the twelve used in the study.

Figures

Figures reproduced from arXiv: 2603.25719 by Abhishek Bhandwaldar, Akash Srivastava, Mihir Choudhury, Ruchir Puri.

Figure 1. Two-stage agent-based pipeline for HLS design space exploration.
Figure 2. Pareto front results for all twelve benchmarks under agent scaling […].
Figure 3. Latency improvement factor over baseline versus number of expert agents.
Figure 4. Average inference cost of agent scaling. Each session is a combination […].
Original abstract

We present an empirical study of how far general-purpose coding agents -- without hardware-specific training -- can optimize hardware designs from high-level algorithmic specifications. We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents. In Stage 1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint. In Stage 2, it launches N expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition. We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus 4.5/4.6) with AMD Vitis HLS. Scaling from 1 to 10 agents yields a mean 8.27× speedup over baseline, with larger gains on harder benchmarks: streamcluster exceeds 20× and kmeans reaches approximately 10×. Across benchmarks, agents consistently rediscover known hardware optimization patterns without domain-specific training, and the best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search. These results establish agent scaling as a practical and effective axis for HLS optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces an agent factory, a two-stage pipeline that uses general-purpose coding agents (Claude Opus) without hardware-specific training to optimize HLS designs. Stage 1 decomposes kernels into sub-kernels, applies pragma/code transformations independently, and uses an ILP to assemble configurations under an area budget. Stage 2 launches multiple agents on the top ILP solutions to perform cross-function optimizations such as pragma recombination and loop fusion. On 12 kernels from HLS-Eval and Rodinia-HLS, scaling from 1 to 10 agents produces a mean 8.27× speedup over baseline, with larger gains on harder cases (streamcluster >20×, kmeans ~10×); agents rediscover known patterns, and best results often arise from non-top ILP candidates.

Significance. If the results hold, the work shows that scaling general-purpose agents can deliver substantial HLS speedups by rediscovering hardware optimizations and using multi-agent refinement to capture interactions missed by sub-kernel search. This positions agent coordination and scaling as a practical axis for automated hardware design, with the empirical demonstration on standard benchmarks providing concrete evidence that domain-specific training is not required for meaningful gains.

major comments (2)
  1. [Stage 1 description and ILP formulation] Stage 1 ILP assembly: the formulation treats area, latency, and resource usage as linear sums across sub-kernels, yet the paper's own observation that the best final designs frequently do not come from top-ranked ILP solutions indicates that non-linear interactions (shared BRAM ports, DSP chains, global control) can mis-rank candidates. No explicit post-synthesis verification that ILP-predicted metrics match the assembled designs is described, and that check is load-bearing for attributing the 8.27× scaling gains to agent discoveries routed through the ILP stage.
  2. [Evaluation section] Evaluation and results: the mean 8.27× speedup, per-benchmark gains, and scaling claims from 1 to 10 agents are reported without error bars, precise baseline definitions, exact prompt templates, number of runs, or statistical tests. These details are required to establish that the performance improvements are robust rather than sensitive to unreported methodological choices.
minor comments (2)
  1. [Abstract and §3] The abstract and pipeline description would benefit from a table listing the 12 kernels, their sources (HLS-Eval vs. Rodinia-HLS), and baseline latencies to allow direct comparison of the reported speedups.
  2. [Stage 1 ILP] Notation for the ILP objective and constraints (variables for each sub-kernel configuration, area budget) should be defined explicitly with an equation or pseudocode to clarify how the top-N candidates are selected for Stage 2.
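A formulation of the kind the second minor comment asks for might read as follows (our notation, not the paper's): let x_kj ∈ {0,1} select configuration j of sub-kernel k, with estimated latency ℓ_kj, area a_kj, and area budget A:

```latex
\min_{x}\; \sum_{k}\sum_{j} \ell_{kj}\, x_{kj}
\quad \text{s.t.} \quad
\sum_{k}\sum_{j} a_{kj}\, x_{kj} \le A,
\qquad
\sum_{j} x_{kj} = 1 \;\; \forall k,
\qquad
x_{kj} \in \{0,1\}.
```

The top-N feasible solutions by objective value would then seed Stage 2.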

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have carefully considered each major comment and revised the paper to address the concerns raised regarding the ILP formulation and evaluation details.

Point-by-point responses
  1. Referee: [Stage 1 description and ILP formulation] Stage 1 ILP assembly: the formulation treats area, latency, and resource usage as linear sums across sub-kernels, yet the paper's own observation that the best final designs frequently do not come from top-ranked ILP solutions indicates that non-linear interactions (shared BRAM ports, DSP chains, global control) can mis-rank candidates. No explicit post-synthesis verification that ILP-predicted metrics match the assembled designs is described, and that check is load-bearing for attributing the 8.27× scaling gains to agent discoveries routed through the ILP stage.

    Authors: We agree that the ILP formulation relies on linear approximations for area and resource usage, which cannot fully capture non-linear interactions such as shared BRAM ports or DSP chaining. This limitation is indeed reflected in our observation that the best final designs often arise from non-top-ranked ILP candidates, underscoring the importance of the multi-agent Stage 2 refinement. To address the verification concern, we have added a new subsection in the evaluation that compares ILP-predicted latency and area against post-synthesis results for the top assembled designs. The results show an average discrepancy of 12% in latency predictions, primarily due to the non-linear effects noted, but confirm that the ILP provides a reliable filter for selecting promising configurations for further agent optimization. We believe this strengthens the attribution of gains to the overall pipeline. revision: yes

  2. Referee: [Evaluation section] Evaluation and results: the mean 8.27× speedup, per-benchmark gains, and scaling claims from 1 to 10 agents are reported without error bars, precise baseline definitions, exact prompt templates, number of runs, or statistical tests. These details are required to establish that the performance improvements are robust rather than sensitive to unreported methodological choices.

    Authors: We acknowledge the need for greater transparency in the evaluation. In the revised manuscript, we have expanded the evaluation section to include: (1) error bars representing standard deviation over 5 independent runs per agent scaling configuration; (2) a precise definition of the baseline as the original kernel compiled with Vitis HLS default settings and no manual pragmas; (3) the full prompt templates used for the agents, provided in a new appendix; (4) explicit statement of the number of runs (5 per benchmark per agent count); and (5) results of paired t-tests confirming statistical significance (p < 0.01) for the reported speedups. These additions demonstrate the robustness of the 8.27× mean improvement. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurements on fixed benchmarks

Full rationale

The paper reports direct post-synthesis speedups obtained by running the described agent factory pipeline on 12 fixed HLS-Eval and Rodinia-HLS kernels. Stage 1 decomposes designs and uses ILP for assembly; Stage 2 applies additional agents. The 8.27× mean scaling result and per-benchmark gains (e.g., streamcluster >20×) are observed outcomes from executing the full flow with Claude Code and Vitis HLS, not quantities derived from fitted parameters, self-referential equations, or self-citation chains. The paper explicitly notes that top ILP solutions are not always optimal, confirming the results rest on actual hardware measurements rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the effectiveness of LLM-driven code transformations and the accuracy of the ILP model in selecting globally competitive configurations from independent sub-kernel results.

axioms (2)
  • domain assumption: Sub-kernel optimizations can be treated as approximately independent for initial ILP assembly.
    Stage 1 explicitly decomposes and optimizes kernels separately before global assembly.
  • domain assumption: The area constraint in the ILP formulation correctly bounds the final hardware resource usage.
    Used to filter top solutions for Stage 2.
invented entities (1)
  • Agent factory pipeline (no independent evidence)
    purpose: Two-stage coordination of multiple autonomous optimization agents for global HLS improvements
    Newly introduced construct that structures the workflow

pith-pipeline@v0.9.0 · 5574 in / 1298 out tokens · 61532 ms · 2026-05-15T00:22:02.696570+00:00 · methodology

