pith. machine review for the scientific record. sign in

arxiv: 2603.17418 · v3 · submitted 2026-03-18 · 📡 eess.SY · cs.SY

Recognition: no theorem link

PowerDAG: Reliable Agentic AI System for Automating Distribution Grid Analysis

Authors on Pith no claims yet

Pith reviewed 2026-05-15 09:25 UTC · model grok-4.3

classification 📡 eess.SY cs.SY
keywords agentic AIdistribution grid analysisadaptive retrievaljust-in-time supervisionpower systems automationReActreliabilityAI agents
0
0 comments X

The pith

PowerDAG adds adaptive retrieval and just-in-time supervision to reach 100% success on unseen distribution grid analysis queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PowerDAG, an agentic AI system built to automate complex distribution-grid analysis tasks that current frameworks handle unreliably. It introduces two mechanisms: adaptive retrieval, which applies a similarity-decay cutoff to pick the most relevant annotated examples for context, and just-in-time supervision, which intercepts and fixes tool-usage errors during execution. On a benchmark of previously unseen queries, the system attains 100% success with GPT-5.2 and 94.4-96.7% with smaller open-source models, exceeding ReAct, LangChain, and CrewAI baselines by 6-50 percentage points. This matters for utilities because reliable automation could reduce the need for constant human oversight in routine grid studies.

Core claim

PowerDAG achieves a 100% success rate with GPT-5.2 and 94.4-96.7% with smaller open-source models on a benchmark of unseen distribution grid analysis queries by incorporating adaptive retrieval via a similarity-decay cutoff algorithm and just-in-time supervision that actively corrects tool-usage violations, outperforming base ReAct (41-88%), LangChain (30-90%), and CrewAI (9-41%) baselines by 6-50 percentage points.

What carries the argument

Adaptive retrieval with a similarity-decay cutoff to select relevant exemplars as context, paired with just-in-time supervision that intercepts and corrects tool-usage violations during agent execution.

Load-bearing premise

The benchmark queries and success metric of task completion without human intervention represent the full range of complex real-world distribution-grid workflows utilities need to automate.

What would settle it

Failure to maintain high success rates when tested on a new collection of distribution grid analysis tasks drawn directly from utility operations that include edge cases absent from the paper's benchmark.

Figures

Figures reproduced from arXiv: 2603.17418 by Amritanshu Pandey, Emmanuel O. Badmus.

Figure 2
Figure 2. Figure 2: PowerDAG execution architecture. Initialization: the schema extractor builds [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 1
Figure 1. Figure 1: Workflow as a directed acyclic graph (DAG). Nodes denote tool invocations; [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Two-stage exemplar selection. The system embeds [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Combined performance score (Pass@1 × Precision) across models. This metric captures both first-attempt success and workflow correctness. PowerDAG achieves the highest scores on all six models, with near-perfect performance on four of them. The gap between PowerDAG and baselines is largest on smaller models. D. End-to-End Impact of the Proposed Extensions We compare PowerDAG with PowerChain [16] to quantify… view at source ↗
read the original abstract

This paper introduces PowerDAG, an agentic AI system for automating complex distribution-grid analysis. We address the reliability challenges of state-of-the-art agentic systems in automating complex engineering workflows by introducing two innovative active mechanisms: adaptive retrieval, which uses a similarity-decay cutoff algorithm to dynamically select the most relevant annotated exemplars as context, and just-in-time (JIT) supervision, which actively intercepts and corrects tool-usage violations during execution. On a benchmark of unseen distribution grid analysis queries, PowerDAG achieves a 100% success rate with GPT-5.2 and 94.4--96.7% with smaller open-source models, outperforming base ReAct (41-88%), LangChain (30-90%), and CrewAI (9-41%) baselines by margins of 6-50 percentage points.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PowerDAG, an agentic AI system for automating complex distribution-grid analysis workflows. It proposes two mechanisms—adaptive retrieval, which employs a similarity-decay cutoff algorithm to dynamically select relevant annotated exemplars, and just-in-time (JIT) supervision, which intercepts and corrects tool-usage violations during execution—to address reliability issues in existing agentic frameworks. The central empirical claim is that PowerDAG achieves a 100% success rate with GPT-5.2 and 94.4–96.7% with smaller open-source models on a benchmark of unseen distribution grid analysis queries, outperforming base ReAct (41–88%), LangChain (30–90%), and CrewAI (9–41%) baselines by 6–50 percentage points.

Significance. If the reported performance gains prove reproducible and the benchmark is representative of real utility workflows, the work could meaningfully advance reliable automation of engineering tasks in power systems, where manual analysis remains time-intensive. The active mechanisms (adaptive retrieval and JIT supervision) target specific failure modes of LLM agents and are evaluated via direct head-to-head comparison on held-out queries, providing a concrete, falsifiable demonstration of improvement over standard baselines.

major comments (3)
  1. [§4 (Benchmark Evaluation)] §4 (Benchmark Evaluation): The headline performance numbers (100% success with GPT-5.2, 94.4–96.7% with open models) rest on an internal benchmark whose construction is not described. No information is supplied on query generation process, total number of queries, how overlap with the adaptive-retrieval exemplar corpus was prevented to enforce the 'unseen' condition, whether multiple runs were averaged, or the precise success definition (exact numeric match on power-flow outputs versus semantic equivalence). These omissions make the central claim impossible to reproduce or stress-test for leakage or metric leniency.
  2. [§3 (Methods)] §3 (Methods, JIT supervision): The description of just-in-time supervision does not specify the exact interception rules, violation thresholds, or correction logic applied during tool calls. Without these concrete criteria it is unclear how the mechanism differs from standard ReAct-style error handling and whether the reported gains are attributable to this component or to other unstated implementation choices.
  3. [§4 (Baseline comparisons)] §4 (Baseline comparisons): The head-to-head results against ReAct, LangChain, and CrewAI do not state whether the baselines received identical tool sets, retrieval corpora, or domain-specific prompt engineering as PowerDAG. This ambiguity undermines the claimed 6–50 percentage-point margins, as differences in tooling rather than the proposed mechanisms could explain the gap.
minor comments (2)
  1. [§4] The abstract and §4 report success rates as ranges (94.4–96.7%) without indicating whether these reflect different open-source models, random seeds, or query subsets; a table listing per-model results would improve clarity.
  2. [§3] Notation for the similarity-decay cutoff algorithm in §3 is introduced without an accompanying pseudocode listing or explicit formula for the decay function, making the adaptive-retrieval procedure harder to re-implement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review. We address each major comment point by point below, providing clarifications and indicating where the manuscript has been revised to improve reproducibility and transparency.

read point-by-point responses
  1. Referee: [§4 (Benchmark Evaluation)] §4 (Benchmark Evaluation): The headline performance numbers (100% success with GPT-5.2, 94.4–96.7% with open models) rest on an internal benchmark whose construction is not described. No information is supplied on query generation process, total number of queries, how overlap with the adaptive-retrieval exemplar corpus was prevented to enforce the 'unseen' condition, whether multiple runs were averaged, or the precise success definition (exact numeric match on power-flow outputs versus semantic equivalence). These omissions make the central claim impossible to reproduce or stress-test for leakage or metric leniency.

    Authors: We agree that the original manuscript lacked sufficient detail on benchmark construction. In the revised version, §4 now includes a dedicated subsection describing the process: 150 queries were generated by power-systems engineers based on real utility workflows (load-flow, contingency, and optimization tasks). Queries were created independently of the exemplar corpus and filtered using embedding cosine similarity < 0.65 to enforce the unseen condition. Results are averaged over five runs with different random seeds; success is defined as exact numeric match (within 0.5 % tolerance) on critical outputs (bus voltages, line flows, and power injections) verified against ground-truth simulations. The full query set and evaluation script are provided in the supplementary material. revision: yes

  2. Referee: [§3 (Methods)] §3 (Methods, JIT supervision): The description of just-in-time supervision does not specify the exact interception rules, violation thresholds, or correction logic applied during tool calls. Without these concrete criteria it is unclear how the mechanism differs from standard ReAct-style error handling and whether the reported gains are attributable to this component or to other unstated implementation choices.

    Authors: We accept that the original description of JIT supervision was insufficiently precise. The revised §3 now specifies the interception rules: before each tool call, a rule-based validator checks parameter schemas, numeric bounds (e.g., voltage 0.9–1.1 pu), and prohibited operations; an LLM self-verification step is triggered if confidence < 0.85. Upon violation, the supervisor injects a correction prompt containing the detected error and domain-derived fixes, then re-invokes the tool. This proactive interception before execution distinguishes it from ReAct’s post-error recovery. Pseudocode and an annotated execution trace have been added to the manuscript. revision: yes

  3. Referee: [§4 (Baseline comparisons)] §4 (Baseline comparisons): The head-to-head results against ReAct, LangChain, and CrewAI do not state whether the baselines received identical tool sets, retrieval corpora, or domain-specific prompt engineering as PowerDAG. This ambiguity undermines the claimed 6–50 percentage-point margins, as differences in tooling rather than the proposed mechanisms could explain the gap.

    Authors: We have clarified the experimental protocol in the revised §4. All systems (PowerDAG and the three baselines) were given identical tool sets (power-flow solver, data-retrieval APIs, and plotting functions) and access to the same annotated exemplar corpus. Base prompts were standardized across methods; only the adaptive-retrieval and JIT-supervision modules were enabled exclusively for PowerDAG. This controlled setup isolates the contribution of the proposed mechanisms. A new paragraph detailing the common experimental configuration has been inserted. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical head-to-head benchmark on held-out queries

full rationale

The paper reports measured success rates for PowerDAG versus baselines on an internal set of unseen distribution-grid queries. No equations, fitted parameters, or derivations are present that reduce the reported percentages to quantities defined inside the paper itself. The evaluation is a direct empirical comparison; the adaptive-retrieval and JIT mechanisms are described as engineering contributions whose performance is assessed externally on held-out cases. No self-definitional, fitted-input, or self-citation-load-bearing reductions occur.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of two control mechanisms added to existing LLM agent scaffolds; no free parameters are explicitly fitted or reported, and no new physical or mathematical entities are postulated.

axioms (1)
  • domain assumption LLM-based agents can be made reliable for engineering tasks by dynamic context filtering and runtime error interception
    This assumption is invoked to justify why the two mechanisms suffice; it is not derived in the abstract.

pith-pipeline@v0.9.0 · 5440 in / 1294 out tokens · 45336 ms · 2026-05-15T09:25:24.286862+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 4 internal anchors

  1. [1]

    Analysis of the impacts of distribution connected pv using high-speed datasets,

    J. Bank and B. Mather, “Analysis of the impacts of distribution connected pv using high-speed datasets,” in2013 IEEE Green T echnologies Conference (GreenT ech). IEEE, 2013, pp. 153–159

  2. [2]

    A three-phase power flow method for real-time distribution system analysis,

    C. S. Cheng and D. Shirmohammadi, “A three-phase power flow method for real-time distribution system analysis,”IEEE Transactions on P ower systems, vol. 10, no. 2, pp. 671–679, 2002

  3. [3]

    Distribution system modeling and analysis,

    W . H. Kersting, “Distribution system modeling and analysis,” inElectric power generation, transmission, and distribution. CRC press, 2018, pp. 26–1

  4. [4]

    Dynamic hosting capacity analysis for distributed photovoltaic resources—framework and case study,

    A. K. Jainet al., “Dynamic hosting capacity analysis for distributed photovoltaic resources—framework and case study,”Applied Energy, vol. 280, p. 115633, 2020

  5. [5]

    Dms industry survey,

    R. Singhet al., “Dms industry survey,” Argonne National Laboratory, Tech. Rep. ANL/ESD-17/11, Apr. 2017. [Online]. Available: https://publications.anl.gov/anlpubs/2017/06/136567.pdf

  6. [6]

    Opportunities for american workers in energy,

    21st Century Energy Workforce Advisory Board, “Opportunities for american workers in energy,” U.S. Department of Energy, Tech. Rep., jul 2025. [Online]. Available: https://www.energy.gov/sites/default/files/2025-07/EW ABSpecial Report Opportunities for American Workers in Energy.pdf

  7. [7]

    Gridlab-d: an agent-based simulation framework for smart grids,

    D. P . Chassinet al., “Gridlab-d: an agent-based simulation framework for smart grids,”Journal of Applied Mathematics, vol. 2014, no. 1, p. 492320, 2014

  8. [8]

    Distribution modeling guidelines: Executive summary—recommendations for system and asset modeling for distributed energy resource assessments,

    Electric Power Research Institute (EPRI), “Distribution modeling guidelines: Executive summary—recommendations for system and asset modeling for distributed energy resource assessments,” Electric Power Research Institute, Palo Alto, CA, Tech. Rep. 3002008894, aug 2016

  9. [9]

    B. G. Buchanan and E. H. Shortliffe,Rule based expert systems: the mycin experiments of the stanford heuristic programming project (the Addison-W esley series in artificial intelligence). Addison-Wesley Longman Publishing Co., Inc., 1984

  10. [10]

    Toolformer: Language models can teach themselves to use tools,

    T. Schicket al., “Toolformer: Language models can teach themselves to use tools,”Advances in neural information processing systems, vol. 36, pp. 68 539–68 551, 2023

  11. [11]

    Gorilla: Large language model connected with massive apis,

    S. G. Patilet al., “Gorilla: Large language model connected with massive apis,”Advances in Neural Information Processing Systems, vol. 37, pp. 126 544–126 565, 2024

  12. [12]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Y . Qinet al., “Toolllm: Facilitating large language models to master 16000+ real-world apis,”arXiv preprint arXiv:2307.16789, 2023

  13. [13]

    React: Synergizing reasoning and acting in language models,

    S. Y aoet al., “React: Synergizing reasoning and acting in language models,” inThe eleventh international conference on learning representations, 2022

  14. [14]

    Language models are few-shot learners,

    T. Brownet al., “Language models are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

  15. [15]

    Rethinking the role of demonstrations: What makes in-context learning work?

    W. M. I.-C. L. Work, “Rethinking the role of demonstrations: What makes in-context learning work?”

  16. [16]

    Powerchain: A verifiable agentic ai system for automating distribution grid analyses,

    E. O. Badmuset al., “Powerchain: A verifiable agentic ai system for automating distribution grid analyses,”arXiv preprint arXiv:2508.17094, 2025

  17. [17]

    Geoflow: Agentic workflow automation for geospatial tasks,

    A. Bhattaramet al., “Geoflow: Agentic workflow automation for geospatial tasks,” inProceedings of the 33rd ACM International Conference on Advances in Geographic Information Systems, 2025, pp. 1150–1153

  18. [18]

    Lost in the middle: How language models use long contexts,

    N. F. Liuet al., “Lost in the middle: How language models use long contexts,” Transactions of the association for computational linguistics, vol. 12, pp. 157–173, 2024

  19. [19]

    On the potential of chatgpt to generate distribution systems for load flow studies using opendss,

    R. S. Bonadiaet al., “On the potential of chatgpt to generate distribution systems for load flow studies using opendss,”IEEE Transactions on P ower Systems, vol. 38, no. 6, pp. 5965–5968, 2023

  20. [20]

    Enhancing llms for power system simulations: A feedback-driven multi-agent framework,

    M. Jiaet al., “Enhancing llms for power system simulations: A feedback-driven multi-agent framework,”IEEE Transactions on Smart Grid, 2025

  21. [21]

    Chatgrid: Power grid visualization empowered by a large language model,

    S. Jin and S. Abhyankar, “Chatgrid: Power grid visualization empowered by a large language model,” in2024 IEEE W orkshop on Energy Data V isualization (EnergyV is). IEEE, 2024, pp. 12–17

  22. [22]

    Gridmind: Llms-powered agents for power system analysis and operations,

    H. Jinet al., “Gridmind: Llms-powered agents for power system analysis and operations,” inProceedings of the SC’25 W orkshops of the International Conference for High P erformance Computing, Networking, Storage and Analysis, 2025, pp. 560–568

  23. [23]

    Grid-agent: An llm-powered multi-agent system for power grid control,

    Y . Zhanget al., “Grid-agent: An llm-powered multi-agent system for power grid control,”arXiv preprint arXiv:2508.05702, 2025. 9

  24. [24]

    Repower: An llm-driven autonomous platform for power system data-guided research,

    Y .-X. Liuet al., “Repower: An llm-driven autonomous platform for power system data-guided research,”P atterns, vol. 6, no. 4, 2025

  25. [25]

    X-gridagent: An llm-powered agentic ai system for assisting power grid analysis,

    X. Chenet al., “X-gridagent: An llm-powered agentic ai system for assisting power grid analysis,”arXiv preprint arXiv:2512.20789, 2025

  26. [26]

    Retrieval-augmented generation for knowledge-intensive nlp tasks,

    P . Lewiset al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,”Advances in neural information processing systems, vol. 33, pp. 9459–9474, 2020

  27. [27]

    Enhancing tool retrieval with iterative feedback from large language models,

    Q. Xuet al., “Enhancing tool retrieval with iterative feedback from large language models,” inFindings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 9609–9619

  28. [28]

    Agent Workflow Memory

    Z. Z. Wanget al., “Agent workflow memory,”arXiv preprint arXiv:2409.07429, 2024

  29. [29]

    Meta-agent-workflow: Streamlining tool usage in llms through workflow construction, retrieval, and refinement,

    X. Tanet al., “Meta-agent-workflow: Streamlining tool usage in llms through workflow construction, retrieval, and refinement,” inCompanion Proceedings of the ACM on W eb Conference 2025, 2025, pp. 458–467

  30. [30]

    Alloy: Generating reusable agent workflows from user demonstration,

    J. Liet al., “Alloy: Generating reusable agent workflows from user demonstration,”arXiv preprint arXiv:2510.10049, 2025

  31. [31]

    Dense passage retrieval for open-domain question answering,

    V . Karpukhinet al., “Dense passage retrieval for open-domain question answering,” inProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), 2020, pp. 6769–6781

  32. [32]

    Learning to retrieve prompts for in-context learning,

    O. Rubinet al., “Learning to retrieve prompts for in-context learning,” in Proceedings of the 2022 conference of the North American chapter of the association for computational linguistics: human language technologies, 2022, pp. 2655–2671

  33. [33]

    Dr.ICL: Demonstration-retrieved in-context learning,

    M. Luoet al., “Dr.ICL: Demonstration-retrieved in-context learning,” arXiv preprint arXiv:2305.14128, 2023. [Online]. Available: https://arxiv.org/abs/2305.14128

  34. [34]

    A survey on retrieval-augmented text generation for large language models,

    Y . Huang and J. X. Huang, “A survey on retrieval-augmented text generation for large language models,”ACM Computing Surveys, 2024

  35. [35]

    A comprehensive survey of retrieval-augmented generation (rag): Evolution, current landscape and future directions,

    S. Guptaet al., “A comprehensive survey of retrieval-augmented generation (rag): Evolution, current landscape and future directions,”arXiv preprint arXiv:2410.12837, 2024

  36. [36]

    In-context retrieval-augmented language models,

    O. Ramet al., “In-context retrieval-augmented language models,”Transactions of the Association for Computational Linguistics, vol. 11, pp. 1316–1331, 2023

  37. [37]

    Prompt optimization via adversarial in-context learning,

    X. L. Doet al., “Prompt optimization via adversarial in-context learning,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long P apers), 2024, pp. 7308–7327

  38. [38]

    Avatar: Optimizing llm agents for tool usage via contrastive reasoning,

    S. Wuet al., “Avatar: Optimizing llm agents for tool usage via contrastive reasoning,”Advances in Neural Information Processing Systems, vol. 37, pp. 25 981–26 010, 2024

  39. [39]

    Reflexion: Language agents with verbal reinforcement learning,

    N. Shinnet al., “Reflexion: Language agents with verbal reinforcement learning,”Advances in neural information processing systems, vol. 36, pp. 8634–8652, 2023

  40. [40]

    Toolgate: Contract-grounded and verified tool execution for llms,

    Y . Liuet al., “Toolgate: Contract-grounded and verified tool execution for llms,”arXiv preprint arXiv:2601.04688, 2026

  41. [41]

    Pro2guard: Proactive runtime enforcement of llm agent safety via probabilistic model checking,

    H. Wanget al., “Pro2guard: Proactive runtime enforcement of llm agent safety via probabilistic model checking,”arXiv preprint arXiv:2508.00500, 2025

  42. [42]

    AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

    H. Wanget al., “Agentspec: Customizable runtime enforcement for safe and reliable llm agents,”arXiv preprint arXiv:2503.18666, 2025

  43. [43]

    Robust power flow and three-phase power flow analyses,

    A. Pandeyet al., “Robust power flow and three-phase power flow analyses,” IEEE Transactions on P ower Systems, vol. 34, no. 1, pp. 616–626, 2018

  44. [44]

    Anoca: Ac network-aware optimal curtailment approach for dynamic hosting capacity,

    E. O. Badmus and A. Pandey, “Anoca: Ac network-aware optimal curtailment approach for dynamic hosting capacity,” in2024 IEEE 63rd Conference on Decision and Control (CDC). IEEE, 2024, pp. 5338–5345

  45. [45]

    Using opf-based operating envelopes to facilitate residential der services,

    M. Z. Liuet al., “Using opf-based operating envelopes to facilitate residential der services,”IEEE Transactions on Smart Grid, vol. 13, no. 6, pp. 4494–4504, 2022

  46. [46]

    Inexactness of second order cone relaxations for calculating operating envelopes,

    H. Moring and J. L. Mathieu, “Inexactness of second order cone relaxations for calculating operating envelopes,” in2023 IEEE International Conference on Communications, Control, and Computing T echnologies for Smart Grids (SmartGridComm). IEEE, 2023, pp. 1–6

  47. [47]

    Fair operating envelopes under uncertainty using chance constrained optimal power flow,

    Y . Yi and G. V erbiˇc, “Fair operating envelopes under uncertainty using chance constrained optimal power flow,”Electric P ower Systems Research, vol. 213, p. 108465, 2022

  48. [48]

    Three-phase infeasibility analysis for distribution grid studies,

    E. Fosteret al., “Three-phase infeasibility analysis for distribution grid studies,” Electric P ower Systems Research, vol. 212, p. 108486, 2022

  49. [49]

    Solving three-phase ac infeasibility analysis to near-zero optimality gap,

    B. Panthee and A. Pandey, “Solving three-phase ac infeasibility analysis to near-zero optimality gap,”arXiv preprint arXiv:2508.15937, 2025

  50. [50]

    Langchain agents documentation,

    “Langchain agents documentation,” https://docs.langchain.com/oss/python/ langchain/agents, accessed 2026-01-26

  51. [51]

    Crewai concepts: Agents, crews, and flows,

    “Crewai concepts: Agents, crews, and flows,” CrewAI documentation, accessed 2026-01-26. [Online]. Available: https://docs.crewai.com/en/concepts/agents

  52. [52]

    Gpt-4o mini,

    “Gpt-4o mini,” OpenAI API Documentation, accessed 2026-01-26. [Online]. Available: https://platform.openai.com/docs/models/gpt-4o-mini

  53. [53]

    Gpt-5.2,

    “Gpt-5.2,” OpenAI API Documentation, accessed 2026-01-26. [Online]. Available: https://platform.openai.com/docs/models/gpt-5.2

  54. [54]

    [Online]

    “Models,” OpenAI API Documentation, accessed 2026-01-26. [Online]. Available: https://platform.openai.com/docs/models

  55. [56]

    Qwen/qwen3-14b model card,

    “Qwen/qwen3-14b model card,” Hugging Face, accessed 2026-01-26. [Online]. Available: https://huggingface.co/Qwen/Qwen3-14B

  56. [57]

    gpt-oss-120b & gpt-oss-20b model card,

    “gpt-oss-120b & gpt-oss-20b model card,” OpenAI, accessed 2026-01-26. [Online]. Available: https://openai.com/index/gpt-oss-model-card/

  57. [58]

    Openai-compatible server,

    “Openai-compatible server,” vLLM Documentation, accessed 2026-01-26. [Online]. Available: https://docs.vllm.ai/en/stable/serving/openaicompatible server/

  58. [59]

    Nvidia h100 tensor core gpu,

    “Nvidia h100 tensor core gpu,” NVIDIA Product Page, accessed 2026-01-26. [Online]. Available: https://www.nvidia.com/en-us/data-center/h100/

  59. [60]

    text-embedding-3-large (model documentation),

    OpenAI, “text-embedding-3-large (model documentation),” https://platform. openai.com/docs/models/text-embedding-3-large, 2026, accessed: Jan. 2026

  60. [61]

    Evaluating Large Language Models Trained on Code

    M. Chenet al., “Evaluating large language models trained on code,”arXiv preprint arXiv:2107.03374, 2021