Knowledge Boundary Probing and Demand-Guided Intervention for LLM-Based Power System Code Generation

Hui Wu; Xiaoyang Wang; Zhong Fan

arXiv: 2605.31478 · v1 · pith:YI6C52JSnew · submitted 2026-05-29 · cs.SE · cs.CL · cs.SY · eess.SY

Pith reviewed 2026-06-28 21:23 UTC · model grok-4.3

classification cs.SE · cs.CL · cs.SY · eess.SY

keywords large language models · code generation · power systems · API knowledge boundaries · benchmark · prompt intervention · on-premise deployment · simulation libraries

The pith

A boundary-aware intervention that corrects API knowledge gaps raises LLM accuracy on power-system code generation by 32 to 56 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that errors in LLM-generated code for power system analysis arise mainly from gaps in knowing exact library functions and parameters rather than from shortfalls in logical reasoning. It develops a probing method that maps each model's API knowledge profile across documentation levels, and pairs it with an intervention that estimates the documentation a query needs and then supplies that information either before generation or through later correction. This produces large accuracy gains on a benchmark of 2,000 natural-language-to-code tasks with numerical verification. A sympathetic reader would conclude that targeted knowledge injection at inference time can make open models suitable for sensitive on-premise use in regulated settings without retraining.
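As a rough illustration of what a per-model API knowledge profile might look like, the sketch below turns pass rates on four probe levels (L0 to L3) into one risk score per function. The equal-weight mean, the helper name `risk_profile`, and every number are assumptions made for this example; the exact aggregation used in the paper is not given on this page. The function names are used only as labels.

```python
from statistics import mean

# Hypothetical probe pass rates in [0, 1] per API function and probe level.
# All numbers are invented for illustration, not taken from the paper.
probe_scores = {
    "pp.runpp": {"L0": 1.0, "L1": 0.9, "L2": 0.8, "L3": 0.7},
    "pp.runopp": {"L0": 0.9, "L1": 0.6, "L2": 0.5, "L3": 0.2},
    "sc.calc_sc": {"L0": 0.6, "L1": 0.3, "L2": 0.2, "L3": 0.1},
}

LEVELS = ("L0", "L1", "L2", "L3")


def risk_profile(scores):
    """Return {function: risk}, with risk = 1 - mean pass rate over L0..L3 (assumed form)."""
    return {
        func: 1.0 - mean(levels[level] for level in LEVELS)
        for func, levels in scores.items()
    }


profile = risk_profile(probe_scores)

# List functions from highest to lowest estimated risk.
for name, risk in sorted(profile.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: risk={risk:.2f}")
```

A profile of this shape gives a later step something concrete to rank against: functions the model rarely gets right are the ones whose documentation is most worth spending prompt tokens on.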

Core claim

The authors claim that first-pass failures in power-system code generation are dominated by structured API-knowledge boundary errors such as hallucinated function names, misused parameters, and mishandled result tables in versioned simulation libraries. They introduce a boundary-aware intervention that combines query-side API demand estimation with targeted proactive documentation injection and routed reactive correction, which improves accuracy by 32 to 56 points across every evaluated open-weight model of at least 7B parameters and every commercial API on a 2,000-task frozen benchmark.

What carries the argument

The boundary-aware intervention, which estimates the API documentation demands from the input query and supplies relevant information proactively or corrects outputs reactively.
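The proactive half of that idea can be sketched as a budgeted selection problem. The version below is a toy under stated assumptions: demand is estimated by keyword overlap, snippets are ranked by demand times risk, and tokens are counted by whitespace. The helpers `estimate_demand` and `select_snippets`, the keyword sets, the risk values, and the paraphrased documentation strings are all hypothetical and are not the paper's estimator or scoring rule.

```python
def estimate_demand(query, keywords):
    """Toy demand estimate: fraction of a function's keywords that appear in the query."""
    words = set(query.lower().split())
    return {func: len(words & kws) / len(kws) for func, kws in keywords.items()}


def select_snippets(demand, risk, docs, budget_tokens):
    """Greedily pick documentation snippets by demand * risk until the budget is spent."""
    def score(func):
        return demand.get(func, 0.0) * risk.get(func, 0.0)

    chosen, used = [], 0
    for func in sorted(docs, key=score, reverse=True):
        cost = len(docs[func].split())  # crude whitespace token count
        if score(func) > 0 and used + cost <= budget_tokens:
            chosen.append(func)
            used += cost
    return chosen


# Hypothetical inputs: keyword sets, per-model risk, and paraphrased doc snippets.
keywords = {
    "runpp": {"power", "flow", "voltage"},
    "runopp": {"optimal", "dispatch", "cost"},
    "calc_sc": {"short", "circuit", "fault"},
}
risk = {"runpp": 0.15, "runopp": 0.45, "calc_sc": 0.70}
docs = {
    "runpp": "runpp(net) runs an AC power flow and writes bus results to net.res_bus",
    "runopp": "runopp(net) runs an optimal power flow and stores the objective in net.res_cost",
    "calc_sc": "calc_sc(net) computes short-circuit currents and writes them to net.res_bus_sc",
}

query = "Run a power flow and report the fault current for a short circuit at bus 3"
demand = estimate_demand(query, keywords)
selected = select_snippets(demand, risk, docs, budget_tokens=25)
print(selected)
```

The design point is that a snippet is injected only when the query appears to need the function and the model is likely to get it wrong, which is one plausible way a targeted prompt could stay well below full-context token cost.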

If this is right

The intervention improves every evaluated open-weight model of at least 7B parameters and every commercial API by 32 to 56 accuracy points.
Open-weight models in the 70B-120B range reach the commercial mid-tier accuracy range.
The targeted prompts preserve the full-context accuracy ceiling while using 41 percent of the prompt-token cost.
The largest evaluated open-weight models lead the panel on the benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same probing and injection pattern could be tested on code generation tasks that rely on other versioned domain libraries.
The results suggest that maintaining accurate documentation access may be more efficient than repeated fine-tuning for keeping models current.
Utilities could apply the method to achieve reliable local assistance without sending data to external services.

Load-bearing premise

That the dominant source of first-pass errors in this domain is incomplete knowledge of library APIs rather than deeper limitations in reasoning about the underlying physical or mathematical problems.

What would settle it

A collection of tasks where models still produce incorrect code even after correct API documentation is supplied, because they misunderstand the underlying calculations or optimization logic, would show that the intervention does not address the main failure mode.

Figures

Figures reproduced from arXiv: 2605.31478 by Hui Wu, Xiaoyang Wang, Zhong Fan.

**Figure 1.** Reliability workflow for LLM-based power-system code generation. Data flow per item: natural-language grid-analysis request → LLM-generated program → execution on the target grid case → scalar engineering output. PowerCodeBench provides execution-validated tasks across power-system analyses (contingency, OPF, short-circuit, time-series). L0–L3 probing converts structured API documentation into per-model kn…

**Figure 2.** Frozen-release composition statistics. (a) Item count per difficulty level (D1–D4). (b) Task-family frequency distribution across the 15 families sampled in this release. (c) Grid network size distribution across the 39 test cases used (bus count).

**Figure 3.** The L0–L3 probe generator and profile output. Structured API documentation is transformed into recognition, recall, comprehension, and application probes, then aggregated into a model-specific knowledge-risk profile ρ_M(f) over the API corpus.

**Figure 4.** Category-averaged knowledge-risk profile derived from ρ_M(f) = 1 − L0..L3 across the open-weight panel (10 models) and 4 closed-source APIs; warmer colours indicate higher risk. Vertical separators delimit the two groups; the right-most column is the all-model mean.

**Figure 5.** Boundary-aware intervention: two-phase pipeline. Phase 1 (Proactive) injects layer-specific API documentation before generation, guided by task demand d̂(f, q) and model-specific knowledge deficits over L0–L3. Phase 2 (Reactive) maps execution feedback to either runtime/API repair routes or value-level semantic diagnosis before re-execution.

**Figure 6.** Cross-vendor results for the 9-model panel of…

**Figure 7.** Per-task accuracy heatmap with identical row and column ordering in both panels for direct side-by-side comparison. (a) Baseline A (no proactive injection, R0). (b) Full method C+FDRS after 3 fix rounds. Rows are pandapower-derived task families; columns split into open-weight models (left) and closed-source APIs (right).

**Figure 8.** Token economy: Pareto trade-offs (a, b) and per-model saving breakdowns (c.1, c.2). (a) Proactive R0: prompt tokens vs. matched items (out of 2,000) for conditions A/B/C/X/R. (b) Reactive economy under base C: prompt tokens vs. successful fix attempts for FX/FD/FDR/FDRS. (c.1) Per-model R0 prompt decomposition under C: base prompt, documentation block (L0–L3 snippets plus DataFrame boundary cards), and temp…

Original abstract

Large language models (LLMs) are increasingly used to automate power-system analysis, but many utilities and energy-research labs require on-premise serving for confidentiality, regulatory, reproducibility, and cost reasons. This makes the reliability of open-weight models a deployment issue. We show that first-pass failures in power-system code generation are dominated not by reasoning alone, but by structured API-knowledge boundary errors: hallucinated function names, misused parameters, and mishandled result tables in versioned simulation libraries. We introduce PowerCodeBench, an execution-validated benchmark generator that pairs natural-language operator queries with pandapower code and numerical ground truth; an L0-L3 documentation-driven probing procedure that measures per-model API knowledge profiles; and a boundary-aware intervention that combines query-side API demand estimation with targeted proactive documentation injection and routed reactive correction. On a 2,000-task frozen release, we evaluate ten open-weight LLMs (1.5B-480B parameters) and four commercial mid-tier APIs. The intervention improves every evaluated open-weight model of at least 7B parameters and every commercial API by 32 to 56 accuracy points. Open-weight models in the 70B-120B range match the commercial mid-tier accuracy range, while Llama-3.1-405B and Qwen3-Coder-480B lead the panel. The targeted prompts preserve the full-context accuracy ceiling while using 41% of the prompt-token cost. The result is an accuracy-side, deployment-time path toward reliable on-premise LLM assistance for grid-analysis workflows without fine-tuning or cloud inference.
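The execution-validated check and the routed reactive step described in the abstract can be pictured with a small harness. This is a minimal sketch under assumptions, not the paper's pipeline: it assumes the generated program stores its scalar answer in a variable named `result`, it uses plain arithmetic in place of a grid simulation so that it runs without pandapower, and the route names and the helper `route_feedback` are invented for illustration.

```python
import math


def route_feedback(code, expected, tol=1e-6):
    """Run generated code and pick a follow-up route from the outcome.

    Assumed convention: the generated program stores its scalar answer in `result`.
    Returns (route, detail), where route is one of
    "pass", "runtime_api_repair", or "value_diagnosis".
    """
    namespace = {}
    try:
        exec(code, namespace)  # illustration only; untrusted code needs a real sandbox
    except Exception as exc:
        # Crashes (for example a hallucinated function name) go to the repair route.
        return "runtime_api_repair", f"{type(exc).__name__}: {exc}"
    value = namespace.get("result")
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return "runtime_api_repair", "no numeric `result` was produced"
    if math.isclose(value, expected, rel_tol=tol, abs_tol=tol):
        return "pass", ""
    # The code ran but the number is off: hand it to value-level diagnosis.
    return "value_diagnosis", f"got {value}, expected {expected}"


# Stand-in "generated programs"; real items would call a simulation library.
candidates = {
    "correct": "result = 230.0 * 1.02",
    "hallucinated_call": "import math\nresult = math.run_power_flow()",
    "wrong_value": "result = 230.0 * 1.20",
}
outcomes = {
    name: route_feedback(code, expected=234.6)[0] for name, code in candidates.items()
}
print(outcomes)
```

Separating a crash from a wrong number matters because the two failures call for different help: the first suggests missing API knowledge, while the second suggests the program ran the wrong analysis or read the wrong result column.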

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a concrete, deployment-focused method to cut API-knowledge errors in power-system code generation and reports large, uniform gains on a new benchmark.

The main takeaway is that first-pass LLM failures on pandapower tasks are mostly boundary errors rather than reasoning shortfalls, and a lightweight query-side demand estimate plus targeted doc injection fixes most of them. They built PowerCodeBench as an execution-validated generator, added an L0-L3 probing scheme to map per-model API coverage, and tested a boundary-aware intervention on ten open-weight models plus four commercial APIs.

What stands out is the consistency: every model 7B and up, plus the commercial endpoints, gained 32-56 accuracy points on the 2,000-task frozen set, with 70-120B open models reaching commercial mid-tier levels and the largest models leading. The intervention keeps the full-context ceiling while cutting token use to 41 percent. That directly addresses the on-premise constraint utilities face.

The soft spots are the usual ones for an empirical systems paper. The abstract states that API-boundary errors dominate, but without the full error breakdown or ablation on how tasks were sampled it is hard to know how much the gains depend on the benchmark construction. The L0-L3 procedure sounds useful for profiling, yet the paper will need to show it is reproducible and not just tuned to pandapower. No statistical tests or variance numbers appear in the summary, so the 32-56 point range needs the raw per-model tables and controls to hold up.

This is for people working on domain-specific code generation or on-prem LLM deployment in engineering fields. It is worth sending to review because the mechanism is explicit, the effect size is large across many models, and the practical constraint it targets is real. The central empirical claim looks falsifiable once the benchmark and intervention details are checked.

Referee Report

1 major / 0 minor

Summary. The manuscript claims that first-pass failures in power-system code generation are dominated by structured API-knowledge boundary errors rather than reasoning shortfalls. It introduces PowerCodeBench (an execution-validated benchmark pairing natural-language queries with pandapower code and numerical ground truth), an L0-L3 documentation-driven probing procedure to measure per-model API knowledge profiles, and a boundary-aware intervention (query-side demand estimation + targeted documentation injection + routed correction). On a frozen 2,000-task release, the intervention is reported to raise accuracy by 32-56 points for every evaluated open-weight model ≥7B and every commercial API tested, while preserving the full-context accuracy ceiling at 41% of the prompt-token cost.

Significance. If the empirical results hold, the work supplies a practical, deployment-time method to improve reliability of open-weight LLMs for grid-analysis workflows without fine-tuning or cloud inference. The uniform gains across model scales (including 70B-120B models matching commercial mid-tier performance) and the explicit targeting of hallucinated function names, misused parameters, and mishandled result tables constitute a concrete contribution to on-premise LLM deployment in regulated domains.

major comments (1)

[Abstract] The abstract asserts large accuracy gains and benchmark validation but supplies no experimental details, error bars, dataset construction steps, or statistical tests; only the abstract is available, so the support for the central claim cannot be verified.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. The sole major comment concerns the abstract's lack of experimental details and the fact that only the abstract appears to have been available. We address this point below; the full manuscript supplies all requested information in the body and appendices.

Referee: [Abstract] The abstract asserts large accuracy gains and benchmark validation but supplies no experimental details, error bars, dataset construction steps, or statistical tests; only the abstract is available, so the support for the central claim cannot be verified.

Authors: We agree the abstract is concise by design and omits granular experimental details, which is standard to keep abstracts high-level. The full manuscript details dataset construction (Section 3: execution-validated pairing of queries with pandapower code and numerical ground truth), error bars and statistical tests (Section 4 and Appendix B: mean accuracy with standard deviations over runs plus significance testing), and the full evaluation protocol. The observation that 'only the abstract is available' indicates the complete manuscript was not supplied in the review package; we ask the editor to provide the full text. No revision to the abstract is planned, as expansion would violate length norms without improving clarity. Revision: no.

Circularity Check

0 steps flagged

No significant circularity

The paper reports an empirical study: it constructs PowerCodeBench as an execution-validated benchmark, runs an L0-L3 probing procedure to measure API knowledge profiles, applies a boundary-aware intervention (demand estimation + documentation injection + routed correction), and measures accuracy gains on a frozen 2,000-task release. No equations, fitted parameters, or derivations are present; the accuracy improvements are direct experimental measurements on held-out tasks. No self-citations appear in the provided text, and the central premise (that first-pass errors are dominated by structured API-knowledge failures) is tested by the intervention mechanism itself rather than assumed via prior results. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Based solely on the abstract; full paper may contain additional parameters or assumptions not visible here.

axioms (1)

domain assumption: First-pass failures in power-system code generation are dominated by structured API-knowledge boundary errors rather than reasoning limitations.
Explicitly stated as the motivating observation in the abstract.

invented entities (2)

PowerCodeBench (no independent evidence)
purpose: Execution-validated benchmark generator pairing natural-language operator queries with pandapower code and numerical ground truth.
Newly introduced benchmark described in the abstract.
L0-L3 documentation-driven probing procedure (no independent evidence)
purpose: Measures per-model API knowledge profiles.
New procedure introduced in the abstract.

pith-pipeline@v0.9.1-grok · 5832 in / 1353 out tokens · 35740 ms · 2026-06-28T21:23:30.110353+00:00 · methodology

Reference graph

Works this paper leans on

50 extracted references · 34 canonical work pages · 7 internal anchors

[1]

Donti and J

Priya L. Donti and J. Zico Kolter. Machine learning for sustainable energy systems.Annual Review of Environment and Resources, 46:719–747, 2021. doi: 10.1146/annurev-environ-020220-061831

work page doi:10.1146/annurev-environ-020220-061831 2021
[2]

Lopez Garcia, Christophe Ballif, and Matthias Galus

Fabian Heymann, Hugo Quest, Tania B. Lopez Garcia, Christophe Ballif, and Matthias Galus. Reviewing 40 years of artificial intelligence applied to power systems—a taxonomic perspective.Energy and AI, 15:100322,
[3]

doi: 10.1016/j.egyai.2023.100322

work page doi:10.1016/j.egyai.2023.100322 2023
[4]

Large foundation models for power systems

Chenghao Huang, Siyang Li, Ruohong Liu, Hao Wang, and Yize Chen. Large foundation models for power systems. In2024 IEEE Power & Energy Society General Meeting (PESGM), pages 1–5, 2024. doi: 10.1109/ PESGM51994.2024.10688670

work page arXiv 2024
[5]

Thatte, Na Li, and Le Xie

Subir Majumder, Lin Dong, Fatemeh Doudi, Yuting Cai, Chao Tian, Dileep Kalathil, Kevin Ding, Anupam A. Thatte, Na Li, and Le Xie. Exploring the capabilities and limitations of large language models in the electric energy sector.Joule, 8(6):1544–1549, 2024. doi: 10.1016/j.joule.2024.05.009

work page doi:10.1016/j.joule.2024.05.009 2024
[6]

Thurner, A

Leon Thurner, Alexander Scheidler, Florian Schäfer, Jan-Hendrik Menke, Julian Dollichon, Friederike Meier, Steffen Meinecke, and Martin Braun. pandapower—an open-source python tool for convenient modeling, analysis, and optimization of electric power systems.IEEE Transactions on Power Systems, 33(6):6510–6521, 2018. doi: 10.1109/TPWRS.2018.2829021. 40

work page doi:10.1109/tpwrs.2018.2829021 2018
[7]

AI in power systems: A systematic review of key matters of concern.Energy Informatics, 8:76, 2025

Felipe Henao, Robert Edgell, Ambar Sharma, and Jeffrey Olney. AI in power systems: A systematic review of key matters of concern.Energy Informatics, 8:76, 2025. doi: 10.1186/s42162-025-00529-1

work page doi:10.1186/s42162-025-00529-1 2025
[8]

Secure and trustworthy energy systems: A four-layer threat model and defense-in-depth framework.Energy, 344:140027, 2026

Yuheng Cheng, Xiyuan Zhou, Huan Zhao, Gaoqi Liang, Fushuan Wen, and Junhua Zhao. Secure and trustworthy energy systems: A four-layer threat model and defense-in-depth framework.Energy, 344:140027, 2026. doi: 10.1016/j.energy.2026.140027

work page doi:10.1016/j.energy.2026.140027 2026
[9]

CodeUpdateArena: Bench- marking knowledge editing on API updates.ArXiv, abs/2407.06249, 2024

Zeyu Leo Liu, Shrey Pandit, Xi Ye, Eunsol Choi, and Greg Durrett. CodeUpdateArena: Benchmarking knowledge editing on API updates.arXiv preprint arXiv:2407.06249, 2024

work page arXiv 2024
[10]

LibEvolutionEval: A benchmark and study for version-specific code generation

Sachit Kuhar, Wasi Uddin Ahmad, Zijian Wang, Nihal Jain, Haifeng Qian, Baishakhi Ray, Murali Krishna Ramanathan, Xiaofei Ma, and Anoop Deoras. LibEvolutionEval: A benchmark and study for version-specific code generation. InProceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech...

work page arXiv 2025
[11]

Identifying and mitigating API misuse in large language models.arXiv preprint arXiv:2503.22821, 2025

Terry Yue Zhuo, Junda He, Jiamou Sun, Zhenchang Xing, David Lo, John Grundy, and Xiaoning Du. Identifying and mitigating API misuse in large language models.arXiv preprint arXiv:2503.22821, 2025

work page arXiv 2025
[12]

Towards mitigating API hallucination in code generated by LLMs with hierarchical dependency aware.arXiv preprint arXiv:2505.05057, 2025

Yujia Chen, Mingyu Chen, Cuiyun Gao, Zhihan Jiang, Zhongqi Li, and Yuchi Ma. Towards mitigating API hallucination in code generated by LLMs with hierarchical dependency aware.arXiv preprint arXiv:2505.05057, 2025

work page arXiv 2025
[13]

Gorilla: Large Language Model Connected with Massive APIs

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs.arXiv preprint arXiv:2305.15334, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Gridmind: Llms-powered agents for power system analysis and operations.arXiv preprint arXiv:2509.02494, 2025

Hongwei Jin, Kibaek Kim, and Jonghwan Kwon. Gridmind: Llms-powered agents for power system analysis and operations.arXiv preprint arXiv:2509.02494, 2025

work page arXiv 2025
[15]

Poweragent: A road map toward agentic intelligence in power systems: Foundation model, model context protocol, and workflow.IEEE Power & Energy Magazine, 23(5):93–101, 2025

Qian Zhang and Le Xie. Poweragent: A road map toward agentic intelligence in power systems: Foundation model, model context protocol, and workflow.IEEE Power & Energy Magazine, 23(5):93–101, 2025. doi: 10.1109/MPE.2025.3579718

work page doi:10.1109/mpe.2025.3579718 2025
[16]

A large language model for advanced power dispatch.Scientific Reports, 15:8925, 2025

Yuheng Cheng, Huan Zhao, Xiyuan Zhou, Junhua Zhao, Yuji Cao, Chao Yang, and Xinlei Cai. A large language model for advanced power dispatch.Scientific Reports, 15:8925, 2025. doi: 10.1038/s41598-025-91940-x

work page doi:10.1038/s41598-025-91940-x 2025
[17]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[18]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[19]

Measuring coding challenge competence with APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. In Advances in Neural Information Processing Systems, volume 34, pages 20389–20403, 2021

2021
[20]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world github issues? InInternational Conference on Learning Representations, 2024

2024
[21]

DS-1000: A natural and reliable benchmark for data science code generation

Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. DS-1000: A natural and reliable benchmark for data science code generation. In Proceedings of the 40th International Conference on Machine Learning, pages 18319–18345, 2023. 41

2023
[22]

Scicode: A research coding benchmark curated by scientists

Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, et al. Scicode: A research coding benchmark curated by scientists. InAdvances in Neural Information Processing Systems, 2024

2024
[23]

Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. InInternational Conference on Learning Representations, 2025

2025
[24]

Elecbench: A power dispatch evaluation benchmark for large language models.arXiv preprint arXiv:2407.05365, 2024

Xiyuan Zhou, Huan Zhao, Yuheng Cheng, Yuji Cao, Gaoqi Liang, Guolong Liu, Wenxuan Liu, Yan Xu, and Junhua Zhao. Elecbench: A power dispatch evaluation benchmark for large language models.arXiv preprint arXiv:2407.05365, 2024

work page arXiv 2024
[25]

Abhyankar

Shuangshuang Jin and Shrirang G. Abhyankar. Chatgrid: Power grid visualization empowered by large language model. In2024 IEEE Workshop on Energy Data Visualization (EnergyVis), pages 12–17, 2024. doi: 10.1109/ EnergyVis63885.2024.00007

work page arXiv 2024
[26]

Bonadia, Fernanda C

Rodrigo S. Bonadia, Fernanda C. L. Trindade, Walmir Freitas, and Bala Venkatesh. On the potential of ChatGPT to generate distribution systems for load flow studies using OpenDSS.IEEE Transactions on Power Systems, 38 (6):5965–5968, 2023. doi: 10.1109/TPWRS.2023.3315543

work page doi:10.1109/tpwrs.2023.3315543 2023
[27]

Enabling large language models to perform power system simulations with previously unseen tools: A case of Daline.arXiv preprint arXiv:2406.17215, 2024

Mengshuo Jia, Zeyu Cui, and Gabriela Hug. Enabling large language models to perform power system simulations with previously unseen tools: A case of Daline.arXiv preprint arXiv:2406.17215, 2024

work page arXiv 2024
[28]

Enhancing LLMs for power system simulations: A feedback-driven multi-agent framework.arXiv preprint arXiv:2411.16707, 2024

Mengshuo Jia, Zeyu Cui, and Gabriela Hug. Enhancing LLMs for power system simulations: A feedback-driven multi-agent framework.arXiv preprint arXiv:2411.16707, 2024

work page arXiv 2024
[29]

A systematic review of transformers and large language models in the energy sector: Towards agentic digital twins.Applied Energy, 401:126670, 2025

Gabriel Antonesi, Tudor Cioara, Ionut Anghel, Vasilis Michalakopoulos, Elissaios Sarmas, and Liana Toderean. A systematic review of transformers and large language models in the energy sector: Towards agentic digital twins.Applied Energy, 401:126670, 2025. doi: 10.1016/j.apenergy.2025.126670

work page doi:10.1016/j.apenergy.2025.126670 2025
[30]

Large language models meet energy systems: Opportunities, challenges, and future perspectives.Applied Energy, 403:127076, 2026

Chaobo Zhang, Jian Zhang, Jie Lu, and Yang Zhao. Large language models meet energy systems: Opportunities, challenges, and future perspectives.Applied Energy, 403:127076, 2026. doi: 10.1016/j.apenergy.2025.127076

work page doi:10.1016/j.apenergy.2025.127076 2026
[31]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems, volume 36, 2023

2023
[32]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations, 2023

2023
[33]

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-bank: A comprehensive benchmark for tool-augmented LLMs.arXiv preprint arXiv:2304.08244, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world APIs.arXiv preprint arXiv:2307.16789, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Toolace: Winning the points of LLM function calling

Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, et al. Toolace: Winning the points of LLM function calling. InInternational Conference on Learning Representations, 2025

2025
[36]

Retrieval-augmented generation for knowledge-intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. InAdvances in Neural Information Processing Systems, volume 33, 2020

2020
[37]

Repocoder: Repository-level code completion through iterative retrieval and generation

Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. Repocoder: Repository-level code completion through iterative retrieval and generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. 42

2023
[38]

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park. Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7036...

work page doi:10.18653/v1/2024.naacl-long.389 2024
[39]

Teaching Large Language Models to Self-Debug

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Sy...

2023
[41]

Anderson, David R

Lorin W. Anderson, David R. Krathwohl, Peter W. Airasian, Kathleen A. Cruikshank, Richard E. Mayer, Paul R. Pintrich, James Raths, and Merlin C. Wittrock.A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives. Longman, New York, 2001

2001
[42]

Paris, Marjorie Y

Scott G. Paris, Marjorie Y. Lipson, and Karen K. Wixson. Becoming a strategic reader.Contemporary Educational Psychology, 8(3):293–316, 1983. doi: 10.1016/0361-476X(83)90018-8

work page doi:10.1016/0361-476x(83)90018-8 1983
[43]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[44]

Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models.Transactions on Machine Learning Research, 2022

2022
[45]

The probabilistic relevance framework: BM25 and beyond.Foundations and Trends in Information Retrieval, 3(4):333–389, 2009

Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond.Foundations and Trends in Information Retrieval, 3(4):333–389, 2009. doi: 10.1561/1500000019

work page doi:10.1561/1500000019 2009
[46]

Sentence-BERT: Sentence embeddings using siamese BERT-networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3982–3992, 2019

[47]

Correcting sample selection bias by unlabeled data

Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard Schölkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, volume 19, pages 601–608, 2007

[48]

MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education

Ray D. Zimmerman, Carlos E. Murillo-Sánchez, and Robert J. Thomas. MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education. IEEE Transactions on Power Systems, 26(1):12–19, 2011. doi: 10.1109/TPWRS.2010.2051168

[49]

An open source platform for collaborating on smart grid research

Roger C. Dugan and Thomas E. McDermott. An open source platform for collaborating on smart grid research. In 2011 IEEE Power and Energy Society General Meeting, pages 1–7, 2011. doi: 10.1109/PES.2011.6039829

[50]

Hybrid symbolic-numeric framework for power system modeling and analysis

Hantao Cui, Fangxing Li, and Kevin Tomsovic. Hybrid symbolic-numeric framework for power system modeling and analysis. IEEE Transactions on Power Systems, 36(2):1373–1384, 2021. doi: 10.1109/TPWRS.2020.3017019

