Enhancing Large Language Model-Based Systems for End-to-End Circuit Analysis Problem Solving

arXiv: 2512.10159 · v2 · submitted 2025-12-10 · cs.CY · cs.AI · cs.HC

Liangliang Chen, Weiyu Sun, Huiru Xie, Yongnuo Cai, Ying Zhang

Pith reviewed 2026-05-16 22:49 UTC · model grok-4.3

classification cs.CY · cs.AI · cs.HC

keywords circuit analysis · large language models · vision detection · simulation verification · engineering education · hand-drawn diagrams · problem solving

The pith

A pipeline that adds source detection and simulation verification raises Gemini's accuracy on undergraduate circuit problems from 79.52 percent to 97.59 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that Gemini 2.5 Pro on its own gets roughly one in five benchmark problems wrong, mostly by misreading circuit diagrams or slipping in the required calculations. It shows these errors can be largely corrected by inserting a vision detector that isolates each source so the model can re-read its polarity from a cropped image, followed by a simulation loop that flags mismatches and prompts the model to revise its work. The resulting system reaches 97.59 percent accuracy on standard undergraduate problems and stays above 93 percent on hand-drawn variants, suggesting a practical route to reliable automated solvers for engineering tasks.

Core claim

The authors build an end-to-end framework on Gemini. A fine-tuned YOLO detector and OpenCV processing first isolate the voltage and current sources in a circuit diagram so that Gemini can re-identify their polarities from the cropped images. Candidate solutions are then checked against an ngspice simulation and iteratively refined whenever discrepancies appear. The framework achieves 97.59 percent accuracy on benchmark undergraduate problems and 93.94 to 95.45 percent on hand-drawn diagram variations.
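The cropping step can be pictured with a short sketch. It assumes the detector returns each box as (x1, y1, x2, y2) pixel coordinates and that a fixed margin is added so polarity marks drawn beside the symbol stay inside the crop. The function name, the margin value, and the box format are illustrative assumptions, not details taken from the paper.

```python
def padded_crop_box(box, margin, width, height):
    """Expand a detector box by `margin` pixels and clamp it to the image bounds."""
    x1, y1, x2, y2 = box
    left = max(0, x1 - margin)
    top = max(0, y1 - margin)
    right = min(width, x2 + margin)
    bottom = min(height, y2 + margin)
    return left, top, right, bottom


# Hypothetical source box near the left edge of a 640x480 diagram.
print(padded_crop_box((5, 100, 60, 180), margin=20, width=640, height=480))
# -> (0, 80, 80, 200)
```

With an image held as a NumPy array, as OpenCV loads it, the crop itself would be the slice `image[top:bottom, left:right]`.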

What carries the argument

The ngspice-driven verification loop that compares LLM-generated solutions against simulation outputs and triggers targeted refinements when mismatches are detected.
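A minimal sketch of the comparison at the heart of such a loop is shown here. It assumes both the LLM solution and the simulation output have already been parsed into dictionaries keyed by quantity name; the key names, the tolerances, and the function name are hypothetical, since the source does not describe how the comparison is implemented.

```python
import math


def find_mismatches(llm_values, sim_values, rel_tol=0.01, abs_tol=1e-6):
    """Return the quantities where the two sources disagree or one side is missing."""
    mismatched = []
    for name in sorted(set(llm_values) | set(sim_values)):
        a = llm_values.get(name)
        b = sim_values.get(name)
        # A missing value counts as a mismatch; otherwise compare within tolerance.
        if a is None or b is None or not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatched.append(name)
    return mismatched


# Hypothetical values: the LLM assumed the wrong current direction through R1.
llm = {"v(out)": 4.0, "i(r1)": -0.002}
sim = {"v(out)": 4.0, "i(r1)": 0.002}
print(find_mismatches(llm, sim))  # -> ['i(r1)']
```

A relative tolerance is used so that rounding in a hand calculation is not flagged, while a sign error of the kind the paper calls a reasoning hallucination still is.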

Load-bearing premise

Differences between the LLM-proposed solution and ngspice simulation results reliably indicate mistakes in the LLM output rather than limitations of the simulation model or unmodeled real-world effects.

What would settle it

A test set of circuits where laboratory measurements match the LLM solution but diverge from ngspice predictions, or where ngspice matches measurements but the LLM solution differs, would show whether the verification loop correctly attributes errors.
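The attribution test described above reduces to a three-way comparison per circuit. The sketch below is one hypothetical way to tabulate it; the labels, the 2 percent tolerance, and the function name are assumptions made for illustration.

```python
import math


def attribute_discrepancy(measured, llm, sim, rel_tol=0.02):
    """Classify one quantity by which of the LLM and the simulator matches the bench value."""
    def close(a, b):
        return math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-9)

    llm_ok = close(llm, measured)
    sim_ok = close(sim, measured)
    if llm_ok and sim_ok:
        return "both agree with measurement"
    if llm_ok:
        return "simulation diverges"      # the loop would wrongly push a revision
    if sim_ok:
        return "LLM solution diverges"    # the loop would rightly push a revision
    return "neither matches measurement"


# Hypothetical bench reading of 4.98 V against two predictions.
print(attribute_discrepancy(measured=4.98, llm=5.0, sim=1.667))
# -> simulation diverges
```

Counting how often each label occurs over a measured test set would show how far the simulator can be trusted as the reference.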

Figures

Figures reproduced from arXiv: 2512.10159 by Huiru Xie, Liangliang Chen, Weiyu Sun, Ying Zhang, Yongnuo Cai.

**Figure 4.** Framework of the proposed circuit problem solving method

**Figure 5.** Precision–Recall curve of the best fine-tuned

**Figure 6.** Circuit diagram samples from the dataset used for YOLO model fine-tuning

**Figure 7.** Circuit diagram samples illustrating the detection of independent and dependent

Original abstract

LLMs have demonstrated strong performance in data-rich domains such as programming, yet their reliability in engineering tasks remains limited. Circuit analysis, which requires multimodal understanding and precise mathematical reasoning, highlights these challenges. Although Gemini 2.5 Pro shows improved capabilities in diagram interpretation and analog-circuit reasoning, it still struggles to consistently produce correct solutions when given both textual problem descriptions and circuit diagrams. Meanwhile, engineering education demands scalable AI tools capable of generating accurate solutions for applications such as automated homework feedback. This paper presents an enhanced end-to-end circuit problem-solving framework built upon Gemini. We first conduct a systematic benchmark on undergraduate circuit problems and identify two key failure modes: 1) circuit-recognition hallucinations, particularly incorrect source polarity detection, and 2) reasoning-process hallucinations, such as incorrect current direction assumptions. To address recognition errors, we integrate a fine-tuned YOLO detector and OpenCV-based processing to isolate voltage and current sources, enabling Gemini to accurately re-identify source polarities from cropped images. To mitigate reasoning errors, we introduce an ngspice-driven verification loop, in which simulation discrepancies trigger iterative solution refinement with optional HITL feedback. Experimental results demonstrate that the proposed pipeline achieves 97.59% accuracy, substantially outperforming Gemini's baseline of 79.52%. Furthermore, on four variations of hand-drawn circuit diagrams, accuracy improves from 56.06%–71.21% to 93.94%–95.45% with statistically significant gains. These results highlight the robustness, scalability, and practical applicability of the proposed framework for engineering education and real-world circuit analysis tasks.
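To see why a single misread source polarity is so damaging, consider a hypothetical circuit that is not taken from the paper: two voltage sources feed a shared node through R1 and R2, and R3 ties that node to ground. The sketch solves Kirchhoff's current law at the node for the correct polarity and for a flipped second source; all component values are made up for illustration.

```python
def node_voltage(v1, v2, r1, r2, r3):
    """Solve KCL at the shared node: (v - v1)/r1 + (v - v2)/r2 + v/r3 = 0."""
    return (v1 / r1 + v2 / r2) / (1 / r1 + 1 / r2 + 1 / r3)


# Illustrative values: 10 V and 5 V sources, three 1 kOhm resistors.
correct = node_voltage(10.0, 5.0, 1e3, 1e3, 1e3)
flipped = node_voltage(10.0, -5.0, 1e3, 1e3, 1e3)  # second source read backwards
print(f"{correct:.3f} V vs {flipped:.3f} V")
# -> 5.000 V vs 1.667 V
```

The flipped reading changes the magnitude of the answer, not just its sign, so every downstream step can be carried out correctly and the result is still wrong.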

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gets real accuracy lifts on circuit problems by bolting YOLO source isolation and an ngspice check loop onto Gemini, but the loop's assumption that every mismatch is an LLM mistake needs more checking.

The core advance here is a working end-to-end system. A fine-tuned YOLO detector plus OpenCV crops each source so Gemini can re-read its polarity, and an iterative ngspice verification loop then catches and repairs reasoning slips. On the reported undergraduate set this moves accuracy from 79.52% to 97.59%, and on the four hand-drawn variants it moves the range from 56.06–71.21% up to 93.94–95.45%. Those numbers are the clearest thing the work delivers, and the choice to benchmark against a plain Gemini baseline makes the improvement easy to see. The approach is also straightforward to implement for anyone who already has access to the two external tools, which counts as a practical plus for education-focused groups.

Dataset size, exact problem mix, and full error breakdown are still missing from what is shown, so the headline percentages are harder to generalize than they first appear. The bigger open question is whether every ngspice mismatch truly signals an LLM error rather than a netlist extraction glitch or an ideal-model limitation; the paper treats the simulator as an oracle without showing how often those other sources of difference occur. That assumption is load-bearing for the iterative loop and should be stress-tested with manual review of the mismatch cases.

The work is aimed at people who want to build reliable homework or tutoring tools for basic circuit courses rather than at researchers looking for new theoretical insight. It is solid enough on the empirical side to merit a serious referee round, mainly so the dataset and mismatch analysis can be filled in.

Referee Report

2 major / 2 minor

Summary. The paper proposes an enhanced pipeline for end-to-end circuit analysis problem solving built on Gemini 2.5 Pro. It augments the LLM with a fine-tuned YOLO detector plus OpenCV processing to correct circuit-recognition hallucinations (especially source polarity) and an ngspice-driven verification loop that triggers iterative refinement when simulation discrepancies are detected. The central claims are a jump from 79.52% to 97.59% accuracy on undergraduate problems and from 56.06–71.21% to 93.94–95.45% on four variants of hand-drawn diagrams, with statistical significance reported.

Significance. If the empirical results hold under a more complete experimental protocol, the work would be significant for engineering education. It demonstrates a practical, modular way to combine vision models and circuit simulators with LLMs to reach high reliability on a task that pure LLMs still handle poorly. The reliance on externally validated components (YOLO, ngspice) rather than purely learned parameters is a methodological strength.

major comments (2)

[ngspice-driven verification loop] The ngspice verification loop (described in the proposed pipeline and Experimental Results) treats every mismatch between the LLM-proposed voltages/currents and ngspice output as an LLM reasoning error. This assumption is load-bearing for the 97.59% accuracy claim yet is not tested against cases where ngspice (ideal models) diverges from the analytical solution because of netlist extraction errors, unmodeled parasitics, or polarity re-identification failures on cropped images.
[Experimental Results] Experimental Results section: the manuscript reports accuracy figures and statistical significance but provides no dataset size, problem-type distribution, full experimental protocol, or breakdown of remaining error types. Without these, it is impossible to assess whether the reported gains generalize or whether the test set is representative of undergraduate circuit problems.

minor comments (2)

[Experimental Results] The four hand-drawn diagram variations are mentioned but not illustrated or described in sufficient detail to allow replication or to understand what visual perturbations were introduced.
[Proposed Framework] Notation for the iterative refinement process (e.g., how many iterations are allowed, when HITL is invoked) is introduced informally and would benefit from a concise pseudocode or flowchart.
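One shape the requested pseudocode could take is sketched below. Everything specific in it is an assumption: the cap of three rounds, running the simulation once up front, passing only the mismatched quantity names back as feedback, and invoking the human reviewer only after the cap is reached. The callables `propose`, `simulate`, `find_mismatches`, and `ask_human` are hypothetical stand-ins for the LLM call, the ngspice run, the comparison, and the HITL step.

```python
def solve_with_verification(propose, simulate, find_mismatches,
                            max_rounds=3, ask_human=None):
    """Propose, check against the simulation, and refine until consistent or out of rounds."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    reference = simulate()   # assumed: one simulation run serves as the reference
    feedback = None          # no feedback on the first attempt
    for round_index in range(1, max_rounds + 1):
        solution = propose(feedback)
        mismatches = find_mismatches(solution, reference)
        if not mismatches:
            return solution, round_index, "verified"
        feedback = mismatches  # assumed: only the disputed quantity names go back
    if ask_human is not None:
        return ask_human(solution, mismatches), max_rounds, "human-reviewed"
    return solution, max_rounds, "unverified"


# Stub run: the first attempt has a sign error, the second matches the simulation.
attempts = iter([{"v(out)": -4.0}, {"v(out)": 4.0}])
result = solve_with_verification(
    propose=lambda feedback: next(attempts),
    simulate=lambda: {"v(out)": 4.0},
    find_mismatches=lambda sol, ref: [
        k for k in ref if abs(sol.get(k, float("inf")) - ref[k]) > 1e-3
    ],
)
print(result)  # -> ({'v(out)': 4.0}, 2, 'verified')
```

Returning the round count and a status string alongside the solution makes it possible to report how many problems needed refinement or human review, which is the information the referee comment asks for.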

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of our evaluation protocol. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

Referee: The ngspice verification loop (described in the proposed pipeline and Experimental Results) treats every mismatch between the LLM-proposed voltages/currents and ngspice output as an LLM reasoning error. This assumption is load-bearing for the 97.59% accuracy claim yet is not tested against cases where ngspice (ideal models) diverges from the analytical solution because of netlist extraction errors, unmodeled parasitics, or polarity re-identification failures on cropped images.

Authors: We agree that the verification loop implicitly assumes discrepancies arise from LLM reasoning rather than simulator or extraction issues. Our undergraduate problems use ideal models that match expected analytical solutions, but we acknowledge potential extraction or polarity edge cases. We will revise the manuscript to explicitly state these modeling assumptions, add a dedicated limitations subsection on verification-loop failure modes, and include a small set of controlled examples demonstrating when the loop correctly or incorrectly triggers. revision: yes
Referee: Experimental Results section: the manuscript reports accuracy figures and statistical significance but provides no dataset size, problem-type distribution, full experimental protocol, or breakdown of remaining error types. Without these, it is impossible to assess whether the reported gains generalize or whether the test set is representative of undergraduate circuit problems.

Authors: We thank the referee for this observation. The current Experimental Results section is too concise. We will expand it to report the exact dataset sizes (50 standard problems and 33 hand-drawn variants), the problem-type distribution (e.g., DC resistive, AC phasor, transient), the full selection and annotation protocol, and a breakdown of the three residual errors. These additions will be included in the revised version. revision: yes
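The sizes quoted in this response can be cross-checked against the reported percentages. An accuracy printed to two decimals constrains how many graded items could have produced it. The sketch below searches for the smallest item count consistent with each group of percentages, assuming every item is scored simply right or wrong and every percentage in a group shares one denominator; both assumptions are mine, not the paper's.

```python
def smallest_item_count(percentages, max_items=1000):
    """Smallest n such that each percentage equals some k/n rounded to two decimals."""
    for n in range(1, max_items + 1):
        if all(
            any(abs(100 * k / n - p) < 0.005 for k in range(n + 1))
            for p in percentages
        ):
            return n
    return None  # nothing consistent found up to max_items


print(smallest_item_count([79.52, 97.59]))                # standard set -> 83
print(smallest_item_count([56.06, 71.21, 93.94, 95.45]))  # hand-drawn variants -> 66
```

Under these assumptions the smallest consistent counts are 83 and 66 graded items. The 50 and 33 quoted in the response would need some other scoring scheme, such as multi-part grading or partial credit, to reproduce the reported percentages.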

Circularity Check

0 steps flagged

No significant circularity; pipeline relies on external independent tools

The paper presents an engineering pipeline that augments Gemini with a fine-tuned YOLO detector, OpenCV post-processing for source polarity, and an ngspice simulation loop for iterative refinement. Accuracy figures (97.59% vs. 79.52% baseline; 93.94–95.45% on hand-drawn variants) are obtained by comparing final outputs against ground-truth solutions for undergraduate circuit problems. No derivation step equates a claimed result to its own inputs by construction, no parameters are fitted and then relabeled as predictions, and no load-bearing premise rests on self-citation chains. The ngspice verification treats the simulator as an external oracle whose discrepancies drive refinement; this is an empirical assumption about model fidelity rather than a self-definitional equivalence. The overall chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework depends on the established reliability of YOLO-style object detection for diagram elements and ngspice as an accurate circuit simulator, without introducing new free parameters, axioms beyond standard tool assumptions, or invented entities.

axioms (2)

domain assumption: A fine-tuned YOLO detector can reliably locate and isolate voltage and current sources in both printed and hand-drawn circuit diagrams. Invoked to correct recognition hallucinations.
domain assumption: ngspice simulation results provide a trustworthy external check on the correctness of proposed circuit solutions. Central to the verification and refinement loop.

