EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

2025 · cs.AI · arXiv 2509.17677

5 Pith papers cite this work. Polarity classification is still being indexed.


abstract

Large language models (LLMs) have shown strong performance on mathematical reasoning under well-defined conditions. However, real-world engineering problems involve uncertainty, context, and open-ended settings that extend beyond symbolic computation. Existing benchmarks largely focus on well-defined or abstract reasoning and therefore fail to capture these complexities. We introduce EngiBench, a hierarchical benchmark designed to evaluate LLMs on solving engineering problems. It spans three levels of increasing difficulty (foundational knowledge retrieval, contextual reasoning, and open-ended modeling) and covers diverse engineering subfields. To facilitate a deeper understanding of model performance, we systematically rewrite each problem into three controlled variants (perturbed, knowledge-enhanced, and math abstraction), enabling us to separately evaluate the model's robustness, domain-specific knowledge, and mathematical reasoning abilities. Experimental results show clear performance stratification across difficulty levels: model accuracy declines with task complexity, degrades under minor perturbations, and remains substantially below human performance on high-level engineering tasks. These findings reveal that current LLMs still lack the high-level reasoning needed for real-world engineering, highlighting the need for future models with deeper and more reliable problem-solving capabilities. Our source code and data are available at https://github.com/AI4Engi/EngiBench.

representative citing papers

ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

cs.AI · 2026-03-25 · accept · novelty 7.0

ThermoQA benchmark shows top LLMs reach 92-94% overall on thermodynamics problems but degrade sharply on full cycle analysis, confirming that property knowledge does not equal reasoning ability.

IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

cs.AI · 2026-05-11 · conditional · novelty 6.0 · 3 refs

IndustryBench is a standards-grounded Chinese benchmark that exposes LLMs' persistent gaps in industrial terminology, safety compliance, and parameter accuracy, with safety checks reshuffling model rankings.

How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling

cs.CL · 2026-04-06 · unverdicted · novelty 5.0

LLMs exhibit a persistent comprehension-execution gap in end-to-end mathematical modeling tasks, with a new stage-wise framework showing better alignment to human expert judgments than prior schemes.

Enhancing Large Language Model-Based Systems for End-to-End Circuit Analysis Problem Solving

cs.CY · 2025-12-10 · conditional · novelty 5.0

Hybrid pipeline using YOLO vision and ngspice verification raises circuit analysis accuracy from Gemini's 79.52% baseline to 97.59%, with similar gains on hand-drawn diagrams.

EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

cs.AI · 2026-05-19

citing papers explorer

Showing 5 of 5 citing papers.

ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models · cs.AI · 2026-03-25 · accept · none · ref 9 · internal anchor
IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs · cs.AI · 2026-05-11 · conditional · none · ref 4 · 3 links · internal anchor
How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling · cs.CL · 2026-04-06 · unverdicted · none · ref 1 · internal anchor
Enhancing Large Language Model-Based Systems for End-to-End Circuit Analysis Problem Solving · cs.CY · 2025-12-10 · conditional · none · ref 26 · internal anchor
EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design · cs.AI · 2026-05-19 · unreviewed · ref 40 · internal anchor
