EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving
5 Pith papers cite this work.
abstract
Large language models (LLMs) have shown strong performance on mathematical reasoning under well-defined conditions. However, real-world engineering problems involve uncertainty, context, and open-ended settings that extend beyond symbolic computation. Existing benchmarks largely focus on well-defined or abstract reasoning and therefore fail to capture these complexities. We introduce EngiBench, a hierarchical benchmark designed to evaluate LLMs on solving engineering problems. It spans three levels of increasing difficulty (foundational knowledge retrieval, contextual reasoning, and open-ended modeling) and covers diverse engineering subfields. To facilitate a deeper understanding of model performance, we systematically rewrite each problem into three controlled variants (perturbed, knowledge-enhanced, and math abstraction), enabling us to separately evaluate the model's robustness, domain-specific knowledge, and mathematical reasoning abilities. Experimental results show clear performance stratification across difficulty levels: model accuracy declines with task complexity, degrades under minor perturbations, and remains substantially below human performance on high-level engineering tasks. These findings reveal that current LLMs still lack the high-level reasoning needed for real-world engineering, highlighting the need for future models with deeper and more reliable problem-solving capabilities. Our source code and data are available at https://github.com/AI4Engi/EngiBench.
citing papers explorer
- ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models
  ThermoQA benchmark shows top LLMs reach 92-94% overall on thermodynamics problems but degrade sharply on full cycle analysis, confirming that property knowledge does not equal reasoning ability.
- IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs
  IndustryBench is a standards-grounded Chinese benchmark that exposes LLMs' persistent gaps in industrial terminology, safety compliance, and parameter accuracy, with safety checks reshuffling model rankings.
- How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling
  LLMs exhibit a persistent comprehension-execution gap in end-to-end mathematical modeling tasks, with a new stage-wise framework showing better alignment to human expert judgments than prior schemes.
- Enhancing Large Language Model-Based Systems for End-to-End Circuit Analysis Problem Solving
  Hybrid pipeline using YOLO vision and ngspice verification raises circuit analysis accuracy from Gemini's 79.52% baseline to 97.59%, with similar gains on hand-drawn diagrams.
- EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design