Frontier LLMs score 73-97% on a novel relativity concept inventory but fail entirely on a few items due to visual misinterpretation, with more consistent errors than students.
none"), used for the qualitative error analysis in sections 4.2 and 4.3. The columns (med) and (high) report GPT-5.2 accuracy at reasoning_effort =
1 Pith paper cites this work.
Fields: physics.ed-ph (1). Years: 2026 (1). Verdicts: ACCEPT (1).
Representative citing paper:
Performance and failure modes of AI chatbots on a novel concept inventory on relativity in classical mechanics