LLMs achieve strong initial accuracy on code output prediction but frequently change their answers under semantics-preserving mutations, with drops of up to 70%. Human review finds flawed reasoning in 10-50% of the cases where the predicted output is correct.
Fields: cs.SE (1)
Years: 2025 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers
Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?
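To make "semantics-preserving mutation" concrete, here is a minimal sketch in Python. It assumes variable renaming as the mutation; the listing does not say which mutations the paper applies, and the snippet, the `RenameVariables` class, and the `mutate` and `run` helpers are all hypothetical names chosen for illustration. The point is only that the program's output is unchanged, so a robust model should predict the same output for both versions.

```python
import ast
import contextlib
import io

# Hypothetical program whose printed output a model would be asked to predict.
ORIGINAL = """
total = 0
for i in range(5):
    total += i * i
print(total)
"""


class RenameVariables(ast.NodeTransformer):
    """Rename identifiers according to a mapping (an assumed example mutation)."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Names missing from the mapping (e.g. print, range) are left unchanged.
        node.id = self.mapping.get(node.id, node.id)
        return node


def mutate(source, mapping):
    """Return a renamed variant of `source` (requires Python 3.9+ for ast.unparse)."""
    tree = RenameVariables(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


def run(source):
    """Execute `source` in a fresh namespace and capture what it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()


MUTATED = mutate(ORIGINAL, {"total": "acc", "i": "k"})

# The mutation changes the text of the program but not its behavior.
assert MUTATED != ORIGINAL
assert run(ORIGINAL) == run(MUTATED)
```

In a robustness check of this kind, a model would be queried on both `ORIGINAL` and `MUTATED`, and a changed prediction on the mutated version would count as a failure even though the ground-truth output is identical.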