The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation
1 Pith paper cites this work.
Fields: cs.SE (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Understanding the Limits of Automated Evaluation for Code Review Bots in Practice
Automated LLM-based evaluation of code review bot comments achieves only moderate agreement (0.44-0.62) with developer labels in an industrial dataset because developer decisions reflect contextual constraints beyond comment quality.