ReFEree evaluates factual consistency in real-world code summaries at segment level using reference-free criteria and dependency context, achieving 15-18% higher correlation with human judgments than prior state-of-the-art methods on a new benchmark.
Since LLM-Judge, G-Eval, and FactScore are originally designed as evaluation methods for the NLP domain, we modify their prompts to ensure applicability to the code domain.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization