ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization
Pith reviewed 2026-05-10 15:28 UTC · model grok-4.3
The pith
ReFEree checks factual accuracy in long real-world code summaries without references by scoring inconsistencies at the segment level with dependency information.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReFEree defines factual inconsistency criteria specific to code summaries and evaluates them at the segment level using these criteria along with dependency information. These segment-level results are then aggregated into a fine-grained score. On a newly constructed benchmark containing human-annotated factual consistency labels for real-world code summaries, ReFEree records the highest correlation with human judgment among thirteen baselines.
What carries the argument
Segment-level evaluation that applies code-specific factual inconsistency criteria together with dependency checks before aggregation into an overall score.
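To make the mechanism concrete, here is a minimal sketch of segment-level checking with aggregation, assuming a hypothetical llm_check helper that asks an LLM whether one inconsistency criterion holds for one summary segment given the code and its dependency context. The criterion names follow the paper's appendix prompts, but the binary scoring and plain averaging are illustrative, not ReFEree's exact formulation.

```python
from statistics import mean

# Code-specific factual inconsistency criteria of the kind the paper defines
# (paraphrased from its appendix prompts).
CRITERIA = [
    "identifier mismatch",      # named function/class/variable not in the code
    "type mismatch",            # described return/variable type is wrong
    "functionality mismatch",   # described behavior differs from the code
    "irrelevant content",       # content unrelated to the input code
]

def llm_check(criterion: str, segment: str, code: str, deps: str) -> int:
    """Hypothetical LLM call: 1 if the criterion is absent (segment is
    consistent), 0 if it is violated."""
    raise NotImplementedError("backed by an LLM prompt in practice")

def score_summary(segments: list[str], code: str, deps: str):
    """Check every criterion for every segment, using the code plus its
    dependency context, then average segment scores into one overall score."""
    per_segment = []
    for seg in segments:
        checks = {c: llm_check(c, seg, code, deps) for c in CRITERIA}
        per_segment.append({"segment": seg, "checks": checks,
                            "score": mean(checks.values())})
    overall = mean(s["score"] for s in per_segment)
    return overall, per_segment
```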
If this is right
- ReFEree supplies both an overall score and per-segment diagnoses that identify exactly which parts of a summary contain inconsistencies (see the gating sketch after this list).
- The method works without any reference summary, removing the need for gold-standard text that is often unavailable for real codebases.
- The new benchmark dataset enables direct comparison of future evaluation methods on the same human-labeled real-world examples.
- Higher correlation with humans means automatic scores can more reliably guide selection or fine-tuning of code summarization models.
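A possible consumption pattern for those per-segment diagnoses, reusing the score_summary sketch above; the gate_summary name and the 0.8 threshold are invented for illustration, not taken from the paper.

```python
def gate_summary(segments: list[str], code: str, deps: str,
                 threshold: float = 0.8) -> bool:
    """Block a generated summary from shipping when its aggregate score falls
    below an arbitrary threshold, reporting which segments fail which
    criteria."""
    overall, diagnoses = score_summary(segments, code, deps)
    if overall >= threshold:
        return True  # summary passes the factual consistency gate
    for d in diagnoses:
        failed = [c for c, ok in d["checks"].items() if ok == 0]
        if failed:
            print(f"Flagged segment {d['segment']!r}: {', '.join(failed)}")
    return False
```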
Where Pith is reading between the lines
- Teams building production code-documentation tools could embed the segment-level checker to flag risky summaries before they reach users.
- The same dependency-aware segmentation idea might transfer to evaluating long-form outputs in related tasks such as commit-message generation or API documentation.
- Because the inconsistency criteria are manually specified, they will require periodic review whenever common coding patterns or documentation styles shift.
- Researchers could test whether replacing the fixed criteria with learned ones trained on the human annotations further raises correlation.
Load-bearing premise
The authors' hand-defined list of factual inconsistency types for code summaries, combined with segment-level dependency checks, matches human judgments of factual consistency across diverse real codebases.
What would settle it
A fresh human annotation study on summaries drawn from additional projects and programming languages: the premise fails if ReFEree's segment scores show substantially lower agreement with the new human labels than reported on the original benchmark, and holds if agreement stays comparable.
Original abstract
As Large Language Models (LLMs) have become capable of generating long and descriptive code summaries, accurate and reliable evaluation of factual consistency has become a critical challenge. However, previous evaluation methods are primarily designed for short summaries of isolated code snippets. Consequently, they struggle to provide fine-grained evaluation of multi-sentence functionalities and fail to accurately assess dependency context commonly found in real-world code summaries. To address this, we propose ReFEree, a reference-free and fine-grained method for evaluating factual consistency in real-world code summaries. We define factual inconsistency criteria specific to code summaries and evaluate them at the segment level using these criteria along with dependency information. These segment-level results are then aggregated into a fine-grained score. We construct a code summarization benchmark with human-annotated factual consistency labels. The evaluation results demonstrate that ReFEree achieves the highest correlation with human judgment among 13 baselines, improving 15-18% over the previous state-of-the-art. Our code and data are available at https://github.com/bsy99615/ReFEree.git.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ReFEree, a reference-free and fine-grained method for evaluating factual consistency in real-world LLM-generated code summaries. It hand-defines factual inconsistency criteria specific to code, scores summaries at the segment level using these criteria plus dependency information, aggregates to an overall score, constructs a new human-annotated benchmark, and reports that ReFEree achieves the highest correlation with human judgments among 13 baselines with a 15-18% improvement over prior state-of-the-art.
Significance. If the correlation gains hold after addressing potential circularity between the hand-defined criteria and the annotation process, the work would meaningfully advance evaluation of code summarization by handling longer, dependency-rich summaries that prior reference-based or snippet-level metrics cannot address. The public release of code and data is a positive contribution to reproducibility in this area.
major comments (2)
- [Abstract and benchmark construction section] The abstract and benchmark construction section do not describe the human annotation guidelines, inter-annotator agreement, or whether annotators were shown or primed with the same factual inconsistency criteria defined for ReFEree. Because the central claim rests on superior correlation with these human labels, any overlap would render the 15-18% improvement partly tautological rather than an independent validation.
- [Method section] The method section defines inconsistency criteria and segment-level scoring but provides no ablation or sensitivity analysis showing that the reported gains require the dependency information or the specific criteria; without this, it is unclear whether the improvement is driven by the core innovation or by other implementation choices.
minor comments (2)
- [Abstract] The abstract states 'improving 15-18% over the previous state-of-the-art' without naming the exact correlation coefficient (Pearson, Spearman, etc.) or identifying which of the 13 baselines constitutes the prior SOTA.
- [Results section] Figure and table captions should explicitly state the number of summaries, codebases, and annotators in the benchmark to allow readers to assess scale.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of transparency in our benchmark and the need to better isolate the contributions of our method components. We will revise the manuscript to address both major comments fully.
Point-by-point responses
-
Referee: [Abstract and benchmark construction section] The abstract and benchmark construction section do not describe the human annotation guidelines, inter-annotator agreement, or whether annotators were shown or primed with the same factual inconsistency criteria defined for ReFEree. Because the central claim rests on superior correlation with these human labels, any overlap would render the 15-18% improvement partly tautological rather than an independent validation.
Authors: We agree that these details are necessary to establish the independence of the human labels. In the revised manuscript, we will expand the benchmark construction section with: (1) the complete annotation guidelines provided to annotators, (2) the inter-annotator agreement statistics (Cohen's kappa and percentage agreement), and (3) an explicit statement that annotators received no exposure to the ReFEree-specific criteria. Annotators were instead instructed to identify factual inconsistencies based solely on whether summary segments were supported by the provided code and its dependency context, using their own expertise. This will confirm that the reported correlation gains reflect genuine alignment rather than circularity. revision: yes
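For concreteness, a self-contained sketch of the two agreement statistics this response promises, for two annotators assigning binary consistent/inconsistent labels; the example labels are invented.

```python
from collections import Counter

def percentage_agreement(a: list[int], b: list[int]) -> float:
    """Fraction of items on which the two annotators agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa = (p_o - p_e) / (1 - p_e), where p_e is the chance
    agreement implied by each annotator's marginal label frequencies."""
    n = len(a)
    p_o = percentage_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

ann1 = [1, 1, 0, 1, 0, 1, 1, 0]  # 1 = consistent, 0 = inconsistent
ann2 = [1, 0, 0, 1, 0, 1, 1, 1]
print(percentage_agreement(ann1, ann2))  # 0.75
print(cohens_kappa(ann1, ann2))          # ~0.467
```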
-
Referee: [Method section] The method section defines inconsistency criteria and segment-level scoring but provides no ablation or sensitivity analysis showing that the reported gains require the dependency information or the specific criteria; without this, it is unclear whether the improvement is driven by the core innovation or by other implementation choices.
Authors: We acknowledge that the current manuscript lacks ablations isolating the role of dependency information and the hand-defined criteria. In the revised version, we will add a dedicated ablation subsection that reports correlation results for: (a) ReFEree without dependency context, (b) variants using only generic (non-code-specific) inconsistency criteria, and (c) sensitivity tests varying the aggregation weights. These experiments will quantify the incremental contribution of each element to the 15-18% improvement over baselines. revision: yes
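A sketch of what the promised ablation readout could look like: correlating each variant's scores with the human labels using the standard Pearson, Spearman, and Kendall coefficients from scipy. The variant names mirror the ablations listed above; all numbers are placeholders, not results.

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

human = [0.2, 0.9, 0.5, 1.0, 0.4, 0.7]  # hypothetical human consistency labels
variants = {                             # hypothetical ablation scores
    "full ReFEree":        [0.25, 0.85, 0.55, 0.95, 0.35, 0.75],
    "w/o dependency info": [0.40, 0.70, 0.60, 0.80, 0.50, 0.65],
    "generic criteria":    [0.30, 0.60, 0.70, 0.75, 0.45, 0.55],
}

for name, scores in variants.items():
    r, _ = pearsonr(human, scores)
    rho, _ = spearmanr(human, scores)
    tau, _ = kendalltau(human, scores)
    print(f"{name:22s} r={r:.3f}  rho={rho:.3f}  tau={tau:.3f}")
```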
Circularity Check
No circularity; empirical correlation reported against independently constructed human benchmark
Full rationale
The paper defines its own factual inconsistency criteria, computes segment-level scores using those criteria plus dependency information, aggregates them, and then reports correlation against a separately constructed benchmark containing human-annotated factual consistency labels. No equations, fitted parameters, or self-citations are shown that reduce the final correlation result or the method's output to the input criteria by construction. The human benchmark functions as an external validation set rather than a tautological re-application of the same definitions, satisfying the requirement for an independent check.
Forward citations
Cited by 1 Pith paper
-
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.
Reference graph
Works this paper leans on
-
[1]
ClassEval: A manually-crafted benchmark for evaluating LLMs on class-level code generation
-
[3]-[8]
Fragments of the paper's appendix prompts rather than bibliographic entries: segment-level evaluation instructions ("Read the code summary text and check if it accurately describes the code"); a binary criterion check in which "1" means the criterion does not exist and "0" means it exists, given the input code, related dependency information, and the summary segment; and hallucinated-summary generation prompts for four inconsistency types: an identifier name that does not match the code, a described return or variable type inconsistent with the code, described functionality that does not reflect what the Python code implements, and content unnecessary or unrelated to the input code.
-
[9]
ROUGE (Lin, 2004)
Reference-based baselines use the English descriptions as reference summaries. ROUGE measures the overlap of n-grams between the generated output and reference summaries; the paper uses ROUGE-1/2/L F1 scores. BLEU (Papineni et al., 2002) measures the n-gram precision between the generated text and references.
2004
-
[10]
Reference-free baseline methods
Overall factual consistency between the input code and the entire summary is evaluated with five baselines using the same LLM, GPT-4.1-mini (gpt-4.1-mini-2025-04-14), with temperature 0.1, top-p 0.9, top-k 50, and max new tokens 4. Since LLM-Judge, G-Eval, and FactScore were originally designed as evaluation methods for the NLP domain, their prompts are modified to ensure applicability to the code domain.
2025
-
[13]
FactScore (Min et al., 2023)
A fine-grained method proposed in NLP that breaks a generation into a series of independent facts and assigns each a factual consistency score from 1 to 5.
2023
-
[14]-[16]
Further appendix prompt fragments: instructions to read the code carefully and understand its main intent, check whether the summary accurately describes it, and assign a factual consistency score from 1 to 5; the implementation details note that the method supports various closed-source and open-source LLMs as segment-level evaluators.
-
[17]
The paper's analysis shows that most factual inconsistencies can be correctly determined from information about directly invoked entities (depth-1).
-
[18]
Expanding retrieval to 2-hop or deeper adds transitive dependencies, internal implementation details, and indirect call-chain information, but the paper's experiments show this additional context does not improve correlation with human judgment:

Context setting     r_p (Pearson)   r_s (Spearman)   τ (Kendall)   Average
0-hop (w/o info)    0.432           0.432            0.349         0.404
1-hop (ours)        0.497           0.489            0.390         …