ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization
Pith reviewed 2026-05-10 15:28 UTC · model grok-4.3
The pith
ReFEree checks factual accuracy in long real-world code summaries without references by scoring inconsistencies at the segment level with dependency information.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReFEree defines factual inconsistency criteria specific to code summaries and evaluates them at the segment level using these criteria along with dependency information. These segment-level results are then aggregated into a fine-grained score. On a newly constructed benchmark containing human-annotated factual consistency labels for real-world code summaries, ReFEree records the highest correlation with human judgment among thirteen baselines.
What carries the argument
Segment-level evaluation that applies code-specific factual inconsistency criteria together with dependency checks before aggregation into an overall score.
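To make the mechanism concrete, here is a minimal sketch of segment-level checking with aggregation, assuming a hypothetical llm_check helper that asks an LLM whether one inconsistency criterion holds for one summary segment given the code and its dependency context. The criterion names follow the paper's appendix prompts, but the binary scoring and plain averaging are illustrative, not ReFEree's exact formulation.

```python
from statistics import mean

# Code-specific factual inconsistency criteria of the kind the paper defines
# (paraphrased from its appendix prompts).
CRITERIA = [
    "identifier mismatch",      # named function/class/variable not in the code
    "type mismatch",            # described return/variable type is wrong
    "functionality mismatch",   # described behavior differs from the code
    "irrelevant content",       # content unrelated to the input code
]

def llm_check(criterion: str, segment: str, code: str, deps: str) -> int:
    """Hypothetical LLM call: 1 if the criterion is absent (segment is
    consistent), 0 if it is violated."""
    raise NotImplementedError("backed by an LLM prompt in practice")

def score_summary(segments: list[str], code: str, deps: str):
    """Check every criterion for every segment, using the code plus its
    dependency context, then average segment scores into one overall score."""
    per_segment = []
    for seg in segments:
        checks = {c: llm_check(c, seg, code, deps) for c in CRITERIA}
        per_segment.append({"segment": seg, "checks": checks,
                            "score": mean(checks.values())})
    overall = mean(s["score"] for s in per_segment)
    return overall, per_segment
```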
If this is right
- ReFEree supplies both an overall score and per-segment diagnoses that identify exactly which parts of a summary contain inconsistencies (see the gating sketch after this list).
- The method works without any reference summary, removing the need for gold-standard text that is often unavailable for real codebases.
- The new benchmark dataset enables direct comparison of future evaluation methods on the same human-labeled real-world examples.
- Higher correlation with humans means automatic scores can more reliably guide selection or fine-tuning of code summarization models.
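A possible consumption pattern for those per-segment diagnoses, reusing the score_summary sketch above; the gate_summary name and the 0.8 threshold are invented for illustration, not taken from the paper.

```python
def gate_summary(segments: list[str], code: str, deps: str,
                 threshold: float = 0.8) -> bool:
    """Block a generated summary from shipping when its aggregate score falls
    below an arbitrary threshold, reporting which segments fail which
    criteria."""
    overall, diagnoses = score_summary(segments, code, deps)
    if overall >= threshold:
        return True  # summary passes the factual consistency gate
    for d in diagnoses:
        failed = [c for c, ok in d["checks"].items() if ok == 0]
        if failed:
            print(f"Flagged segment {d['segment']!r}: {', '.join(failed)}")
    return False
```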
Where Pith is reading between the lines
- Teams building production code-documentation tools could embed the segment-level checker to flag risky summaries before they reach users.
- The same dependency-aware segmentation idea might transfer to evaluating long-form outputs in related tasks such as commit-message generation or API documentation.
- Because the inconsistency criteria are manually specified, they will require periodic review whenever common coding patterns or documentation styles shift.
- Researchers could test whether replacing the fixed criteria with learned ones trained on the human annotations further raises correlation.
Load-bearing premise
The authors' hand-defined list of factual inconsistency types for code summaries, combined with segment-level dependency checks, matches human judgments of factual consistency across diverse real codebases.
What would settle it
A fresh human annotation study on summaries drawn from additional projects and programming languages: the premise fails if ReFEree's segment scores show substantially lower agreement with the new human labels than reported on the original benchmark, and holds if agreement stays comparable.
Original abstract
As Large Language Models (LLMs) have become capable of generating long and descriptive code summaries, accurate and reliable evaluation of factual consistency has become a critical challenge. However, previous evaluation methods are primarily designed for short summaries of isolated code snippets. Consequently, they struggle to provide fine-grained evaluation of multi-sentence functionalities and fail to accurately assess dependency context commonly found in real-world code summaries. To address this, we propose ReFEree, a reference-free and fine-grained method for evaluating factual consistency in real-world code summaries. We define factual inconsistency criteria specific to code summaries and evaluate them at the segment level using these criteria along with dependency information. These segment-level results are then aggregated into a fine-grained score. We construct a code summarization benchmark with human-annotated factual consistency labels. The evaluation results demonstrate that ReFEree achieves the highest correlation with human judgment among 13 baselines, improving 15-18% over the previous state-of-the-art. Our code and data are available at https://github.com/bsy99615/ReFEree.git.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ReFEree, a reference-free and fine-grained method for evaluating factual consistency in real-world LLM-generated code summaries. It hand-defines factual inconsistency criteria specific to code, scores summaries at the segment level using these criteria plus dependency information, aggregates to an overall score, constructs a new human-annotated benchmark, and reports that ReFEree achieves the highest correlation with human judgments among 13 baselines with a 15-18% improvement over prior state-of-the-art.
Significance. If the correlation gains hold after addressing potential circularity between the hand-defined criteria and the annotation process, the work would meaningfully advance evaluation of code summarization by handling longer, dependency-rich summaries that prior reference-based or snippet-level metrics cannot address. The public release of code and data is a positive contribution to reproducibility in this area.
major comments (2)
- [Abstract and benchmark construction section] The abstract and benchmark construction section do not describe the human annotation guidelines, inter-annotator agreement, or whether annotators were shown or primed with the same factual inconsistency criteria defined for ReFEree. Because the central claim rests on superior correlation with these human labels, any overlap would render the 15-18% improvement partly tautological rather than an independent validation.
- [Method section] The method section defines inconsistency criteria and segment-level scoring but provides no ablation or sensitivity analysis showing that the reported gains require the dependency information or the specific criteria; without this, it is unclear whether the improvement is driven by the core innovation or by other implementation choices.
minor comments (2)
- [Abstract] The abstract states 'improving 15-18% over the previous state-of-the-art' without naming the exact correlation coefficient (Pearson, Spearman, etc.) or identifying which of the 13 baselines constitutes the prior SOTA.
- [Results section] Figure and table captions should explicitly state the number of summaries, codebases, and annotators in the benchmark to allow readers to assess scale.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of transparency in our benchmark and the need to better isolate the contributions of our method components. We will revise the manuscript to address both major comments fully.
Point-by-point responses
-
Referee: [Abstract and benchmark construction section] The abstract and benchmark construction section do not describe the human annotation guidelines, inter-annotator agreement, or whether annotators were shown or primed with the same factual inconsistency criteria defined for ReFEree. Because the central claim rests on superior correlation with these human labels, any overlap would render the 15-18% improvement partly tautological rather than an independent validation.
Authors: We agree that these details are necessary to establish the independence of the human labels. In the revised manuscript, we will expand the benchmark construction section with: (1) the complete annotation guidelines provided to annotators, (2) the inter-annotator agreement statistics (Cohen's kappa and percentage agreement), and (3) an explicit statement that annotators received no exposure to the ReFEree-specific criteria. Annotators were instead instructed to identify factual inconsistencies based solely on whether summary segments were supported by the provided code and its dependency context, using their own expertise. This will confirm that the reported correlation gains reflect genuine alignment rather than circularity. revision: yes
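For concreteness, a self-contained sketch of the two agreement statistics this response promises, for two annotators assigning binary consistent/inconsistent labels; the example labels are invented.

```python
from collections import Counter

def percentage_agreement(a: list[int], b: list[int]) -> float:
    """Fraction of items on which the two annotators agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa = (p_o - p_e) / (1 - p_e), where p_e is the chance
    agreement implied by each annotator's marginal label frequencies."""
    n = len(a)
    p_o = percentage_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

ann1 = [1, 1, 0, 1, 0, 1, 1, 0]  # 1 = consistent, 0 = inconsistent
ann2 = [1, 0, 0, 1, 0, 1, 1, 1]
print(percentage_agreement(ann1, ann2))  # 0.75
print(cohens_kappa(ann1, ann2))          # ~0.467
```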
-
Referee: [Method section] The method section defines inconsistency criteria and segment-level scoring but provides no ablation or sensitivity analysis showing that the reported gains require the dependency information or the specific criteria; without this, it is unclear whether the improvement is driven by the core innovation or by other implementation choices.
Authors: We acknowledge that the current manuscript lacks ablations isolating the role of dependency information and the hand-defined criteria. In the revised version, we will add a dedicated ablation subsection that reports correlation results for: (a) ReFEree without dependency context, (b) variants using only generic (non-code-specific) inconsistency criteria, and (c) sensitivity tests varying the aggregation weights. These experiments will quantify the incremental contribution of each element to the 15-18% improvement over baselines. revision: yes
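A sketch of what the promised ablation readout could look like: correlating each variant's scores with the human labels using the standard Pearson, Spearman, and Kendall coefficients from scipy. The variant names mirror the ablations listed above; all numbers are placeholders, not results.

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

human = [0.2, 0.9, 0.5, 1.0, 0.4, 0.7]  # hypothetical human consistency labels
variants = {                             # hypothetical ablation scores
    "full ReFEree":        [0.25, 0.85, 0.55, 0.95, 0.35, 0.75],
    "w/o dependency info": [0.40, 0.70, 0.60, 0.80, 0.50, 0.65],
    "generic criteria":    [0.30, 0.60, 0.70, 0.75, 0.45, 0.55],
}

for name, scores in variants.items():
    r, _ = pearsonr(human, scores)
    rho, _ = spearmanr(human, scores)
    tau, _ = kendalltau(human, scores)
    print(f"{name:22s} r={r:.3f}  rho={rho:.3f}  tau={tau:.3f}")
```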
Circularity Check
No circularity; empirical correlation reported against independently constructed human benchmark
Full rationale
The paper defines its own factual inconsistency criteria, computes segment-level scores using those criteria plus dependency information, aggregates them, and then reports correlation against a separately constructed benchmark containing human-annotated factual consistency labels. No equations, fitted parameters, or self-citations are shown that reduce the final correlation result or the method's output to the input criteria by construction. The human benchmark functions as an external validation set rather than a tautological re-application of the same definitions, satisfying the requirement for an independent check.
Forward citations
Cited by 1 Pith paper
-
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.
Reference graph
Works this paper leans on
-
[1]
ClassEval: A manually-crafted benchmark for evaluating LLMs on class-level code generation
-
[3]-[8]
Fragments of the paper's appendix prompts rather than bibliographic entries: segment-level evaluation instructions ("Read the code summary text and check if it accurately describes the code"); a binary criterion check in which "1" means the criterion does not exist and "0" means it exists, given the input code, related dependency information, and the summary segment; and hallucinated-summary generation prompts for four inconsistency types: an identifier name that does not match the code, a described return or variable type inconsistent with the code, described functionality that does not reflect what the Python code implements, and content unnecessary or unrelated to the input code.
-
[9]
ROUGE (Lin, 2004)
Reference-based baselines use the English descriptions as reference summaries. ROUGE measures the overlap of n-grams between the generated output and reference summaries; the paper uses ROUGE-1/2/L F1 scores. BLEU (Papineni et al., 2002) measures the n-gram precision between the generated text and references.
2004
-
[10]
Reference-free baseline methods
Overall factual consistency between the input code and the entire summary is evaluated with five baselines using the same LLM, GPT-4.1-mini (gpt-4.1-mini-2025-04-14), with temperature 0.1, top-p 0.9, top-k 50, and max new tokens 4. Since LLM-Judge, G-Eval, and FactScore were originally designed as evaluation methods for the NLP domain, their prompts are modified to ensure applicability to the code domain.
2025
-
[13]
FactScore (Min et al., 2023)
A fine-grained method proposed in NLP that breaks a generation into a series of independent facts and assigns each a factual consistency score from 1 to 5.
2023
-
[14]-[16]
Further appendix prompt fragments: instructions to read the code carefully and understand its main intent, check whether the summary accurately describes it, and assign a factual consistency score from 1 to 5; the implementation details note that the method supports various closed-source and open-source LLMs as segment-level evaluators.
-
[17]
The paper's analysis shows that most factual inconsistencies can be correctly determined from information about directly invoked entities (depth-1).
-
[18]
Expanding retrieval to 2-hop or deeper adds transitive dependencies, internal implementation details, and indirect call-chain information, but the paper's experiments show this additional context does not improve correlation with human judgment:

Context setting     r_p (Pearson)   r_s (Spearman)   τ (Kendall)   Average
0-hop (w/o info)    0.432           0.432            0.349         0.404
1-hop (ours)        0.497           0.489            0.390         …