Since LLM-Judge, G-Eval, and FactScore are originally designed as evaluation methods for the NLP domain, we modify their prompts to ensure applicability to the code domain

Reference-free methods: We evaluate the overall factual consistency between the input code and the entire summary with 5 baselines using the same LLM, GPT-4 · 2025

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization

cs.CL · 2026-04-12 · unverdicted · novelty 7.0

ReFEree evaluates factual consistency in real-world code summaries at segment level using reference-free criteria and dependency context, achieving 15-18% higher correlation with human judgments than prior state-of-the-art methods on a new benchmark.
