LLM-Based Code Documentation Generation and Multi-Judge Evaluation

Ikbel Ghrab; Ines Abdeljaoued-Tej; Ismail Khenissi; Mohamed Dhieb

arxiv: 2606.09852 · v1 · pith:WLVCDTPVnew · submitted 2026-05-11 · 💻 cs.HC · cs.AI · cs.CL · cs.LG · cs.MA · cs.SE

Pith reviewed 2026-06-30 22:36 UTC · model grok-4.3

classification 💻 cs.HC cs.AI cs.CL cs.LG cs.MA cs.SE

keywords LLM, code documentation, multi-judge evaluation, prompt engineering, medical physics software, healthcare code quality, automated documentation

The pith

An LLM framework generates code documentation and uses multiple LLMs as judges to measure quality differences across models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that eight different large language models can be orchestrated to produce structured documentation directly from source code and repositories. A separate MultiLLMasJudges system then deploys four additional LLMs to score each output on nine fixed criteria including completeness, clarity, and faithfulness. Experiments on an open-source medical physics library reveal a 42 percent performance spread between the strongest and weakest models. The authors argue that this combination of diverse generation, optimized prompts, and automated judging can raise overall documentation standards while cutting the amount of manual review required in safety-critical healthcare code.

Core claim

The authors claim that a PocketFlow-based pipeline using eight state-of-the-art LLMs produces context-aware code documentation, and that a MultiLLMasJudges framework in which four independent LLMs evaluate outputs across nine criteria can quantify quality differences, as shown by a 42 percent gap between top and bottom models on a medical physics library.

What carries the argument

The MultiLLMasJudges evaluation framework, in which four LLMs independently score generated documentation on nine quality criteria to guide model selection and combination.
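
One way such multi-judge scores might be rolled up into a model ranking is sketched below. The paper's actual aggregation rule is not described in the abstract, so the unweighted averaging, the 1-5 scale, the model and judge names, and every score here are assumptions made for illustration. Only the three criteria named in the abstract are shown, and two judges stand in for the four.

```python
from statistics import mean

# Hypothetical scores on an assumed 1-5 scale: scores[model][judge][criterion].
# Only the three criteria named in the abstract appear; the other six are omitted.
scores = {
    "model_a": {
        "judge_1": {"Completeness": 5, "Clarity": 4, "Faithfulness": 5},
        "judge_2": {"Completeness": 4, "Clarity": 4, "Faithfulness": 5},
    },
    "model_b": {
        "judge_1": {"Completeness": 3, "Clarity": 2, "Faithfulness": 3},
        "judge_2": {"Completeness": 2, "Clarity": 3, "Faithfulness": 3},
    },
}


def criterion_means(judge_scores):
    """Average each criterion across all judges for one model."""
    criteria = next(iter(judge_scores.values())).keys()
    return {c: mean(j[c] for j in judge_scores.values()) for c in criteria}


def overall_score(judge_scores):
    """Unweighted mean over criteria of the per-criterion judge means (an assumption)."""
    return mean(criterion_means(judge_scores).values())


# Rank models from highest to lowest overall score.
ranking = sorted(scores, key=lambda m: overall_score(scores[m]), reverse=True)
print(ranking)
```

An unweighted mean treats Faithfulness and Clarity as equally important; a safety-critical setting might reasonably weight them differently, which could change the ranking.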

If this is right

Combining outputs from multiple generation models raises overall documentation quality.
Optimized prompting within the orchestration pipeline improves the structured output.
The multi-judge scores can be used to select or ensemble the best-performing models for new codebases.
The approach reduces the volume of manual documentation effort needed for safety-critical healthcare software.
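
The selection step in the list above can be pictured with a short sketch. It assumes per-model, per-criterion mean judge scores are already available; the model names, the scores, and the 1-5 scale are hypothetical, and the paper may select or combine models differently.

```python
# Hypothetical per-model, per-criterion mean judge scores (assumed 1-5 scale).
mean_scores = {
    "model_a": {"Completeness": 4.5, "Clarity": 3.5, "Faithfulness": 4.8},
    "model_b": {"Completeness": 3.9, "Clarity": 4.6, "Faithfulness": 4.1},
    "model_c": {"Completeness": 2.7, "Clarity": 3.0, "Faithfulness": 2.9},
}


def best_overall(table):
    """Return the model with the highest unweighted mean across criteria."""
    return max(table, key=lambda m: sum(table[m].values()) / len(table[m]))


def best_per_criterion(table):
    """Return, for each criterion, the model that scores highest on it."""
    criteria = next(iter(table.values())).keys()
    return {c: max(table, key=lambda m: table[m][c]) for c in criteria}


print(best_overall(mean_scores))
print(best_per_criterion(mean_scores))
```

When the per-criterion winners differ from the overall winner, as they do for Clarity in this made-up table, that is the situation in which combining outputs from several models could plausibly help.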

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same multi-judge structure could be applied to evaluate documentation in non-medical domains where reliability matters.
If LLM judges prove consistent, the framework could serve as a lightweight first filter before any human review.
Extending the nine criteria or the judge pool might change the observed 42 percent gap and therefore model rankings.

Load-bearing premise

LLM judges can reliably rate documentation quality on criteria such as faithfulness without any human validation or reference ground truth.

What would settle it

A side-by-side comparison in which human experts rate the same set of generated documents and the agreement rate with the four-LLM judges falls below a pre-specified threshold.
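
A minimal sketch of such a check follows, using Spearman rank correlation as the agreement measure. The choice of Spearman, the 0.7 threshold, and all ratings are assumptions for illustration; the paper reports no human ratings, and other agreement statistics would be equally defensible.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


THRESHOLD = 0.7  # assumed value, to be fixed before any data is inspected

# Hypothetical mean ratings for five generated documents.
human_ratings = [4.5, 3.0, 2.0, 4.0, 1.5]
judge_ratings = [4.2, 3.4, 2.5, 3.1, 1.8]

rho = spearman(human_ratings, judge_ratings)
judges_validated = rho >= THRESHOLD
print(round(rho, 3), judges_validated)
```

Fixing the threshold in advance matters: choosing it after seeing the correlation would let any result be read as a pass.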

Original abstract

High-quality source code documentation is vital yet often neglected, especially in critical domains like healthcare where reliability and maintainability are essential. We presented an AI powered framework that automates documentation generation from code and repositories using eight state of the art Large Language Models (LLMs), including GPT, Gemini, Qwen, and LLaMA variants. Built on the PocketFlow orchestration framework, the system applies modular pipelines and advanced prompt engineering to produce structured, context aware documentation. To ensure quality and guide model selection, we introduced a MultiLLMasJudges evaluation framework, where four independent LLMs assess outputs across nine criteria, such as Completeness, Clarity, and Faithfulness. Experiments conducted on an open-source medical physics library, demonstrated showed a 42% performance gap between top and bottom models. By combining diverse model outputs, optimized prompting, and rigorous evaluation, our approach enhances documentation quality and reduces manual effort, especially in safety critical healthcare software.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper runs eight LLMs to generate documentation for medical physics code, has four other LLMs judge the outputs on nine criteria, and reports a 42% gap, but offers no human validation of the judges.

The core of this paper is straightforward: they took an open-source medical physics library, fed its code to eight different LLMs with various prompts, generated documentation, and then had four other LLMs score the results on nine criteria including faithfulness and completeness. They report a 42% spread between the top and bottom models and claim the multi-judge setup helps pick better outputs.

What the work actually does is apply existing LLM capabilities to a concrete, safety-relevant codebase and run the full pipeline end to end. Using PocketFlow for orchestration and testing on real medical physics code is useful for anyone who needs to document similar libraries. The numbers give a rough sense of how current models differ on this task.

The main weakness is the evaluation itself. The 42% gap comes entirely from LLM judges scoring subjective qualities, yet the paper provides no human ratings, no correlation between judge scores and the ratings of people who actually understand the code, and no inter-judge agreement stats. That leaves open the possibility that the gap measures judge-model preferences rather than documentation quality. The abstract also skips basic details on how the gap was calculated or whether differences are statistically tested.
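
To see why the calculation method matters, the sketch below computes a top-versus-bottom gap under three plausible definitions. The two scores and the 1-5 scale are invented so that one definition lands on 42%; nothing here is taken from the paper, which does not say which definition it uses.

```python
def gap_relative_to_top(top, bottom):
    """Gap as a fraction of the top model's score."""
    return (top - bottom) / top


def gap_relative_to_bottom(top, bottom):
    """Gap as a fraction of the bottom model's score."""
    return (top - bottom) / bottom


def gap_as_share_of_scale(top, bottom, scale_min, scale_max):
    """Gap as a fraction of the full width of the rating scale."""
    return (top - bottom) / (scale_max - scale_min)


# Hypothetical mean scores on an assumed 1-5 scale.
top, bottom = 4.5, 2.61

print(f"relative to top:    {gap_relative_to_top(top, bottom):.1%}")
print(f"relative to bottom: {gap_relative_to_bottom(top, bottom):.1%}")
print(f"share of scale:     {gap_as_share_of_scale(top, bottom, 1, 5):.1%}")
```

The same pair of scores yields roughly 42%, 72%, or 47% depending on the denominator, so a headline percentage is hard to interpret without the formula behind it.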

This is for groups already working on LLM tooling for domain-specific software engineering, especially in healthcare. A reader who wants to see current model performance on documentation generation will get some data points; someone looking for a validated new evaluation method will not.

It is worth sending to peer review. The experiment is grounded enough to be worth referee time, but the authors should expect direct questions on human validation of the judges.

Referee Report

2 major / 1 minor

Summary. The paper presents an AI-powered framework using eight state-of-the-art LLMs (including GPT, Gemini, Qwen, and LLaMA variants) orchestrated via PocketFlow with modular pipelines and prompt engineering to generate structured code documentation from repositories. It introduces a MultiLLM-as-Judges evaluation framework in which four independent LLMs score outputs on nine criteria (e.g., Completeness, Clarity, Faithfulness). Experiments on an open-source medical physics library report a 42% performance gap between top and bottom models, with the goal of improving documentation quality and reducing manual effort in safety-critical healthcare software.

Significance. If the LLM judges are shown to correlate with human assessments, the framework could meaningfully reduce documentation burden in regulated domains. The choice of an open-source medical-physics library and the use of multiple generator and judge models are positive steps toward reproducibility. However, the absence of any human validation or ground-truth calibration for the nine-criteria scoring directly limits the interpretability of the headline 42% gap.

major comments (2)

[Abstract] Abstract (performance-gap claim): The reported 42% gap is produced entirely by four LLM judges scoring subjective criteria including Faithfulness and Completeness; no inter-judge agreement statistics, correlation with human raters, or calibration against existing library documentation are described. This evaluation method is load-bearing for the central claim that the approach enhances documentation quality.
[MultiLLM-as-Judges framework] Evaluation framework description: The MultiLLM-as-Judges setup risks circularity because LLM outputs are judged by other LLMs on the same subjective dimensions without independent human validation or ground-truth comparison, directly undermining the reliability of model rankings.
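
As an illustration of the inter-judge agreement statistics requested in the first comment, the sketch below computes Cohen's kappa for each pair of judges and averages the result. The choice of kappa, the judge names, and the integer ratings are assumptions; for ordinal scores a weighted kappa or Krippendorff's alpha might be a better fit.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(ratings):
    """Average Cohen's kappa over every pair of judges."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa(ratings[p], ratings[q]) for p, q in pairs) / len(pairs)


# Hypothetical Faithfulness ratings (assumed 1-5 integers) for six documents.
ratings = {
    "judge_1": [5, 4, 3, 5, 2, 4],
    "judge_2": [5, 4, 3, 4, 2, 4],
    "judge_3": [4, 4, 3, 5, 2, 3],
}

print(round(mean_pairwise_kappa(ratings), 3))
```

High agreement among judges would show only that the judges are consistent with one another; it would not address the second comment, since all four could share the same bias relative to human readers.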

minor comments (1)

[Abstract] Abstract: The phrase 'demonstrated showed' is a grammatical error and should be corrected.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We respond point-by-point to the major comments and indicate planned revisions where appropriate.

Referee: [Abstract] Abstract (performance-gap claim): The reported 42% gap is produced entirely by four LLM judges scoring subjective criteria including Faithfulness and Completeness; no inter-judge agreement statistics, correlation with human raters, or calibration against existing library documentation are described. This evaluation method is load-bearing for the central claim that the approach enhances documentation quality.

Authors: We agree that the 42% gap rests entirely on LLM judges without reported inter-judge agreement, human correlation, or calibration to existing documentation. This limits the strength of claims about quality improvement. We will revise the abstract to qualify the performance-gap statement and add a limitations subsection that reports any inter-judge agreement statistics computable from the existing judge outputs.
Revision: yes

Referee: [MultiLLM-as-Judges framework] Evaluation framework description: The MultiLLM-as-Judges setup risks circularity because LLM outputs are judged by other LLMs on the same subjective dimensions without independent human validation or ground-truth comparison, directly undermining the reliability of model rankings.

Authors: The risk of circularity is a substantive concern. Although we used separate model sets for generation and judging, the absence of human validation or ground-truth comparison does weaken the interpretability of the rankings. We will expand the framework section to explicitly discuss this limitation and frame the multi-judge approach as an initial, scalable proxy rather than a validated substitute for human assessment.
Revision: partial

standing simulated objections not resolved

We lack human rater data and therefore cannot supply correlation statistics or calibration against existing library documentation.

Circularity Check

0 steps flagged

No significant circularity; empirical results from LLM judges are measurements, not forced by construction

The paper presents an empirical framework for LLM-based documentation generation followed by evaluation via four separate LLMs scoring nine criteria. The reported 42% gap is an observed experimental outcome on the medical-physics library, not a mathematical derivation or fitted parameter. No equations, self-definitional loops, parameter-fitting steps renamed as predictions, or load-bearing self-citations appear in the provided text. The evaluation method is applied to produce the metrics rather than presupposing them, satisfying the requirement for independent content in the central claim.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on assumptions about LLM capabilities in code understanding and evaluation, which are not independently verified in the abstract.

axioms (2)

Domain assumption: LLMs can generate accurate and context-aware code documentation when given appropriate prompts. Central to the framework's ability to produce useful outputs.
Domain assumption: Multi-LLM judging provides a reliable proxy for documentation quality. Underpins the evaluation framework and the reported performance gap.

Reference graph

Works this paper leans on

15 extracted references · 9 canonical work pages

[1] R. Sanders and D. Kelly, “Dealing with risk in scientific software development,” IEEE Software, vol. 25, no. 4, pp. 21–28, 2008. [Online]. Available: http://doi.org/10.1109/MS.2008.84
[2] P. van der Spek, S. Klusener, and P. van de Laar, “Complementing software documentation,” Views on evolvability of embedded systems, pp. 37–51, 2011. [Online]. Available: https://doi.org/10.1007/978-90-481-9849-8_3
[3] W. S. Tan, M. Wagner, and C. Treude, “Detecting outdated code element references in software repository documentation,” Empirical Software Engineering, vol. 29, no. 1, p. 5, 2024. [Online]. Available: https://doi.org/10.1007/s10664-023-10397-6
[4] T. Winters, T. Manshreck, and H. Wright, Software engineering at google: Lessons learned from programming over time. O’Reilly Media, Inc., 2020.
[5] L. Chen, Q. Guo, H. Jia, Z. Zeng, X. Wang, Y. Xu, J. Wu, Y. Wang, Q. Gao, J. Wang, W. Ye, and S. Zhang, “A survey on evaluating large language models in code generation tasks,” Preprint arXiv, 2025. [Online]. Available: https://arxiv.org/abs/2408.16498
[6] B. Dagenais and M. P. Robillard, “Creating and evolving developer documentation: understanding the decisions of open source contributors,” in Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE ’10. New York, NY, USA: Association for Computing Machinery, 2010, p. 127–136. [Online]. Available: ... doi:10.1145/1882291.1882312
[7] H. Jelodar, M. Meymani, and R. Razavi-Far, “Large language models (LLMs) for source code analysis: Applications, models and datasets,” Preprint arXiv, 2025. [Online]. Available: https://arxiv.org/abs/2503.17502
[8] DocuWriter.ai. (2024) AI-Powered code documentation tool. [Online]. Available: https://www.docuwriter.ai/
[9] Mintlify. (2024) Beautiful docs for your codebase. [Online]. Available: https://mintlify.com/
[10] Workik. (2024) AI based software automation. [Online]. Available: https://workik.com/
[11] S. Chakrabarty and S. Pal, “Free and customizable code documentation with llms: A fine-tuning approach,” Preprint arXiv, 2024. [Online]. Available: https://arxiv.org/abs/2412.00726
[12] H. Zhang, J. Haskell, and Y. Frost, “Flow state: Humans enabling AI systems to program themselves,” Preprint arXiv, 2025. [Online]. Available: https://arxiv.org/abs/2504.03771
[13] X.-Y. Fu, M. T. R. Laskar, C. Chen, and S. B. Tn, “Are large language models reliable judges? a study on the factuality evaluation capabilities of LLMs,” in Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), S. Gehrmann, A. Wang, J. Sedoc, E. Clark, K. Dhole, K. R. Chandu, E. Santus, and H. Sedghamiz, Eds. Sin... 2023
[14] M. F. Khan, “Evaluation of quantitative and qualitative metrics for assessing hallucination phenomena in large language models,” Nuvern Applied Science Reviews, vol. 8, no. 10, pp. 38–47, 2024. [Online]. Available: https://nuvern.com/index.php/nasr/article/view/2024-10-16
[15] B. Malin, T. Kalganova, and N. Boulgouris, “A review of faithfulness metrics for hallucination assessment in large language models,” IEEE Journal of Selected Topics in Signal Processing, 2025. [Online]. Available: http://doi.org/10.1109/JSTSP.2025.3579203
