LLM-Based Code Documentation Generation and Multi-Judge Evaluation
Pith reviewed 2026-06-30 22:36 UTC · model grok-4.3
The pith
An LLM framework generates code documentation and uses multiple LLMs as judges to measure quality differences across models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that a PocketFlow-based pipeline using eight state-of-the-art LLMs produces context-aware code documentation, and that a MultiLLMasJudges framework in which four independent LLMs evaluate outputs across nine criteria can quantify quality differences, as shown by a 42 percent gap between top and bottom models on a medical physics library.
What carries the argument
The MultiLLMasJudges evaluation framework, in which four LLMs independently score generated documentation on nine quality criteria to guide model selection and combination.
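As a rough illustration of how such a score table might be collapsed into a ranking, here is a minimal sketch. The data layout, the model and judge names, the 1-5 scale, the equal weighting of judges and criteria, and the relative definition of the gap are all assumptions made for this example; the source does not specify how the paper aggregates scores or defines its 42 percent figure.

```python
from statistics import mean

# Hypothetical layout: scores[model][judge][criterion] -> numeric score.
# Only two judges and two criteria are shown to keep the example short;
# the paper uses four judges and nine criteria.
scores = {
    "model_a": {
        "judge_1": {"completeness": 5, "clarity": 4},
        "judge_2": {"completeness": 4, "clarity": 4},
    },
    "model_b": {
        "judge_1": {"completeness": 3, "clarity": 3},
        "judge_2": {"completeness": 4, "clarity": 2},
    },
}


def model_score(judge_scores):
    """Average each judge's criterion scores, then average across judges."""
    per_judge = [mean(criteria.values()) for criteria in judge_scores.values()]
    return mean(per_judge)


def rank_models(all_scores):
    """Return (model, score) pairs sorted from best to worst."""
    totals = {model: model_score(js) for model, js in all_scores.items()}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def relative_gap(ranked):
    """Top-to-bottom gap as a fraction of the top score (an assumed definition)."""
    top, bottom = ranked[0][1], ranked[-1][1]
    return (top - bottom) / top


ranked = rank_models(scores)
gap = relative_gap(ranked)
```

Averaging per judge first keeps one judge with a missing criterion from being silently underweighted, but a weighted or median-based aggregate would be an equally defensible choice.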
If this is right
- Combining outputs from multiple generation models raises overall documentation quality.
- Optimized prompting within the orchestration pipeline improves the structured output.
- The multi-judge scores can be used to select or ensemble the best-performing models for new codebases.
- The approach reduces the manual documentation effort needed for safety-critical healthcare software.
Where Pith is reading between the lines
- The same multi-judge structure could be applied to evaluate documentation in non-medical domains where reliability matters.
- If LLM judges prove consistent, the framework could serve as a lightweight first filter before any human review.
- Extending the nine criteria or the judge pool might change the observed 42 percent gap and therefore model rankings.
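One way to probe the last point is a leave-one-judge-out check: recompute the top-to-bottom gap with each judge removed and see how far it moves. The sketch below assumes a hypothetical `scores[model][judge][criterion]` layout, equal weighting, and a gap measured relative to the top score; none of these details are confirmed by the source.

```python
from statistics import mean

# Hypothetical scores with three judges and a single criterion for brevity.
scores = {
    "model_a": {"j1": {"clarity": 5}, "j2": {"clarity": 4}, "j3": {"clarity": 3}},
    "model_b": {"j1": {"clarity": 2}, "j2": {"clarity": 4}, "j3": {"clarity": 3}},
}


def relative_gap(all_scores, judges):
    """Relative top-to-bottom gap using only the listed judges."""
    totals = [
        mean([mean(list(model[j].values())) for j in judges])
        for model in all_scores.values()
    ]
    return (max(totals) - min(totals)) / max(totals)


def leave_one_judge_out(all_scores):
    """Map each dropped judge to the gap computed from the remaining judges."""
    judges = sorted(next(iter(all_scores.values())))
    return {
        dropped: relative_gap(all_scores, [j for j in judges if j != dropped])
        for dropped in judges
    }


full_gap = relative_gap(scores, ["j1", "j2", "j3"])
sensitivity = leave_one_judge_out(scores)
```

If the gap swings widely when a single judge is dropped, as it does in this toy data, the ranking depends more on the judge pool than on the generators.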
Load-bearing premise
LLM judges can reliably rate documentation quality on criteria such as faithfulness without any human validation or reference ground truth.
What would settle it
A side-by-side comparison in which human experts rate the same set of generated documents and the agreement rate with the four-LLM judges falls below a pre-specified threshold.
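A minimal sketch of such a check, under stated assumptions: each document has one averaged human rating and one averaged LLM-judge rating, agreement is measured as the fraction of document pairs that both sides order the same way, and the 0.8 threshold is a placeholder that would have to be fixed before looking at the data. The ratings below are invented for illustration and are not results from the paper.

```python
from itertools import combinations

THRESHOLD = 0.8  # hypothetical pre-specified agreement threshold


def pairwise_agreement(human, judges):
    """Fraction of document pairs ordered the same way by humans and judges.

    Pairs that are tied under either rating are skipped.
    """
    agree = 0
    total = 0
    for a, b in combinations(range(len(human)), 2):
        h = human[a] - human[b]
        j = judges[a] - judges[b]
        if h == 0 or j == 0:
            continue
        total += 1
        if (h > 0) == (j > 0):
            agree += 1
    return agree / total if total else float("nan")


def judges_validated(human, judges, threshold=THRESHOLD):
    """True if LLM judges meet the pre-specified agreement threshold."""
    return pairwise_agreement(human, judges) >= threshold


# Invented ratings for four documents (human mean, LLM-judge mean).
human_ratings = [4.0, 2.0, 5.0, 3.0]
judge_ratings = [4.5, 3.0, 4.0, 3.5]
agreement = pairwise_agreement(human_ratings, judge_ratings)
passed = judges_validated(human_ratings, judge_ratings)
```

Pairwise ordering is used here because it ignores differences in scale between human and LLM scores; a rank correlation would serve the same purpose.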
Original abstract
High-quality source code documentation is vital yet often neglected, especially in critical domains like healthcare where reliability and maintainability are essential. We presented an AI powered framework that automates documentation generation from code and repositories using eight state of the art Large Language Models (LLMs), including GPT, Gemini, Qwen, and LLaMA variants. Built on the PocketFlow orchestration framework, the system applies modular pipelines and advanced prompt engineering to produce structured, context aware documentation. To ensure quality and guide model selection, we introduced a MultiLLMasJudges evaluation framework, where four independent LLMs assess outputs across nine criteria, such as Completeness, Clarity, and Faithfulness. Experiments conducted on an open-source medical physics library, demonstrated showed a 42% performance gap between top and bottom models. By combining diverse model outputs, optimized prompting, and rigorous evaluation, our approach enhances documentation quality and reduces manual effort, especially in safety critical healthcare software.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an AI-powered framework using eight state-of-the-art LLMs (including GPT, Gemini, Qwen, and LLaMA variants) orchestrated via PocketFlow with modular pipelines and prompt engineering to generate structured code documentation from repositories. It introduces a MultiLLM-as-Judges evaluation framework in which four independent LLMs score outputs on nine criteria (e.g., Completeness, Clarity, Faithfulness). Experiments on an open-source medical physics library report a 42% performance gap between top and bottom models, with the goal of improving documentation quality and reducing manual effort in safety-critical healthcare software.
Significance. If the LLM judges are shown to correlate with human assessments, the framework could meaningfully reduce documentation burden in regulated domains. The choice of an open-source medical-physics library and the use of multiple generator and judge models are positive steps toward reproducibility. However, the absence of any human validation or ground-truth calibration for the nine-criteria scoring directly limits the interpretability of the headline 42% gap.
major comments (2)
- [Abstract] Abstract (performance-gap claim): The reported 42% gap is produced entirely by four LLM judges scoring subjective criteria including Faithfulness and Completeness; no inter-judge agreement statistics, correlation with human raters, or calibration against existing library documentation are described. This evaluation method is load-bearing for the central claim that the approach enhances documentation quality.
- [MultiLLM-as-Judges framework] Evaluation framework description: The MultiLLM-as-Judges setup risks circularity because LLM outputs are judged by other LLMs on the same subjective dimensions without independent human validation or ground-truth comparison, directly undermining the reliability of model rankings.
minor comments (1)
- [Abstract] Abstract: The phrase 'demonstrated showed' is a grammatical error and should be corrected.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We respond point-by-point to the major comments and indicate planned revisions where appropriate.
- Referee: [Abstract] Abstract (performance-gap claim): The reported 42% gap is produced entirely by four LLM judges scoring subjective criteria including Faithfulness and Completeness; no inter-judge agreement statistics, correlation with human raters, or calibration against existing library documentation are described. This evaluation method is load-bearing for the central claim that the approach enhances documentation quality.
  Authors: We agree that the 42% gap rests entirely on LLM judges without reported inter-judge agreement, human correlation, or calibration to existing documentation. This limits the strength of claims about quality improvement. We will revise the abstract to qualify the performance-gap statement and add a limitations subsection that reports any inter-judge agreement statistics computable from the existing judge outputs.
  Revision: yes
- Referee: [MultiLLM-as-Judges framework] Evaluation framework description: The MultiLLM-as-Judges setup risks circularity because LLM outputs are judged by other LLMs on the same subjective dimensions without independent human validation or ground-truth comparison, directly undermining the reliability of model rankings.
  Authors: The risk of circularity is a substantive concern. Although we used separate model sets for generation and judging, the absence of human validation or ground-truth comparison does weaken the interpretability of the rankings. We will expand the framework section to explicitly discuss this limitation and frame the multi-judge approach as an initial, scalable proxy rather than a validated substitute for human assessment.
  Revision: partial
- We lack human rater data and therefore cannot supply correlation statistics or calibration against existing library documentation.
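The inter-judge agreement statistics mentioned in the first response can be computed from judge outputs alone. The sketch below uses the mean pairwise Pearson correlation between judges as one possible statistic; the choice of statistic, the judge names, and the scores are assumptions for illustration, and chance-corrected measures such as Krippendorff's alpha may be more appropriate for ordinal rubric scores.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_pairwise_correlation(judge_scores):
    """Average Pearson correlation over all pairs of judges."""
    pairs = combinations(sorted(judge_scores), 2)
    return mean([pearson(judge_scores[a], judge_scores[b]) for a, b in pairs])


# Invented scores: each list holds one judge's scores for the same four documents.
judge_scores = {
    "j1": [1, 2, 3, 4],
    "j2": [2, 4, 6, 8],
    "j3": [4, 3, 2, 1],
}
agreement = mean_pairwise_correlation(judge_scores)
```

In this toy data two judges agree perfectly and the third reverses their order, so the average is negative. A low or negative value on real judge outputs would suggest the headline ranking reflects judge disagreement as much as generator quality.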
Circularity Check
No significant circularity; empirical results from LLM judges are measurements, not forced by construction
full rationale
The paper presents an empirical framework for LLM-based documentation generation followed by evaluation via four separate LLMs scoring nine criteria. The reported 42% gap is an observed experimental outcome on the medical-physics library, not a mathematical derivation or fitted parameter. No equations, self-definitional loops, parameter-fitting steps renamed as predictions, or load-bearing self-citations appear in the provided text. The evaluation method is applied to produce the metrics rather than presupposing them, satisfying the requirement for independent content in the central claim.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs can generate accurate and context-aware code documentation when given appropriate prompts.
- domain assumption: Multi-LLM judging provides a reliable proxy for documentation quality.