Deterministic vs. Probabilistic Summarisation: An Empirical Trade-off Study in Design Pattern Centric Java Code
Pith reviewed 2026-05-22 05:14 UTC · model grok-4.3
The pith
Probabilistic code summarizers deliver richer semantic context than deterministic ones, which produce shorter and fully reproducible outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study shows that probabilistic summaries generated by the Mixtral LLM achieve stronger semantic alignment and broader contextual coverage than those from a rule-based NLG pipeline or a SWUM-based approach. Deterministic methods, by contrast, consistently produce more concise and exactly reproducible summaries. These differences hold across automated similarity metrics and rubric judgments, with relative trends remaining stable despite prompt sensitivity in the probabilistic case.
What carries the argument
The empirical comparison of a rule-based natural language generation pipeline, a Software Word Usage Model approach, and an LLM pipeline using Mixtral on 150 design-pattern Java files, evaluated through BERTScore, cosine similarity, and five-dimension rubric scoring by Llama 3.
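To make the similarity side of this evaluation concrete, here is a minimal sketch of cosine similarity between a generated summary and a human reference. It assumes plain term-frequency vectors over whitespace tokens; the source does not say how the paper builds its vectors, so the function name, the tokenisation, and the sample sentences are all hypothetical. BERTScore is not reproduced here because it needs a pretrained model.

```python
import math
from collections import Counter


def cosine_similarity(candidate: str, reference: str) -> float:
    """Cosine similarity between term-frequency vectors of two texts.

    Assumption: lowercase whitespace tokenisation. The paper may use
    embedding vectors instead; this is only an illustration.
    """
    a = Counter(candidate.lower().split())
    b = Counter(reference.lower().split())
    # Dot product over the tokens the two texts share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical summary and reference, for illustration only.
generated = "creates a single shared instance"
reference = "ensures a single instance exists"
score = cosine_similarity(generated, reference)
```

A score near 1 means the two texts use nearly the same words in similar proportions; a score of 0 means they share no tokens. Lexical overlap of this kind cannot credit paraphrases, which is one reason an embedding-based metric is usually reported alongside it.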
If this is right
- Teams prioritizing semantic depth and contextual accuracy in code summaries should favor probabilistic LLM pipelines.
- Teams needing brief, identical outputs across runs should select deterministic rule-based or SWUM pipelines.
- LLM-based summarization requires multiple runs or prompt tuning to manage output variability.
- The identified trade-off supplies concrete selection rules for choosing summarization techniques in documentation tools.
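The reproducibility and brevity side of these selection rules can be checked mechanically. The sketch below is an illustration under stated assumptions, not the paper's procedure: the function names, the dictionary layout (file name mapped to the summaries from repeated runs), and the sample summaries are hypothetical.

```python
from statistics import mean


def exact_reproducibility(runs_by_file):
    """Fraction of files whose summaries are identical across all runs."""
    if not runs_by_file:
        raise ValueError("need at least one file")
    stable = sum(
        1 for outputs in runs_by_file.values() if len(set(outputs)) == 1
    )
    return stable / len(runs_by_file)


def mean_word_count(runs_by_file):
    """Average summary length in whitespace-separated words."""
    return mean(
        len(summary.split())
        for outputs in runs_by_file.values()
        for summary in outputs
    )


# Hypothetical outputs from three runs per file.
deterministic_runs = {
    "Singleton.java": ["Provides a single shared instance."] * 3,
    "Observer.java": ["Notifies registered listeners."] * 3,
}
probabilistic_runs = {
    "Singleton.java": [
        "Ensures only one instance exists and exposes it globally.",
        "Ensures only one instance exists and exposes it globally.",
        "Guarantees a single instance with a global access point.",
    ],
    "Observer.java": ["Notifies registered listeners of state changes."] * 3,
}

det_rate = exact_reproducibility(deterministic_runs)
prob_rate = exact_reproducibility(probabilistic_runs)
```

On this toy data the deterministic pipeline reproduces every summary exactly, while one of the two probabilistic files varies between runs. Exact string equality is a strict criterion; a softer variant could compare runs with a similarity metric instead.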
Where Pith is reading between the lines
- Hybrid pipelines that combine deterministic brevity with selective probabilistic enrichment could mitigate the observed limitations of each approach.
- Repeating the comparison on non-Java languages or non-design-pattern code would test whether the trade-off generalizes beyond the current testbed.
- Replacing the LLM judge with human evaluators on the same summaries would provide a direct check on the reliability of the rubric proxy used here.
Load-bearing premise
That rubric scores assigned by Llama 3 across accuracy, conciseness, adequacy, code-context awareness, and design-pattern fidelity serve as a reliable stand-in for human expert judgments of summary quality.
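One way to probe this premise is to compare judge scores with human scores on the same summaries. The sketch below computes Spearman's rank correlation and the mean absolute difference in pure Python. The function names and the scores are hypothetical, and the code makes no claim about how the authors run their own analysis.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_absolute_difference(x, y):
    """Average absolute gap between paired scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical rubric scores (1 to 5) for the same six summaries.
judge_scores = [4, 5, 3, 4, 2, 5]
human_scores = [4, 4, 3, 5, 2, 5]
rho = spearman(judge_scores, human_scores)
gap = mean_absolute_difference(judge_scores, human_scores)
```

A high rho with a large mean absolute difference would suggest the judge orders summaries like a human but is systematically more lenient or more strict, which is why reporting both numbers is useful.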
What would settle it
Direct human ratings in which experts assign higher accuracy or context scores to the deterministic summaries than to the probabilistic ones would undermine the claim that probabilistic methods hold an advantage in semantic depth.
Background: Automated code summarisation supports program comprehension and documentation, yet the relative strengths and limitations of deterministic (heuristic-based) and probabilistic (LLM-based) pipelines remain unclear. Aims: This paper presents a controlled empirical comparison of these paradigms for intent-oriented design-pattern code summarisation. Method: Using design-pattern-centric Java code as a structured testbed (150 files from three open-source repositories covering nine patterns), we compare a rule-based natural language generation (NLG) pipeline, a Software Word Usage Model (SWUM)-based approach, and a probabilistic pipeline based on the Mixtral LLM. Summaries are evaluated against human references using BERTScore and cosine similarity, complemented by rubric-based judgements produced by Llama 3 across five dimensions: accuracy, conciseness, adequacy, code-context awareness, and design-pattern fidelity. Statistical analysis includes Wilcoxon signed-rank tests (with effect sizes), Friedman tests with post-hoc corrections, and Spearman correlation for sensitivity analysis of rubric consistency. Results: Probabilistic summaries show stronger semantic alignment and richer contextual coverage, while deterministic approaches produce more concise and fully reproducible outputs. Prompt-sensitivity and multi-run analyses indicate variability in LLM outputs, though relative trends remain stable. Conclusions: A clear trade-off emerges: probabilistic methods favour semantic depth and contextual accuracy, whereas deterministic pipelines are preferable for brevity and reproducibility. These findings provide practical guidance for selecting code summarisation techniques.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a controlled empirical comparison of deterministic (rule-based NLG and SWUM-based) versus probabilistic (Mixtral LLM) pipelines for intent-oriented summarization of design-pattern-centric Java code. It uses a corpus of 150 files from three open-source repositories covering nine patterns, evaluating summaries against human references via BERTScore and cosine similarity, supplemented by Llama 3 rubric scores on five dimensions (accuracy, conciseness, adequacy, code-context awareness, design-pattern fidelity). Statistical tests include Wilcoxon signed-rank (with effect sizes), Friedman with post-hoc corrections, and Spearman correlation. The central claim is a clear trade-off: probabilistic methods favor semantic depth and contextual accuracy, while deterministic pipelines are preferable for brevity and reproducibility.
Significance. If the results hold after addressing validation gaps, the work offers practical guidance for selecting summarization techniques in software engineering contexts such as documentation and program comprehension. The structured testbed using design patterns is a methodological strength, as is the combination of automatic metrics with rubric evaluation and the examination of LLM output variability. These elements could inform tool selection where reproducibility or semantic richness is prioritized.
major comments (2)
- [Evaluation section (rubric-based judgements)] The manuscript relies on Llama 3 rubric scores across the five dimensions as a primary proxy for human quality judgments to support the trade-off claim, but reports no human-Llama 3 correlation study, inter-rater agreement metrics, or calibration against human coders on the 150-file corpus. This is load-bearing because one comparator is itself an LLM (Mixtral), raising the risk that any systematic bias in Llama 3 could inflate the apparent advantage of the probabilistic arm without reflecting genuine summary quality.
- [Results section (statistical reporting)] The abstract and results describe stronger semantic alignment for probabilistic summaries via BERTScore/cosine and rubric scores, yet the manuscript does not report raw metric values, exact p-values, or effect sizes from the Wilcoxon signed-rank and Friedman tests for each dimension. Without these, the practical magnitude and reliability of the claimed trade-off cannot be fully assessed.
minor comments (2)
- [Abstract] The abstract refers to 'three open-source repositories' without naming them; adding the repository identifiers would enhance reproducibility.
- [Throughout] Acronyms such as SWUM and NLG are not consistently expanded on first use in each section.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. The comments highlight important aspects of evaluation validity and statistical transparency that we will address in the revision to strengthen the work.
- Referee: Evaluation section (rubric-based judgements): The manuscript relies on Llama 3 rubric scores across the five dimensions as a primary proxy for human quality judgments to support the trade-off claim, but reports no human-Llama 3 correlation study, inter-rater agreement metrics, or calibration against human coders on the 150-file corpus. This is load-bearing because one comparator is itself an LLM (Mixtral), raising the risk that any systematic bias in Llama 3 could inflate the apparent advantage of the probabilistic arm without reflecting genuine summary quality.
  Authors: We agree that the absence of direct validation between Llama 3 rubric scores and human judgments represents a limitation, especially when evaluating outputs from another LLM. Our design used Llama 3 to enable scalable, consistent assessment across the full corpus where full human annotation would be resource-intensive. In the revised manuscript we will add a calibration subsection reporting results from a human study on a stratified subset of 30 files (balanced across patterns and repositories). Human raters will apply the same five-dimension rubric, and we will report Spearman correlations, mean absolute differences, and any inter-rater agreement statistics between Llama 3 and human scores. This addition will allow readers to assess the degree of alignment and any potential bias.
  Revision: yes
- Referee: Results section (statistical reporting): The abstract and results describe stronger semantic alignment for probabilistic summaries via BERTScore/cosine and rubric scores, yet the manuscript does not report raw metric values, exact p-values, or effect sizes from the Wilcoxon signed-rank and Friedman tests for each dimension. Without these, the practical magnitude and reliability of the claimed trade-off cannot be fully assessed.
  Authors: We accept that fuller numerical reporting is necessary for readers to judge effect magnitude and statistical reliability. The revised Results section will include supplementary tables presenting (i) mean and standard deviation for every automatic metric and rubric dimension, (ii) exact (Bonferroni-corrected) p-values, and (iii) effect sizes (rank-biserial correlation for Wilcoxon tests and Kendall’s W for Friedman tests) for all pairwise and omnibus comparisons. These values will be cross-referenced in the text so that the practical significance of the reported trade-offs can be directly evaluated.
  Revision: yes
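The two effect sizes named in the rebuttal can be sketched in pure Python. This is an illustration under stated assumptions, not the authors' analysis code: the function names and scores are hypothetical, the rank-biserial correlation uses the matched-pairs form (zero differences dropped), and Kendall's W uses the formula without tie correction, so it is only approximate when a row contains ties.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_biserial(x, y):
    """Matched-pairs rank-biserial correlation for paired scores.

    r = (W+ - W-) / (W+ + W-), where W+ and W- are the sums of the
    ranks of |x - y| for positive and negative differences.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        return 0.0  # no non-zero differences, so no direction
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return (w_plus - w_minus) / (w_plus + w_minus)


def kendalls_w(rows):
    """Kendall's W for n rows (files) scored under k methods.

    W = 12 * S / (n^2 * (k^3 - k)), with S the sum of squared
    deviations of the per-method rank sums from their mean.
    No tie correction is applied (assumption).
    """
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    mean_rank_sum = n * (k + 1) / 2
    s = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums)
    return 12 * s / (n ** 2 * (k ** 3 - k))


# Hypothetical scores per file, columns ordered NLG, SWUM, LLM.
scores = [[0.61, 0.58, 0.74], [0.55, 0.60, 0.71], [0.63, 0.59, 0.77]]
w = kendalls_w(scores)
r = rank_biserial([row[2] for row in scores], [row[0] for row in scores])
```

Both values are bounded: the rank-biserial correlation lies in [-1, 1] and Kendall's W in [0, 1], so they can be compared across rubric dimensions regardless of the scale of the raw scores.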
Circularity Check
No circularity: empirical comparison rests on external human references and standard metrics
The paper performs a direct empirical comparison of three distinct summarization pipelines (rule-based NLG, SWUM-based, Mixtral LLM) on a fixed corpus of 150 Java files. Results are derived from BERTScore and cosine similarity against human-written references, plus separate Llama 3 rubric scores on five dimensions, followed by standard non-parametric statistical tests (Wilcoxon, Friedman, Spearman). None of these steps involve self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central trade-off claim to its own inputs by construction. The evaluation chain is anchored in external benchmarks (human references and established similarity metrics) rather than internal re-use of the paper's own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Wilcoxon signed-rank test and Friedman test assumptions hold for the paired summary scores and multi-method comparisons.