Recognition: 1 theorem link · Lean theorem
Story Point Estimation Using Large Language Models
Pith reviewed 2026-05-15 15:24 UTC · model grok-4.3
The pith
Large language models predict story points more accurately than supervised deep learning models even with zero training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Without any training data, large language models using zero-shot prompting predict story points for backlog items better than deep neural networks trained on 80 percent of the data from the same project. Adding a small number of examples through few-shot prompting raises accuracy still more. Comparative judgments between pairs of items are not easier for the models to predict than direct story-point values, yet they remain useful as few-shot examples for improving story-point predictions.
What carries the argument
Zero-shot and few-shot prompting of large language models applied directly to item titles and descriptions to output story point estimates.
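As a concrete illustration of this machinery, here is a minimal Python sketch of zero-shot story-point prompting. The prompt wording, the Fibonacci-style scale hint, and the call_llm helper are illustrative assumptions rather than the paper's exact template; only the inputs (item title and description) and the single numeric output come from the paper.

```python
# Minimal sketch of zero-shot story point prompting. The prompt text and the
# call_llm placeholder are assumptions for illustration, not the paper's
# actual template or API.
import re

def build_zero_shot_prompt(title: str, description: str) -> str:
    return (
        "You are an experienced agile developer estimating effort.\n"
        f"Backlog item title: {title}\n"
        f"Description: {description}\n"
        "Answer with a single story point value (e.g., 1, 2, 3, 5, 8, 13) "
        "and nothing else."
    )

def parse_story_points(raw_output: str) -> float | None:
    """Extract the first number from the model's free-form reply."""
    match = re.search(r"\d+(\.\d+)?", raw_output)
    return float(match.group()) if match else None

# Usage (call_llm stands in for whatever chat-completion API is used):
# reply = call_llm(build_zero_shot_prompt(item["title"], item["description"]))
# estimate = parse_story_points(reply)
```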
If this is right
- Software teams can apply LLMs to estimate effort on new projects without first collecting large amounts of labeled historical data.
- A few human-annotated examples or pairwise comparisons can be added to prompting to raise prediction accuracy.
- Comparative judgments between items serve as effective few-shot examples even if they are not easier to predict than direct values.
- LLMs reduce dependence on project-specific training datasets for agile effort estimation tasks.
Where Pith is reading between the lines
- Teams without historical data could adopt LLM-based estimation as a starting point and refine it with minimal examples.
- The same prompting strategy might extend to related subjective judgments such as priority ranking or risk assessment.
- Combining zero-shot LLM outputs with actual time logs from completed tasks could create hybrid estimators for future projects.
Load-bearing premise
The 16 projects and four language models tested are representative enough that the zero-shot advantage will hold for other projects and models.
What would settle it
A fresh software project where a deep learning model trained on 80 percent of its own data produces more accurate story point predictions than zero-shot LLM prompting would falsify the central claim.
Original abstract
This study investigates the use of large language models (LLMs) for story point estimation. Story points are unitless, project-specific effort estimates that help developers on the scrum team forecast which product backlog items they plan to complete in a sprint. To facilitate this process, machine learning models, especially deep neural networks, have been applied to predict the story points based on the title and description of each item. However, such machine learning models require sufficient amounts of training data (with ground truth story points annotated by human developers) from the same software project to achieve decent prediction performance. This motivated us to explore whether LLMs are capable of (RQ1) predicting story points without training data or (RQ2) with only a few training data points. Our empirical results with four LLMs on 16 software projects show that, without any training data (zero-shot prompting), LLMs can predict story points better than supervised deep learning models trained on 80% of the data. The prediction performance of LLMs can be further improved with a few training examples (few-shot prompting). In addition, a recent study explored the use of comparative judgments (between a given pair of items which one requires more effort to implement) instead of directly annotating the story points to reduce the cognitive burden on developers. Therefore, this study also explores (RQ3) whether comparative judgments are easier to predict than story points for LLMs and (RQ4) whether comparative judgments can serve as few-shot examples for LLMs to improve their predictions of story points. Empirical results suggest that it is not easier for LLMs to predict comparative judgments than to directly estimate the story points, but comparative judgments can serve as few-shot examples to improve the LLMs' prediction performance as well as the human-annotated story points.
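For RQ4, the following is a rough sketch of how pairwise comparative judgments could be serialized as few-shot context before asking for a story-point estimate. The field names and wording are assumptions for illustration; the paper's actual example format is not reproduced here.

```python
# Hedged sketch: serialize comparative judgments as few-shot context (RQ4).
# Field names ('item_a', 'item_b', 'harder') and prompt wording are
# illustrative assumptions, not the paper's format.
def comparative_examples_block(judgments: list[dict]) -> str:
    """judgments: [{'item_a': str, 'item_b': str, 'harder': 'A' or 'B'}, ...]"""
    lines = ["Effort comparisons made by developers on this project:"]
    for j in judgments:
        lines.append(
            f"- Item A: {j['item_a']}\n  Item B: {j['item_b']}\n"
            f"  Item requiring more effort: {j['harder']}"
        )
    return "\n".join(lines)

def build_few_shot_prompt(judgments: list[dict], title: str, description: str) -> str:
    return (
        comparative_examples_block(judgments)
        + "\n\nNow estimate the story points for this backlog item.\n"
        + f"Title: {title}\nDescription: {description}\n"
        + "Answer with a single number."
    )
```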
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates large language models for story point estimation in agile software projects. It claims that zero-shot prompting allows four LLMs to outperform supervised deep learning models trained on 80% of the data across 16 public projects (RQ1), that few-shot prompting further improves performance (RQ2), that comparative judgments are not easier for LLMs to predict than direct story points (RQ3), and that comparative judgments can serve as effective few-shot examples (RQ4).
Significance. If the zero-shot superiority result holds after contamination checks and full methodological disclosure, the work would be significant for software engineering practice: it would demonstrate that LLMs can deliver usable effort estimates without project-specific labeled data, reducing the data-collection barrier that currently limits supervised approaches and potentially enabling broader adoption of automated estimation in small or new projects.
major comments (3)
- [Abstract, §4] Abstract and §4 (empirical results): the central claim that zero-shot LLMs outperform DL models trained on 80% of the data is load-bearing yet rests on an unverified assumption that the 16 public projects contain no pretraining overlap with the LLMs. No membership-inference test, temporal cutoff analysis, or decontamination step is described; without it the performance edge may reflect memorization rather than generalization.
- [§3] §3 (methodology): the prompting strategies, exact model versions, temperature settings, and output parsing rules are not specified in sufficient detail to allow reproduction or to rule out prompt-engineering artifacts. The evaluation metrics (MAE, accuracy, or rank correlation?) and any statistical significance tests comparing zero-shot vs. supervised baselines are also omitted.
- [§4, Table 2] §4 and Table 2 (project selection): the 16 projects are drawn from public issue trackers, but no cross-project validation scheme, project-size stratification, or control for domain variability is reported. This weakens the generalizability assertion for both the zero-shot and few-shot results.
minor comments (2)
- [§2] Notation for story-point scales and comparative-judgment encoding should be defined once in §2 and used consistently; currently the mapping from LLM output tokens to numeric story points is described only informally.
- [Figure 3] Figure 3 (few-shot curves) lacks error bars or confidence intervals, making it difficult to judge whether the reported gains over zero-shot are statistically reliable.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point-by-point below, indicating the revisions we will make to improve methodological transparency and address potential threats to validity.
Point-by-point responses
Referee: [Abstract, §4] Abstract and §4 (empirical results): the central claim that zero-shot LLMs outperform DL models trained on 80% of the data is load-bearing yet rests on an unverified assumption that the 16 public projects contain no pretraining overlap with the LLMs. No membership-inference test, temporal cutoff analysis, or decontamination step is described; without it the performance edge may reflect memorization rather than generalization.
Authors: We agree that potential contamination from public issue trackers is a valid concern for LLM-based claims. The original manuscript did not include explicit decontamination or membership-inference tests. In the revision we will add a dedicated subsection in §4 that (1) lists the known training cutoffs for each of the four LLMs, (2) performs a temporal analysis using issue creation dates to identify post-cutoff projects, and (3) reports zero-shot results restricted to those post-cutoff issues. We will also note the absence of full membership-inference testing as a limitation and discuss why story-point labels are unlikely to have been directly memorized even if issue text was seen. revision: yes
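A minimal sketch of the proposed temporal analysis, assuming hypothetical model cutoff dates and issue fields (none taken from the paper): keep only issues created after a model's training cutoff, so zero-shot results on that subset are less likely to reflect memorization.

```python
# Illustrative sketch of the temporal cutoff filter the rebuttal proposes.
# Cutoff dates, model names, and issue fields are placeholders.
from datetime import date

MODEL_CUTOFFS = {  # hypothetical training cutoffs for the evaluated LLMs
    "model_a": date(2024, 10, 1),
    "model_b": date(2025, 1, 1),
}

def post_cutoff_issues(issues: list[dict], model: str) -> list[dict]:
    """issues: [{'id': ..., 'created': date, ...}, ...]"""
    cutoff = MODEL_CUTOFFS[model]
    return [i for i in issues if i["created"] > cutoff]
```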
Referee: [§3] §3 (methodology): the prompting strategies, exact model versions, temperature settings, and output parsing rules are not specified in sufficient detail to allow reproduction or to rule out prompt-engineering artifacts. The evaluation metrics (MAE, accuracy, or rank correlation?) and any statistical significance tests comparing zero-shot vs. supervised baselines are also omitted.
Authors: We acknowledge that the current §3 lacks the level of detail needed for reproducibility. In the revised manuscript we will expand §3 with: exact model identifiers and versions (e.g., gpt-4-0613, Llama-2-70b-chat), temperature=0 for all runs, the complete zero-shot and few-shot prompt templates, and the deterministic parsing rules used to extract numeric story-point values from free-form LLM output. We will also state that Mean Absolute Error (MAE) is the primary metric, supplemented by thresholded accuracy, and will add Wilcoxon signed-rank tests with p-values for all zero-shot versus supervised comparisons. revision: yes
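A sketch of the evaluation described here, assuming per-item predictions are already collected: MAE as the primary metric and a paired Wilcoxon signed-rank test on per-item absolute errors. Variable names are illustrative; the paper's exact protocol may differ.

```python
# Sketch of MAE plus a paired Wilcoxon signed-rank test comparing zero-shot
# LLM errors against the supervised baseline. Names are illustrative.
import numpy as np
from scipy.stats import wilcoxon

def mean_absolute_error(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def compare_models(y_true, llm_pred, baseline_pred):
    llm_err = np.abs(np.asarray(y_true, float) - np.asarray(llm_pred, float))
    base_err = np.abs(np.asarray(y_true, float) - np.asarray(baseline_pred, float))
    stat, p_value = wilcoxon(llm_err, base_err)  # paired test on per-item errors
    return {
        "mae_llm": mean_absolute_error(y_true, llm_pred),
        "mae_baseline": mean_absolute_error(y_true, baseline_pred),
        "wilcoxon_p": p_value,
    }
```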
Referee: [§4, Table 2] §4 and Table 2 (project selection): the 16 projects are drawn from public issue trackers, but no cross-project validation scheme, project-size stratification, or control for domain variability is reported. This weakens the generalizability assertion for both the zero-shot and few-shot results.
Authors: The 16 projects were selected to span different domains and sizes (as summarized in Table 2), but we did not explicitly describe stratification or cross-project protocols. In the revision we will add a paragraph detailing the selection criteria, report project sizes (number of issues) and primary domains, and include a supplementary analysis that stratifies MAE results by project size quartiles. We will also clarify that the supervised baselines use within-project 80/20 splits and will discuss cross-project generalization as an explicit limitation and direction for future work. revision: partial
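A brief sketch of the proposed size-stratified analysis, assuming a per-item results table with hypothetical column names: projects are bucketed into issue-count quartiles and MAE is reported per bucket.

```python
# Sketch of MAE stratified by project-size quartile. Column names are
# illustrative assumptions, not the paper's schema.
import pandas as pd

def mae_by_size_quartile(results: pd.DataFrame) -> pd.Series:
    """results columns: 'project', 'n_issues', 'abs_error' (one row per item)."""
    sizes = results.groupby("project")["n_issues"].first()
    quartile = pd.qcut(sizes, 4, labels=["Q1", "Q2", "Q3", "Q4"])
    results = results.assign(size_quartile=results["project"].map(quartile))
    return results.groupby("size_quartile")["abs_error"].mean()
```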
Circularity Check
No circularity: empirical LLM comparisons rest on external benchmarks and direct measurements
Full rationale
The paper conducts an empirical evaluation of zero-shot and few-shot LLM prompting for story point estimation, directly comparing performance metrics against supervised deep learning models trained on 80% of the same 16 project datasets. All claims derive from observable prediction accuracy on held-out items rather than any self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the result to its own inputs. The methodology is self-contained and externally replicable without invoking uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can interpret task titles and descriptions to estimate relative effort without domain-specific training data from the target project.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Linked passage: "zero-shot prompting... LLMs can predict story points better than supervised deep learning models trained on 80% of the data"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] K. Schwaber and J. Sutherland, "The Scrum Guide: The Definitive Guide to Scrum: The Rules of the Game," https://scrumguides.org/scrum-guide.html, 2020, accessed: 2026-02-10.
- [2] M. Cohn, Agile Estimating and Planning. Pearson Education, 2005.
- [3] M. Choetkiertikul, H. K. Dam, T. Tran, T. Pham, A. Ghose, and T. Menzies, "A deep learning model for estimating story points," IEEE Transactions on Software Engineering, vol. 45, no. 7, pp. 637–656, 2018.
- [4] M. Fu and C. Tantithamthavorn, "GPT2SP: A transformer-based agile story point estimation approach," IEEE Transactions on Software Engineering, vol. 49, no. 2, pp. 611–625, 2022.
- [5] M. Shepperd, S. Counsell, R. C. Sharp, and B. Bowes, "A systematic review of software effort estimation using machine learning," Information and Software Technology, vol. 54, no. 1, pp. 41–54, 2012.
- [6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
- [7] J. Wei, X. Wang, D. Schuurmans, M. Bosma et al., "Chain-of-thought prompting elicits reasoning in large language models," in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022.
- [9] Z. Zheng, K. Ning, Q. Zhong, J. Chen et al., "Towards an understanding of large language models in software engineering tasks," arXiv preprint arXiv:2308.11396, 2023. Available: https://arxiv.org/abs/2308.11396
- [10] M. M. Khan, X. Xi, A. Meneely, and Z. Yu, "Efficient story point estimation with comparative learning," 2025. Available: https://arxiv.org/abs/2507.14642
- [11] L. L. Thurstone, "A law of comparative judgment," Psychological Review, vol. 34, pp. 273–286, 1927.
- [12] R. A. Bradley and M. E. Terry, "Rank analysis of incomplete block designs: I. The method of paired comparisons," Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952.
- [13] B. W. Boehm, C. Abts, A. W. Brown, S. Chulani, B. K. Clark, E. Horowitz, R. Madachy, D. J. Reifer, and B. Steece, Software Cost Estimation with COCOMO II. Prentice Hall, 2000.
- [14] M. Jørgensen, "A review of studies on expert estimation of software development effort," Journal of Systems and Software, vol. 70, no. 1–2, pp. 37–60, 2004.
- [15] A. J. Albrecht, "Measuring application development productivity," Proceedings of the Joint SHARE/GUIDE/IBM Application Development Symposium, pp. 83–92, 1979.
- [16] G. Karner, "Resource estimation for objectory projects," in Objective Systems SF AB Working Paper, 1993.
- [17] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, "Chain-of-thought prompting elicits reasoning in large language models," in Advances in Neural Information Processing Systems, vol. 35, 2022. Available: https://arxiv.org/abs/2201.11903
- [18] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, "Large language models for software engineering: A systematic literature review," ACM Transactions on Software Engineering and Methodology, vol. 33, no. 8, 2024. Available: https://arxiv.org/abs/2308.10620
- [19] Q. Zhang, C. Fang, Y. Xie, Y. Zhang, Y. Yang, W. Sun, S. Yu, and Z. Chen, "A survey on large language models for software engineering," arXiv preprint arXiv:2312.15223, 2023. Available: https://arxiv.org/abs/2312.15223
- [20] S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. Clement, D. Drain, D. Jiang, D. Tang et al., "CodeXGLUE: A machine learning benchmark dataset for code understanding and generation," arXiv preprint arXiv:2102.04664, 2021.
- [21] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Pinto, J. Kaplan, H. Edwards et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021. Available: https://arxiv.org/abs/2107.03374
- [23] "RepoBench: Benchmarking repository-level code auto-completion systems." Available: https://arxiv.org/abs/2306.03091
- [24] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, "SWE-bench: Can language models resolve real-world GitHub issues?" arXiv preprint arXiv:2310.06770, 2023. Available: https://arxiv.org/abs/2310.06770
- [25] A. Eghbali and M. Pradel, "De-Hallucinator: Mitigating LLM hallucinations in code generation tasks via iterative grounding," arXiv preprint arXiv:2401.01701, 2024. Available: https://arxiv.org/abs/2401.01701
- [26] DeepSeek-AI et al., "DeepSeek-V3.2: Pushing the frontier of open large language models," arXiv preprint arXiv:2512.02556, 2025. Available: https://arxiv.org/abs/2512.02556
- [27] Google, "Gemini 2.5 Flash-Lite," https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash-lite, 2025, accessed: 2026-03-05.
- [28] OpenAI, "GPT-5 nano model," https://developers.openai.com/api/docs/models/gpt-5-nano, 2025, accessed: 2026-03-05.
- [29] Moonshot AI, "Kimi K2: Open Agentic Intelligence," arXiv preprint arXiv:2507.20534, 2025. Available: https://arxiv.org/abs/2507.20534