CommitDistill: A Lightweight Knowledge-Centric Memory Layer for Software Repositories
Pith reviewed 2026-05-20 09:09 UTC · model grok-4.3
The pith
CommitDistill extracts typed knowledge from git commit messages with regex and retrieves it at a 0.750 hit-rate under a 256-character per-query budget.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CommitDistill mines a local git history into typed knowledge units using deterministic regex and surfaces them through a TF-IDF retriever with a calibrated silence threshold that abstains on out-of-distribution queries. On a 12-query benchmark with a 256-character per-query budget, it reaches a 0.750 hit-rate, against 0.333 for BM25 and 0.083 for git log --grep.
What carries the argument
Typed knowledge units (Facts, Skills, Patterns) extracted by deterministic regex from commit messages and retrieved by TF-IDF with an abstention threshold at theta = 2.5.
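A minimal sketch of what regex-based typed extraction could look like. The three patterns, the `Unit` dataclass, and the `extract_units` function are illustrative assumptions; the paper's actual patterns are not reproduced on this page.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns, chosen only for illustration.
PATTERNS = {
    "Fact": re.compile(r"\b(?:because|due to|caused by)\b", re.IGNORECASE),
    "Skill": re.compile(r"\b(?:to fix|workaround|how to)\b", re.IGNORECASE),
    "Pattern": re.compile(r"\b(?:always|never|must|should)\b", re.IGNORECASE),
}


@dataclass(frozen=True)
class Unit:
    kind: str    # "Fact", "Skill", or "Pattern"
    text: str    # the commit-message line that matched
    commit: str  # hash of the commit the unit was mined from


def extract_units(commit_hash: str, message: str) -> list[Unit]:
    """Return at most one unit per non-empty line; the first matching type wins."""
    units = []
    for line in message.splitlines():
        line = line.strip()
        if not line:
            continue
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                units.append(Unit(kind, line, commit_hash))
                break
    return units


message = (
    "Fix crash on empty input\n\n"
    "The parser failed because the buffer was never flushed.\n"
    "Always flush before close."
)
units = extract_units("abc123", message)
```

Because the patterns are fixed and applied in a fixed order, the same history always yields the same units, which is the determinism property the paper emphasizes.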
If this is right
- Developers and LLM agents gain a local, inspectable memory substrate that reuses commit history without embeddings or external services.
- Extraction is fast: the abstract reports under 4 seconds for 10,000 commits on a laptop, and the case study yields 1,167 units from 25,000 commits across five repositories.
- The abstention mechanism prevents low-confidence retrieval, which may reduce noisy outputs in agent-assisted coding tasks.
- A four-arm evaluation on 200 time-travel bug fixes shows no statistically detectable lift over control, indicating the need for tighter integration with the downstream task.
Where Pith is reading between the lines
- The same regex-plus-TF-IDF pipeline could be applied to issue threads and pull-request discussions to enlarge the memory layer beyond commits.
- Adding a lightweight update mechanism for new commits would turn the static extraction into an incremental, always-current store.
- Measuring downstream code-change quality when CommitDistill is wired into an LLM agent would test whether the 0.750 hit-rate translates into measurable productivity gains.
Load-bearing premise
Deterministic regex patterns can reliably surface useful, non-redundant knowledge units from commit messages without significant manual tuning or domain-specific rules per repository.
What would settle it
Running the 12-query benchmark on a fresh collection of repositories and finding the 256-character hit-rate drops below 0.5 while noise in the extracted units rises sharply.
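A minimal sketch of how a budget-constrained hit-rate could be scored when rerunning the benchmark. The hit rule used here (an expected string must appear within the first 256 characters of the retrieved text) is an assumption made for illustration; the paper's own hit definition is not given on this page, and `hit_rate_at_budget` is a hypothetical helper.

```python
def hit_rate_at_budget(results, budget=256):
    """results: list of (retrieved_text, expected_substring), one pair per query."""
    hits = 0
    for retrieved, expected in results:
        visible = retrieved[:budget]  # only the first `budget` characters count
        if expected.lower() in visible.lower():
            hits += 1
    return hits / len(results)


# Toy data: the first answer is pushed past the budget, the second fits.
toy_results = [
    ("x" * 300 + " flush the buffer", "flush the buffer"),
    ("Always flush the buffer before close.", "flush the buffer"),
]
rate = hit_rate_at_budget(toy_results)
```

If each query is scored as a single hit or miss, the reported rates are consistent with 9, 4, and 1 hits out of 12 for CommitDistill, BM25, and git log --grep respectively, and the 0.5 line above corresponds to 6 of 12.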
Original abstract
Software repositories accumulate large amounts of unstructured knowledge in commit messages, pull-request discussions, and issue threads, but developers and AI coding assistants rarely reuse this history effectively. Recent work on typed-memory architectures for LLM agents (MemGPT, generative agents, and the PlugMem module of Yang et al.) argues that agent memory should be distilled, typed knowledge rather than raw interaction text. We adapt that stance to a software repository's own git history under a constrained regime: deterministic, dependency-free, local-only, no embeddings. We present CommitDistill, an open-source Python prototype that mines a local git history into typed knowledge units (Facts, Skills, Patterns) using deterministic regex and surfaces them through a TF-IDF retriever with a calibrated silence threshold (theta = 2.5) that abstains on out-of-distribution queries. The artefact is a trust-instrumented memory substrate: deterministic, no external service, inspectable plain-JSON store, tunable abstention. A case study on five public repositories spanning Python, JavaScript, C, and Java (25,000 commits, 1,167 extracted units) reports useful-precision 0.525 at Cohen's kappa = 0.633 on 40 dual-annotated Python units. The decisive finding is budget-constrained retrieval: at a 256-character per-query budget, CommitDistill reaches 0.750 hit-rate on a 12-query benchmark against BM25's 0.333 and git log --grep's 0.083. On a four-arm paired LLM-as-judge evaluation (n=200 time-travel bug-fixes, two judges) covering control, CommitDistill, a body-budget-matched CD-Hybrid, and BM25, no condition produces a statistically detectable lift over control on the headline mean and CD-Hybrid is indistinguishable from BM25 head-to-head. Extraction over 10,000 commits completes in under 4 seconds on a laptop. Source, annotations, baselines, and a reproducibility script accompany this paper.
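For readers unfamiliar with the agreement statistic quoted in the abstract: Cohen's kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance, as kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label sequences. The labels are toy data, not the paper's 40 annotated units, and `cohens_kappa` is a hypothetical helper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences.

    Undefined (division by zero) when chance agreement is exactly 1.
    """
    n = len(labels_a)
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in categories) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Toy data: 1 = "useful", 0 = "not useful".
annotator_a = [1, 1, 1, 0, 0, 0, 1, 0]
annotator_b = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(annotator_a, annotator_b)  # p_o = 0.75, p_e = 0.5
```

On the Landis and Koch scale (reference [20] below), values from 0.61 to 0.80 are conventionally read as substantial agreement, which is where the reported 0.633 falls.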
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents CommitDistill, a deterministic, local-only Python prototype that extracts typed knowledge units (Facts, Skills, Patterns) from git commit messages via regex patterns, stores them in plain JSON, and retrieves them with a TF-IDF retriever plus a calibrated silence threshold (theta=2.5). On five public repositories (25k commits, 1,167 units), it reports useful-precision 0.525 and Cohen's kappa 0.633 on 40 dual-annotated units. The central empirical claim is a 0.750 hit-rate at 256-character budget on a 12-query benchmark, outperforming BM25 (0.333) and git log --grep (0.083). A separate four-arm LLM-as-judge study (n=200) finds no statistically detectable lift over control.
Significance. If the performance claims can be substantiated with adequate statistical support, the work provides a practical, inspectable, dependency-free memory substrate for repository-aware AI coding tools. Strengths include the emphasis on determinism, reproducibility (source, annotations, and script provided), and local execution without embeddings or external services. These align with needs for verifiable agent memory in software engineering.
major comments (3)
- [Evaluation] The 0.750 hit-rate claim on the 12-query benchmark (Evaluation section) lacks query sampling procedure, hit definition, variance estimates, or error bars. With N=12 this provides negligible statistical power; the n=200 LLM-as-judge arm showing no detectable lift over control indicates the small-benchmark result may reflect selection or noise rather than reliable superiority over BM25 and git log --grep.
- [Method] The core assumption that deterministic regex patterns reliably surface useful, non-redundant knowledge units without significant manual tuning or repository-specific rules is load-bearing for the extraction pipeline but receives limited validation; the paper should report ablation or cross-repository consistency metrics for the patterns used to produce the 1,167 units.
- [Retrieval] The silence threshold (theta = 2.5) is presented as calibrated yet no details are given on the calibration procedure, sensitivity analysis, or how abstention affects the reported hit-rate and useful-precision; this directly impacts the abstention behavior on out-of-distribution queries.
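To make the abstention behavior in the last comment concrete, here is a minimal sketch of a TF-IDF retriever that returns nothing when the best score falls below a silence threshold. The tokenizer, the smoothed idf formula, and the `TfidfRetriever` class are assumptions; the paper's exact weighting is not stated on this page, so theta = 2.5 is only meaningful relative to this toy scoring scale.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9_]+", text.lower())


class TfidfRetriever:
    def __init__(self, docs, theta=2.5):
        self.docs = docs
        self.theta = theta
        self.tfs = [Counter(tokenize(d)) for d in docs]
        df = Counter(t for tf in self.tfs for t in tf)  # document frequency
        n = len(docs)
        # Smoothed idf (an assumption, not the paper's formula).
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def score(self, query, i):
        tf = self.tfs[i]
        return sum(tf[t] * self.idf.get(t, 0.0) for t in set(tokenize(query)))

    def retrieve(self, query):
        """Return (doc, score) for the best unit, or None to abstain."""
        if not self.docs:
            return None
        best = max(range(len(self.docs)), key=lambda i: self.score(query, i))
        best_score = self.score(query, best)
        if best_score < self.theta:
            return None  # stay silent rather than return a weak match
        return self.docs[best], best_score


docs = [
    "flush the buffer before closing the socket",
    "retry the request after a timeout",
    "pin the numpy version in requirements",
]
retriever = TfidfRetriever(docs)
```

In this toy corpus, a query with no vocabulary overlap scores 0 and a query made only of a common word such as "the" scores below the threshold, so both produce an abstention. How abstentions are counted in the hit-rate is exactly the detail the referee asks the authors to specify.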
minor comments (2)
- [Abstract] Clarify in the abstract and results how the 12-query benchmark was constructed relative to the 200 time-travel bug-fix cases used in the LLM-as-judge study.
- [Evaluation] The four-arm paired design is a positive; consider reporting per-arm means with confidence intervals and the exact statistical test used to conclude 'no detectable lift'.
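The variance estimates and confidence intervals requested above can be illustrated with a percentile bootstrap over per-query outcomes. This is a sketch under stated assumptions: `bootstrap_ci` is a hypothetical helper, and the outcome vector of 9 hits and 3 misses is merely consistent with a 0.750 hit-rate on 12 queries, not the paper's per-query data.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


outcomes = [1] * 9 + [0] * 3  # assumed: 9 hits, 3 misses
lo, hi = bootstrap_ci(outcomes)
```

With only 12 queries the resulting interval is wide, which is the referee's point about statistical power.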
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help clarify the presentation of our evaluation, extraction method, and retrieval components. We respond to each major comment below and indicate the revisions we will incorporate.
-
Referee: [Evaluation] The 0.750 hit-rate claim on the 12-query benchmark (Evaluation section) lacks query sampling procedure, hit definition, variance estimates, or error bars. With N=12 this provides negligible statistical power; the n=200 LLM-as-judge arm showing no detectable lift over control indicates the small-benchmark result may reflect selection or noise rather than reliable superiority over BM25 and git log --grep.
Authors: We agree that the 12-query benchmark has limited statistical power and that the n=200 LLM-as-judge evaluation (which shows no detectable lift) is the more robust result. In the revision we will add: (1) the query sampling procedure (queries were chosen to span bug fixes, feature additions, and refactoring across the five repositories), (2) a precise definition of a hit (a retrieved unit whose content directly addresses the query intent), and (3) bootstrap confidence intervals for the hit-rate. We will also add an explicit caveat on sample size and emphasize that the small benchmark is illustrative rather than definitive. This constitutes a partial revision because we retain the original numbers while improving transparency.
Revision: partial
-
Referee: [Method] The core assumption that deterministic regex patterns reliably surface useful, non-redundant knowledge units without significant manual tuning or repository-specific rules is load-bearing for the extraction pipeline but receives limited validation; the paper should report ablation or cross-repository consistency metrics for the patterns used to produce the 1,167 units.
Authors: We acknowledge that additional validation of the regex patterns would strengthen the claims. Although the patterns were intentionally kept general and deterministic, we will add in the revised manuscript an ablation study that removes each pattern category in turn and reports the resulting change in unit count and downstream retrieval metrics. We will also report cross-repository consistency (e.g., Jaccard overlap of extracted units and type distribution across the five repositories). These analyses will be presented in a new subsection of the Method section.
Revision: yes
-
Referee: [Retrieval] The silence threshold (theta = 2.5) is presented as calibrated yet no details are given on the calibration procedure, sensitivity analysis, or how abstention affects the reported hit-rate and useful-precision; this directly impacts the abstention behavior on out-of-distribution queries.
Authors: We agree that the calibration details for theta = 2.5 were insufficient. In the revision we will describe the calibration procedure (performed on a held-out development set of queries to balance precision against abstention rate), include a sensitivity analysis over theta values from 1.0 to 4.0 showing effects on hit-rate and useful-precision, and discuss how the threshold governs abstention on out-of-distribution queries. This will clarify the relationship between abstention and the reported metrics.
Revision: yes
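The sensitivity analysis promised in the last response could take roughly the following shape: sweep theta over 1.0 to 4.0 and record how abstention and hit-rate move together. The `sweep_theta` helper, the step size of 0.5, and the toy scores are all assumptions; a real analysis would use the retriever's top score per query on a held-out set.

```python
def sweep_theta(queries, thetas):
    """queries: list of (top_score, top_result_is_relevant) pairs."""
    rows = []
    for theta in thetas:
        answered = [(s, rel) for s, rel in queries if s >= theta]
        hits = sum(rel for _, rel in answered)
        rows.append({
            "theta": theta,
            "abstention_rate": 1 - len(answered) / len(queries),
            "hit_rate": hits / len(queries),  # abstentions count as misses here
            "precision_answered": hits / len(answered) if answered else None,
        })
    return rows


# Toy data: two relevant top results, three irrelevant ones.
toy_queries = [(5.1, True), (3.2, True), (2.7, False), (1.4, False), (0.6, False)]
thetas = [1.0 + 0.5 * i for i in range(7)]  # 1.0, 1.5, ..., 4.0
rows = sweep_theta(toy_queries, thetas)
```

Raising theta trades hit-rate for precision among answered queries. Whether an abstention on an out-of-distribution query should count as a miss or as a correct silence is a scoring choice the revision would need to state.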
Circularity Check
No significant circularity; empirical system with external baselines
The paper describes a deterministic regex-based extraction pipeline and TF-IDF retriever with a fixed threshold, evaluated against external baselines (BM25, git log --grep) and independent LLM-as-judge and dual-annotation protocols. No equations, predictions, or first-principles claims reduce to fitted parameters or self-citations by construction. The 12-query benchmark and n=200 arm are presented as direct measurements rather than derived outputs. The derivation chain is self-contained against external benchmarks and does not rely on load-bearing self-citation or ansatz smuggling.
Axiom & Free-Parameter Ledger
free parameters (1)
- silence threshold theta
axioms (1)
- Domain assumption: regex patterns suffice to identify Facts, Skills, and Patterns in commit text across multiple languages.
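One way to probe this axiom is the cross-repository consistency check the authors promise in their rebuttal. The sketch below computes pairwise Jaccard overlap between repositories, using the set of pattern identifiers that produced at least one unit in each. The repository names, pattern identifiers, and helper functions are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def pairwise_overlap(fired_by_repo):
    """Map each unordered repository pair to the Jaccard overlap of its sets."""
    return {
        (r1, r2): jaccard(fired_by_repo[r1], fired_by_repo[r2])
        for r1, r2 in combinations(sorted(fired_by_repo), 2)
    }


# Toy data: which hypothetical patterns fired in each repository.
fired = {
    "repo_py": {"because", "workaround", "always"},
    "repo_js": {"because", "always"},
    "repo_c": {"because", "must"},
}
overlap = pairwise_overlap(fired)
```

Low overlap between repositories in different languages would suggest the patterns are less general than the axiom assumes.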
Reference graph
Works this paper leans on
- [1] K. Yang, M. Galley, C. Wang, J. Gao, J. Han, and C. Zhai, “PlugMem: A Task-Agnostic Plugin Memory Module for LLM Agents,” Microsoft Research, March 2026. [Online]. Available: https://www.microsoft.com/en-us/research/publication/plugmem-a-task-agnostic-plugin-memory-module-for-llm-agents/
- [2] C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez, “MemGPT: Towards LLMs as Operating Systems,” arXiv:2310.08560, 2023.
- [3] J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative Agents: Interactive Simulacra of Human Behavior,” in Proc. ACM Symp. on User Interface Software and Technology (UIST), 2023, pp. 1–22.
- [4] T. R. Sumers, S. Yao, K. Narasimhan, and T. L. Griffiths, “Cognitive Architectures for Language Agents,” Transactions on Machine Learning Research (TMLR), 2024.
- [5] Z. Zhang, X. Bo, C. Ma, R. Li, X. Chen, Q. Dai, J. Zhu, Z. Dong, and J.-R. Wen, “A Survey on the Memory Mechanism of Large Language Model based Agents,” arXiv:2404.13501, 2024.
- [6] P. Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” in Adv. Neural Inf. Process. Syst. (NeurIPS), 2020.
- [7] A. Ziegler et al., “Productivity Assessment of Neural Code Completion,” in Proc. ACM SIGPLAN Int. Symp. Machine Programming (MAPS), 2022.
- [8] GitHub, “GitHub Copilot Chat,” 2024. [Online]. Available: https://github.com/features/copilot
- [9] Anysphere, “Cursor: The AI Code Editor,” 2024. [Online]. Available: https://cursor.com
- [10] Sourcegraph, “Cody: AI Coding Assistant for the Enterprise,” 2024. [Online]. Available: https://sourcegraph.com/cody
- [11] M. Chen et al., “Evaluating Large Language Models Trained on Code,” arXiv:2107.03374, 2021.
- [12] T. Zimmermann, A. Zeller, P. Weissgerber, and S. Diehl, “Mining Version Histories to Guide Software Changes,” IEEE Trans. Softw. Eng., vol. 31, no. 6, pp. 429–445, 2005.
- [13] E. C. Campos and M. de A. Maia, “Discovering Common Bug-Fix Patterns: A Large-Scale Observational Study,” J. Softw. Evol. Process, vol. 31, no. 7, 2019.
- [14] G. Melo, T. Oliveira, P. Alencar, and D. Cowan, “Knowledge Reuse in Software Projects: Retrieving Software Development Q&A Posts Based on Project Task Similarity,” PLoS ONE, vol. 15, no. 12, e0243852, 2020.
- [15] M. Fischer, M. Pinzger, and H. Gall, “Populating a Release History Database from Version Control and Bug Tracking Systems,” in Proc. IEEE Int. Conf. Softw. Maintenance (ICSM), 2003, pp. 23–32.
- [16] L. P. Hattori and M. Lanza, “On the Nature of Commits,” in Proc. ASE Workshop on Mining Software Repositories, 2008, pp. 63–71.
- [17] C. Treude and M. P. Robillard, “Augmenting API Documentation with Insights from Stack Overflow,” in Proc. Int. Conf. Softw. Eng. (ICSE), 2016, pp. 392–403.
- [18] R. J. Wieringa, Design Science Methodology for Information Systems and Software Engineering. Springer, 2014.
- [19] C. Wohlin et al., Experimentation in Software Engineering. Springer, 2012.
- [20] J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, vol. 33, no. 1, pp. 159–174, 1977.
- [21] J. Śliwerski, T. Zimmermann, and A. Zeller, “When do changes induce fixes?,” in Proc. Int. Workshop on Mining Software Repositories (MSR), 2005, pp. 1–5.
- [22] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” in Advances in Neural Information Processing Systems (NeurIPS), 2023.