CommitDistill: A Lightweight Knowledge-Centric Memory Layer for Software Repositories

Aditya Aggarwal; Amith Tallanki; Divya Chukkapalli; Harsimran Singh; Thejesh Avula

arxiv: 2605.18284 · v1 · pith:QNCWI264new · submitted 2026-05-18 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-20 09:09 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords commit messages · knowledge extraction · memory layer · software repositories · TF-IDF retrieval · abstention · LLM agents · git history

The pith

CommitDistill extracts typed knowledge from git commit messages with regex and retrieves it at a 0.750 hit-rate under a 256-character per-query budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CommitDistill as a local, dependency-free prototype that turns unstructured git commit history into typed knowledge units labeled Facts, Skills, and Patterns. It mines these units with deterministic regex, stores them in plain JSON, and retrieves relevant ones with a TF-IDF index that abstains via a calibrated silence threshold. The central empirical result shows this approach reaching a 0.750 hit-rate on a 12-query benchmark at a tight 256-character per-query limit, compared with 0.333 for BM25 and 0.083 for git log --grep. A sympathetic reader would care because software repositories contain large amounts of historical knowledge that developers and LLM coding assistants currently fail to reuse effectively. The design choices emphasize inspectability, speed (under 4 seconds for 10,000 commits), and trust through abstention rather than always returning results.
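The "inspectable plain-JSON store" mentioned above can be pictured with a short sketch. The field names (`kind`, `text`, `commit`) and the helper names are hypothetical, chosen only for illustration; the actual schema of the prototype is not given on this page.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class KnowledgeUnit:
    kind: str    # "Fact", "Skill", or "Pattern" (the three types named in the paper)
    text: str    # the distilled sentence taken from a commit message
    commit: str  # hypothetical provenance field: the commit hash the unit came from


def dump_units(units):
    """Serialize units to human-readable JSON with stable key order."""
    return json.dumps([asdict(u) for u in units], indent=2, sort_keys=True)


def load_units(payload):
    """Rebuild units from the JSON text produced by dump_units."""
    return [KnowledgeUnit(**row) for row in json.loads(payload)]
```

Because the store is plain text, a round trip through `dump_units` and `load_units` returns equal objects, and the file can be read or diffed with ordinary tools.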

Core claim

CommitDistill mines a local git history into typed knowledge units using deterministic regex and surfaces them through a TF-IDF retriever with a calibrated silence threshold that abstains on out-of-distribution queries, delivering a 0.750 hit-rate at a 256-character per-query budget on a 12-query benchmark while baselines lag far behind.

What carries the argument

Typed knowledge units (Facts, Skills, Patterns) extracted by deterministic regex from commit messages and retrieved by TF-IDF with an abstention threshold at theta = 2.5.
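The mechanism described above can be sketched in a few dozen lines of standard-library Python. This is an illustration under stated assumptions, not the authors' implementation: the three regexes are invented stand-ins for the paper's nine heuristics, the smoothed IDF weighting is one common choice, and the default theta = 2.5 is borrowed from the paper even though the score scale here may differ from the paper's.

```python
import math
import re
from collections import Counter

# Hypothetical patterns; the paper's actual heuristics are not listed on this page.
PATTERNS = {
    "Fact": re.compile(r"\b(?:because|due to|caused by)\b", re.IGNORECASE),
    "Skill": re.compile(r"\b(?:fix(?:es|ed)?|workaround)\b", re.IGNORECASE),
    "Pattern": re.compile(r"\b(?:always|never|must|should)\b", re.IGNORECASE),
}


def extract_units(messages):
    """Emit one typed unit per commit message, using the first pattern that matches."""
    units = []
    for msg in messages:
        for kind, pattern in PATTERNS.items():
            if pattern.search(msg):
                units.append({"type": kind, "text": msg.strip()})
                break
    return units


def tokenize(text):
    return re.findall(r"[a-z0-9_]+", text.lower())


class TfidfIndex:
    def __init__(self, units):
        self.units = units
        self.docs = [Counter(tokenize(u["text"])) for u in units]
        n = len(self.docs)
        # Document frequency: number of units that contain each token.
        df = Counter(tok for doc in self.docs for tok in doc)
        # Smoothed IDF (an assumption; the paper's weighting is not specified here).
        self.idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

    def score(self, query, doc):
        return sum(doc[tok] * self.idf.get(tok, 0.0) for tok in set(tokenize(query)))

    def retrieve(self, query, theta=2.5):
        """Return the best-scoring unit, or None (abstain) if its score is below theta."""
        if not self.docs:
            return None
        scores = [self.score(query, doc) for doc in self.docs]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.units[best] if scores[best] >= theta else None
```

Returning `None` rather than a weak match is the abstention behavior: a query that shares no vocabulary with the store scores 0 and is answered with silence.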

If this is right

Developers and LLM agents gain a local, inspectable memory substrate that reuses commit history without embeddings or external services.
Extraction of 1,167 units from 25,000 commits across five repositories completes in seconds on ordinary hardware.
The abstention mechanism prevents low-confidence retrieval, which may reduce noisy outputs in agent-assisted coding tasks.
A four-arm evaluation on 200 time-travel bug fixes shows no statistically detectable lift over control, indicating the need for tighter integration with the downstream task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same regex-plus-TF-IDF pipeline could be applied to issue threads and pull-request discussions to enlarge the memory layer beyond commits.
Adding a lightweight update mechanism for new commits would turn the static extraction into an incremental, always-current store.
Measuring downstream code-change quality when CommitDistill is wired into an LLM agent would test whether the 0.750 hit-rate translates into measurable productivity gains.

Load-bearing premise

Deterministic regex patterns can reliably surface useful, non-redundant knowledge units from commit messages without significant manual tuning or domain-specific rules per repository.

What would settle it

Running the 12-query benchmark on a fresh collection of repositories and finding the 256-character hit-rate drops below 0.5 while noise in the extracted units rises sharply.
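Re-running that check requires a concrete rule for scoring a hit under a character budget. A minimal sketch follows, assuming a hit means the expected phrase appears (case-insensitively) within the first `budget` characters of the ranked output; the paper's own hit definition is not given on this page, and the function names are hypothetical.

```python
def truncate_to_budget(snippets, budget=256):
    """Join ranked snippets with newlines and keep only the first `budget` characters."""
    return "\n".join(snippets)[:budget]


def hit_rate(results, gold_phrases, budget=256):
    """Fraction of queries whose budgeted output contains the expected phrase.

    `results` holds one ranked list of snippets per query and `gold_phrases`
    holds one expected phrase per query. Substring matching is an assumption
    made for illustration only.
    """
    if not gold_phrases or len(results) != len(gold_phrases):
        raise ValueError("need one non-empty gold phrase list matching results")
    hits = 0
    for snippets, gold in zip(results, gold_phrases):
        text = truncate_to_budget(snippets, budget).lower()
        hits += gold.lower() in text
    return hits / len(gold_phrases)
```

Under this rule an abstention (an empty snippet list) counts as a miss, and a long result whose relevant text falls past the cutoff also counts as a miss, so the same retriever can score differently at 256 characters than at a larger budget.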

Figures

Figures reproduced from arXiv: 2605.18284 by Aditya Aggarwal, Amith Tallanki, Divya Chukkapalli, Harsimran Singh, Thejesh Avula.

Figure 2: Six of the nine extraction heuristics. Each pattern is associated with ... [image: figures/full_fig_p005_2.png]

Original abstract

Software repositories accumulate large amounts of unstructured knowledge in commit messages, pull-request discussions, and issue threads, but developers and AI coding assistants rarely reuse this history effectively. Recent work on typed-memory architectures for LLM agents (MemGPT, generative agents, and the PlugMem module of Yang et al.) argues that agent memory should be distilled, typed knowledge rather than raw interaction text. We adapt that stance to a software repository's own git history under a constrained regime: deterministic, dependency-free, local-only, no embeddings. We present CommitDistill, an open-source Python prototype that mines a local git history into typed knowledge units (Facts, Skills, Patterns) using deterministic regex and surfaces them through a TF-IDF retriever with a calibrated silence threshold (theta = 2.5) that abstains on out-of-distribution queries. The artefact is a trust-instrumented memory substrate: deterministic, no external service, inspectable plain-JSON store, tunable abstention. A case study on five public repositories spanning Python, JavaScript, C, and Java (25,000 commits, 1,167 extracted units) reports useful-precision 0.525 at Cohen's kappa = 0.633 on 40 dual-annotated Python units. The decisive finding is budget-constrained retrieval: at a 256-character per-query budget, CommitDistill reaches 0.750 hit-rate on a 12-query benchmark against BM25's 0.333 and git log --grep's 0.083. On a four-arm paired LLM-as-judge evaluation (n=200 time-travel bug-fixes, two judges) covering control, CommitDistill, a body-budget-matched CD-Hybrid, and BM25, no condition produces a statistically detectable lift over control on the headline mean and CD-Hybrid is indistinguishable from BM25 head-to-head. Extraction over 10,000 commits completes in under 4 seconds on a laptop. Source, annotations, baselines, and a reproducibility script accompany this paper.
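For readers unfamiliar with the agreement statistic quoted in the abstract, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies. The sketch below is a generic implementation, not the authors' annotation script; on the Landis and Koch scale cited by the paper, values between 0.61 and 0.80 are conventionally read as substantial agreement.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

For example, two annotators who agree on 8 of 10 binary labels, each using both labels half of the time, have observed agreement 0.8, chance agreement 0.5, and kappa 0.6.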

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The 0.750 hit-rate claim rests on a 12-query benchmark that is too small and underspecified, while the n=200 LLM-as-judge evaluation shows no statistically detectable lift over control.

The main thing to know is that CommitDistill's strongest reported result comes from a 12-query retrieval test where it reaches a 0.750 hit-rate at a 256-character budget, beating BM25 (0.333) and git log --grep (0.083). That number is hard to trust because the test set is tiny and is reported with no sampling details, no variance, and no clear definition of what counts as a hit. Their own larger four-arm study on 200 time-travel bug fixes finds no statistically detectable lift over control for any arm, CommitDistill and CD-Hybrid included, which undercuts the headline performance story.

Referee Report

3 major / 2 minor

Summary. The paper presents CommitDistill, a deterministic, local-only Python prototype that extracts typed knowledge units (Facts, Skills, Patterns) from git commit messages via regex patterns, stores them in plain JSON, and retrieves them with a TF-IDF retriever plus a calibrated silence threshold (theta=2.5). On five public repositories (25k commits, 1,167 units), it reports useful-precision 0.525 and Cohen's kappa 0.633 on 40 dual-annotated units. The central empirical claim is a 0.750 hit-rate at 256-character budget on a 12-query benchmark, outperforming BM25 (0.333) and git log --grep (0.083). A separate four-arm LLM-as-judge study (n=200) finds no statistically detectable lift over control.

Significance. If the performance claims can be substantiated with adequate statistical support, the work provides a practical, inspectable, dependency-free memory substrate for repository-aware AI coding tools. Strengths include the emphasis on determinism, reproducibility (source, annotations, and script provided), and local execution without embeddings or external services. These align with needs for verifiable agent memory in software engineering.

major comments (3)

[Evaluation] The 0.750 hit-rate claim on the 12-query benchmark (Evaluation section) lacks query sampling procedure, hit definition, variance estimates, or error bars. With N=12 this provides negligible statistical power; the n=200 LLM-as-judge arm showing no detectable lift over control indicates the small-benchmark result may reflect selection or noise rather than reliable superiority over BM25 and git log --grep.
[Method] The core assumption that deterministic regex patterns reliably surface useful, non-redundant knowledge units without significant manual tuning or repository-specific rules is load-bearing for the extraction pipeline but receives limited validation; the paper should report ablation or cross-repository consistency metrics for the patterns used to produce the 1,167 units.
[Retrieval] The silence threshold (theta = 2.5) is presented as calibrated yet no details are given on the calibration procedure, sensitivity analysis, or how abstention affects the reported hit-rate and useful-precision; this directly impacts the abstention behavior on out-of-distribution queries.
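The statistical-power concern in the first comment can be made concrete with a Wilson score interval, a standard interval for a proportion that behaves reasonably at small N. The counts used in the usage note (9, 4, and 1 hits out of 12) are inferred from the reported rates 0.750, 0.333, and 0.083; they are not stated in the source.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

With 9 of 12 hits the interval runs from roughly 0.47 to 0.91, and with 4 of 12 from roughly 0.14 to 0.61, so the two ranges overlap. Because all systems are scored on the same 12 queries, a paired analysis would be the more appropriate formal test; the intervals only illustrate how wide the uncertainty is at this sample size.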

minor comments (2)

[Abstract] Clarify in the abstract and results how the 12-query benchmark was constructed relative to the 200 time-travel bug-fix cases used in the LLM-as-judge study.
[Evaluation] The four-arm paired design is a positive; consider reporting per-arm means with confidence intervals and the exact statistical test used to conclude 'no detectable lift'.
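One way to report the "exact statistical test" requested above is a paired sign-flip permutation test on per-case score differences. The sketch below is a generic illustration and is not the test used in the paper, which this page does not name; the function name and the per-case judge-score inputs are hypothetical.

```python
import random


def paired_signflip_test(treatment, control, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired scores.

    Returns (mean_difference, p_value). Under the null hypothesis of no lift,
    each paired difference is equally likely to be positive or negative.
    """
    if len(treatment) != len(control) or not treatment:
        raise ValueError("need equally sized, non-empty paired samples")
    diffs = [t - c for t, c in zip(treatment, control)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    observed = abs(mean_diff)
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    extreme = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return mean_diff, (extreme + 1) / (n_resamples + 1)
```

Applied to each arm against control over the 200 paired cases, this would yield a mean difference and a p-value per arm, which is the kind of per-arm reporting the comment asks for.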

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the presentation of our evaluation, extraction method, and retrieval components. We respond to each major comment below and indicate the revisions we will incorporate.

Referee: [Evaluation] The 0.750 hit-rate claim on the 12-query benchmark (Evaluation section) lacks query sampling procedure, hit definition, variance estimates, or error bars. With N=12 this provides negligible statistical power; the n=200 LLM-as-judge arm showing no detectable lift over control indicates the small-benchmark result may reflect selection or noise rather than reliable superiority over BM25 and git log --grep.

Authors: We agree that the 12-query benchmark has limited statistical power and that the n=200 LLM-as-judge evaluation (which shows no detectable lift) is the more robust result. In the revision we will add: (1) the query sampling procedure (queries were chosen to span bug fixes, feature additions, and refactoring across the five repositories), (2) a precise definition of a hit (a retrieved unit whose content directly addresses the query intent), and (3) bootstrap confidence intervals for the hit-rate. We will also add an explicit caveat on sample size and emphasize that the small benchmark is illustrative rather than definitive. This constitutes a partial revision because we retain the original numbers while improving transparency. revision: partial
Referee: [Method] The core assumption that deterministic regex patterns reliably surface useful, non-redundant knowledge units without significant manual tuning or repository-specific rules is load-bearing for the extraction pipeline but receives limited validation; the paper should report ablation or cross-repository consistency metrics for the patterns used to produce the 1,167 units.

Authors: We acknowledge that additional validation of the regex patterns would strengthen the claims. Although the patterns were intentionally kept general and deterministic, we will add in the revised manuscript an ablation study that removes each pattern category in turn and reports the resulting change in unit count and downstream retrieval metrics. We will also report cross-repository consistency (e.g., Jaccard overlap of extracted units and type distribution across the five repositories). These analyses will be presented in a new subsection of the Method section. revision: yes
Referee: [Retrieval] The silence threshold (theta = 2.5) is presented as calibrated yet no details are given on the calibration procedure, sensitivity analysis, or how abstention affects the reported hit-rate and useful-precision; this directly impacts the abstention behavior on out-of-distribution queries.

Authors: We agree that the calibration details for theta = 2.5 were insufficient. In the revision we will describe the calibration procedure (performed on a held-out development set of queries to balance precision against abstention rate), include a sensitivity analysis over theta values from 1.0 to 4.0 showing effects on hit-rate and useful-precision, and discuss how the threshold governs abstention on out-of-distribution queries. This will clarify the relationship between abstention and the reported metrics. revision: yes
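The sensitivity analysis promised in the last response could take the following shape. This is a sketch under assumptions: it supposes that each development query has a top retrieval score and a relevance judgment for the top unit, and it reports abstention rate and precision among answered queries for each theta; the function name and input layout are hypothetical.

```python
def sweep_threshold(top_scores, is_relevant, thetas):
    """Report abstention rate and answered-query precision for each theta.

    top_scores[i] is the best retrieval score for query i, and is_relevant[i]
    says whether that best unit was judged relevant.
    """
    if len(top_scores) != len(is_relevant) or not top_scores:
        raise ValueError("need one relevance judgment per scored query")
    rows = []
    for theta in thetas:
        # The system answers only when the top score reaches the threshold.
        answered = [rel for score, rel in zip(top_scores, is_relevant) if score >= theta]
        abstain_rate = 1 - len(answered) / len(top_scores)
        precision = sum(answered) / len(answered) if answered else None
        rows.append({"theta": theta, "abstain_rate": abstain_rate, "precision": precision})
    return rows


# Grid over the range named in the rebuttal: 1.0, 1.5, ..., 4.0.
THETA_GRID = [1.0 + 0.5 * i for i in range(7)]
```

Raising theta trades coverage for precision: more queries are met with silence, and the answers that remain are the higher-scoring ones. Plotting both columns against theta would show where 2.5 sits on that curve.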

Circularity Check

0 steps flagged

No significant circularity; empirical system with external baselines

The paper describes a deterministic regex-based extraction pipeline and TF-IDF retriever with a fixed threshold, evaluated against external baselines (BM25, git log --grep) and independent LLM-as-judge and dual-annotation protocols. No equations, predictions, or first-principles claims reduce to fitted parameters or self-citations by construction. The 12-query benchmark and n=200 arm are presented as direct measurements rather than derived outputs. The derivation chain is self-contained against external benchmarks and does not rely on load-bearing self-citation or ansatz smuggling.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the premise that regex patterns can extract semantically useful units and that TF-IDF with a fixed threshold provides reliable abstention. No new physical entities are postulated.

free parameters (1)

silence threshold theta
Set to 2.5 and described as calibrated; directly controls when the system abstains on out-of-distribution queries.

axioms (1)

domain assumption: Regex patterns suffice to identify Facts, Skills, and Patterns in commit text across multiple languages.
Invoked when the prototype mines 1,167 units from 25,000 commits.

pith-pipeline@v0.9.0 · 5923 in / 1358 out tokens · 38135 ms · 2026-05-20T09:09:07.718044+00:00 · methodology

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 3 internal anchors

[1] K. Yang, M. Galley, C. Wang, J. Gao, J. Han, and C. Zhai, “PlugMem: A Task-Agnostic Plugin Memory Module for LLM Agents,” Microsoft Research, March 2026. [Online]. Available: https://www.microsoft.com/en-us/research/publication/plugmem-a-task-agnostic-plugin-memory-module-for-llm-agents/

[2] C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez, “MemGPT: Towards LLMs as Operating Systems,” arXiv:2310.08560, 2023.

[3] J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative Agents: Interactive Simulacra of Human Behavior,” in Proc. ACM Symp. on User Interface Software and Technology (UIST), 2023, pp. 1–22.

[4] T. R. Sumers, S. Yao, K. Narasimhan, and T. L. Griffiths, “Cognitive Architectures for Language Agents,” Transactions on Machine Learning Research (TMLR), 2024.

[5] Z. Zhang, X. Bo, C. Ma, R. Li, X. Chen, Q. Dai, J. Zhu, Z. Dong, and J.-R. Wen, “A Survey on the Memory Mechanism of Large Language Model based Agents,” arXiv:2404.13501, 2024.

[6] P. Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” in Adv. Neural Inf. Process. Syst. (NeurIPS), 2020.

[7] A. Ziegler et al., “Productivity Assessment of Neural Code Completion,” in Proc. ACM SIGPLAN Int. Symp. Machine Programming (MAPS), 2022.

[8] GitHub, “GitHub Copilot Chat,” 2024. [Online]. Available: https://github.com/features/copilot

[9] Anysphere, “Cursor: The AI Code Editor,” 2024. [Online]. Available: https://cursor.com

[10] Sourcegraph, “Cody: AI Coding Assistant for the Enterprise,” 2024. [Online]. Available: https://sourcegraph.com/cody

[11] M. Chen et al., “Evaluating Large Language Models Trained on Code,” arXiv:2107.03374, 2021.

[12] T. Zimmermann, A. Zeller, P. Weissgerber, and S. Diehl, “Mining Version Histories to Guide Software Changes,” IEEE Trans. Softw. Eng., vol. 31, no. 6, pp. 429–445, 2005.

[13] E. C. Campos and M. de A. Maia, “Discovering Common Bug-Fix Patterns: A Large-Scale Observational Study,” J. Softw. Evol. Process, vol. 31, no. 7, 2019.

[14] G. Melo, T. Oliveira, P. Alencar, and D. Cowan, “Knowledge Reuse in Software Projects: Retrieving Software Development Q&A Posts Based on Project Task Similarity,” PLoS ONE, vol. 15, no. 12, e0243852, 2020.

[15] M. Fischer, M. Pinzger, and H. Gall, “Populating a Release History Database from Version Control and Bug Tracking Systems,” in Proc. IEEE Int. Conf. Softw. Maintenance (ICSM), 2003, pp. 23–32.

[16] L. P. Hattori and M. Lanza, “On the Nature of Commits,” in Proc. ASE Workshop on Mining Software Repositories, 2008, pp. 63–71.

[17] C. Treude and M. P. Robillard, “Augmenting API Documentation with Insights from Stack Overflow,” in Proc. Int. Conf. Softw. Eng. (ICSE), 2016, pp. 392–403.

[18] R. J. Wieringa, Design Science Methodology for Information Systems and Software Engineering. Springer, 2014.

[19] C. Wohlin et al., Experimentation in Software Engineering. Springer, 2012.

[20] J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, vol. 33, no. 1, pp. 159–174, 1977.

[21] J. Śliwerski, T. Zimmermann, and A. Zeller, “When do changes induce fixes?,” in Proc. Int. Workshop on Mining Software Repositories (MSR), 2005, pp. 1–5.

[22] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” in Advances in Neural Information Processing Systems (NeurIPS), 2023.
