MEMRES: A Memory-Augmented Resolver with Confidence Cascade for Agentic Python Dependency Resolution

Dao Sy Duy Minh; Nguyen Lam Phu Quy; Pham Phu Hoa; Tran Chi Nguyen; Trung Kiet Huynh; Vu Nguyen

arxiv: 2604.16941 · v1 · submitted 2026-04-18 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-10 06:37 UTC · model grok-4.3

keywords Python dependency resolution · LLM agents · memory augmentation · confidence cascade · import errors · error pattern knowledge base · agentic systems · code execution

The pith

MEMRES resolves Python dependency issues at 86.6 percent success by layering memory, patterns, and heuristics before calling the LLM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MEMRES as an agentic system that resolves Python import and dependency errors through a multi-level confidence cascade. Most cases are handled by self-evolving memory that reuses past fixes, a knowledge base of common error patterns, semantic analysis of imports, and a heuristic for Python 2 code. The LLM acts only as the final fallback. This design is evaluated on the HG2.9K benchmark of 2890 snippets using a 9B model, reaching an average success rate of 86.6 percent across ten runs. A reader would care because dependency errors frequently prevent code execution, and the cascade reduces dependence on large models or manual fixes.

Core claim

MEMRES is an agentic system for Python dependency resolution that introduces a multi-level confidence cascade where the LLM serves as the last resort. It combines a Self-Evolving Memory that accumulates reusable resolution patterns via tips and shortcuts, an Error Pattern Knowledge Base with 200+ curated import-to-package mappings, a Semantic Import Analyzer, and a Python 2 heuristic detector. On the HG2.9K benchmark using Gemma-2 9B, the system resolves 2503 of 2890 snippets (86.6 percent, averaged over ten runs), exceeding the 54.7 percent overall success rate reported for the PLLM baseline.

What carries the argument

The multi-level confidence cascade that routes simple cases to memory and curated patterns, reserving the LLM for uncertain instances only.
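
The routing idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Resolution` type, the `run_cascade` function, the 0.8 threshold, and the three stub levels are all hypothetical names and values chosen for the example; the paper does not publish its thresholds or interfaces here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Resolution:
    packages: List[str]  # pip package names proposed for the snippet
    confidence: float    # 0.0 to 1.0, how much the level trusts its answer
    source: str          # which level produced the answer


# A level inspects a snippet and either proposes a Resolution or abstains (None).
Level = Callable[[str], Optional[Resolution]]


def run_cascade(snippet: str, levels: Sequence[Level],
                llm_fallback: Level, threshold: float = 0.8) -> Optional[Resolution]:
    """Return the first sufficiently confident answer; otherwise ask the LLM."""
    for level in levels:
        result = level(snippet)
        if result is not None and result.confidence >= threshold:
            return result
    return llm_fallback(snippet)


# Hypothetical stand-ins for the real components.
llm_calls: List[str] = []  # records every snippet that reached the LLM


def memory_level(snippet: str) -> Optional[Resolution]:
    return None  # empty memory: always abstains in this demo


def kb_level(snippet: str) -> Optional[Resolution]:
    if "import yaml" in snippet:
        return Resolution(["PyYAML"], 0.95, "kb")
    return None


def llm_level(snippet: str) -> Optional[Resolution]:
    llm_calls.append(snippet)
    return Resolution([], 0.5, "llm")


easy = run_cascade("import yaml", [memory_level, kb_level], llm_level)
hard = run_cascade("import obscure_lib", [memory_level, kb_level], llm_level)
print(easy.source, hard.source, len(llm_calls))  # kb llm 1
```

Only the second snippet reaches the stub LLM, which is the cost-saving behaviour the cascade is meant to deliver.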

If this is right

Smaller LLMs can reach high success rates on dependency tasks when supported by memory and rules.
Intra-session memory accumulation improves resolution of related issues within one context.
The approach lowers the frequency of expensive LLM calls for routine import problems.
Heuristics targeting Python 2 address one of the largest categories of prior failures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The memory and pattern base could be shared or updated across users to cover emerging packages.
Similar cascade designs might transfer to dependency resolution in other languages like JavaScript or Rust.
Integration with package managers could allow the system to apply fixes automatically after resolution.

Load-bearing premise

The self-evolving memory and curated error mappings will continue to handle new and unseen Python projects without major manual updates or overfitting to the test set.

What would settle it

Running MEMRES on a fresh collection of real-world Python codebases and repositories outside the HG2.9K distribution and measuring whether the success rate remains near 86 percent.

Figures

Figures reproduced from arXiv: 2604.16941 by Dao Sy Duy Minh, Nguyen Lam Phu Quy, Pham Phu Hoa, Tran Chi Nguyen, Trung Kiet Huynh, Vu Nguyen.

**Figure 1.** MemRes pipeline. Each component exposes deterministic fast paths that bypass the LLM.

2.1 Intra-Session Memory. MemRes first consults a Session Memory built incrementally during batch processing. Since PLLM evaluates sequentially within a single session, we emulate this by caching solutions proven successful for earlier code blocks to resolve near-duplicate datasets (e.g., identical GitHub forks). If a new …
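
A session cache of this kind might look like the sketch below. It is an assumption-laden illustration: the `SessionMemory` class, the `fingerprint` helper, and the choice of a whitespace-normalized SHA-256 hash as the duplicate key are all hypothetical; the excerpt does not say how MemRes decides that two snippets are near-duplicates.

```python
import hashlib
from typing import Dict, List, Optional


def fingerprint(snippet: str) -> str:
    """Hash the snippet after dropping blank lines and trailing whitespace,
    so trivially reformatted copies (e.g. forks) map to the same key."""
    lines = [line.rstrip() for line in snippet.splitlines()]
    normalized = "\n".join(line for line in lines if line)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SessionMemory:
    """In-memory cache of fixes that already worked earlier in the session."""

    def __init__(self) -> None:
        self._solutions: Dict[str, List[str]] = {}

    def remember(self, snippet: str, packages: List[str]) -> None:
        # Call this only after the fix has been verified to work.
        self._solutions[fingerprint(snippet)] = list(packages)

    def recall(self, snippet: str) -> Optional[List[str]]:
        return self._solutions.get(fingerprint(snippet))


memory = SessionMemory()
memory.remember("import yaml\nprint(1)\n", ["PyYAML"])

fork_hit = memory.recall("import yaml   \n\nprint(1)")  # same code, different spacing
unseen = memory.recall("import requests\n")
```

An exact-hash key only catches copies that differ in whitespace; a real system would likely need a looser similarity measure, which is left out here.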

Original abstract

We present MEMRES, an agentic system for Python dependency resolution that introduces a multi-level confidence cascade where the LLM serves as the last resort. Our system combines: (1) a Self-Evolving Memory that accumulates reusable resolution patterns via tips and shortcuts; (2) an Error Pattern Knowledge Base with 200+ curated import-to-package mappings; (3) a Semantic Import Analyzer; and (4) a Python 2 heuristic detector resolving the largest failure category. On HG2.9K using Gemma-2 9B (10 GB VRAM), MEMRES resolves 2503 of 2890 (86.6%, 10-run average) snippets, combining intra-session memory with our confidence cascade for the remainder. This already exceeds PLLM's 54.7% overall success rate by a wide margin.
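
The headline figures can be checked with a few lines of arithmetic. The snippet uses only the numbers quoted in the abstract; the variable names are illustrative.

```python
resolved, total = 2503, 2890   # snippets resolved / benchmark size, as reported
pllm_rate = 54.7               # PLLM overall success rate in percent, as reported

memres_rate = 100 * resolved / total
margin = memres_rate - pllm_rate

print(f"{memres_rate:.1f}% (+{margin:.1f} points over PLLM)")
# prints: 86.6% (+31.9 points over PLLM)
```

The "wide margin" in the abstract is therefore about 31.9 percentage points.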

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MEMRES layers memory, a curated KB, and a cascade to cut LLM calls in Python dependency resolution and reports a big jump to 86.6% on HG2.9K, but the evaluation still needs clearer proof that the KB did not leak test-set patterns.

The headline result is that MEMRES reaches 86.6% success on 2890 snippets using Gemma-2 9B by trying self-evolving memory and a 200-plus entry error KB first, then a semantic analyzer and Python 2 heuristic, before the model. That is a clear step up from the 54.7% reported for PLLM on the same set, and the cascade design itself is a reasonable way to keep a small model in the loop for a repetitive task like import fixing.

Referee Report

2 major / 2 minor

Summary. The manuscript presents MEMRES, an agentic system for Python dependency resolution that augments an LLM with a self-evolving memory for reusable patterns, a curated Error Pattern Knowledge Base of 200+ import-to-package mappings, a Semantic Import Analyzer, and a Python 2 heuristic detector. It employs a multi-level confidence cascade that uses the LLM only as a last resort. On the HG2.9K dataset with Gemma-2 9B, the system is reported to resolve 2503 of 2890 snippets (86.6% average over 10 runs), substantially exceeding the PLLM baseline of 54.7%.

Significance. If the performance gains prove reproducible and attributable to the architecture rather than test-set leakage, the work could meaningfully advance practical agentic systems for software engineering by demonstrating how targeted memory and heuristics can reduce LLM reliance for common dependency errors. The approach offers a concrete path toward lower-cost, higher-reliability resolution in resource-constrained settings (10 GB VRAM).

major comments (2)

[Abstract] Abstract: the headline claim of 86.6% success (2503/2890) on HG2.9K is presented without any description of dataset construction, baseline re-implementations, error categorization, statistical significance, or variance across runs. This absence makes it impossible to assess whether the reported margin over PLLM's 54.7% is robust or an artifact of experimental setup.
[Error Pattern Knowledge Base] Error Pattern Knowledge Base (system description): the 200+ curated import-to-package mappings are load-bearing for the intra-session memory and cascade results, yet the manuscript supplies no information on their provenance, construction process, or independence from the HG2.9K distribution. Without explicit confirmation that the KB was built on held-out or external data, the performance delta cannot be confidently attributed to the proposed architecture rather than leakage or overfitting.

minor comments (2)

[Abstract] The abstract states that the system 'combines intra-session memory with our confidence cascade for the remainder' but does not define the cascade thresholds, ordering of components, or failure modes that trigger each level.
Terms such as 'Self-Evolving Memory' and 'Semantic Import Analyzer' are introduced without pseudocode, algorithmic outlines, or concrete examples of how they interact during a resolution session.
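
As an indication of the kind of concrete example the referee is asking for, an import-to-package lookup could be sketched as follows. Everything here is hypothetical: the `IMPORT_TO_PACKAGE` table, `top_level_imports`, and `suggest_packages` are illustrative names, and the five entries are well-known cases where the import name differs from the PyPI distribution name, not entries taken from the paper's knowledge base.

```python
import ast
from typing import Dict, List, Set

# Well-known mismatches between import name and PyPI distribution name.
# Illustrative only; the paper's 200+ entries are not listed in this summary.
IMPORT_TO_PACKAGE: Dict[str, str] = {
    "cv2": "opencv-python",
    "yaml": "PyYAML",
    "PIL": "Pillow",
    "sklearn": "scikit-learn",
    "bs4": "beautifulsoup4",
}


def top_level_imports(source: str) -> Set[str]:
    """Collect the top-level module names imported by Python 3 source code."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 means a relative import, which needs no package install
            if node.module and node.level == 0:
                names.add(node.module.split(".")[0])
    return names


def suggest_packages(source: str) -> List[str]:
    """Map imports to package names for the entries the table knows about."""
    found = top_level_imports(source)
    return sorted(IMPORT_TO_PACKAGE[name] for name in found if name in IMPORT_TO_PACKAGE)


example = "import cv2\nfrom PIL import Image\nimport os\n"
suggestions = suggest_packages(example)
print(suggestions)  # ['Pillow', 'opencv-python']
```

Note that `ast.parse` rejects Python 2 syntax, so a lookup like this would have to run after, or alongside, a Python 2 check.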

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. The comments highlight important areas for improving clarity and reproducibility, and we address each point below with specific plans for revision.

Referee: [Abstract] Abstract: the headline claim of 86.6% success (2503/2890) on HG2.9K is presented without any description of dataset construction, baseline re-implementations, error categorization, statistical significance, or variance across runs. This absence makes it impossible to assess whether the reported margin over PLLM's 54.7% is robust or an artifact of experimental setup.

Authors: We agree that the abstract, constrained by length, omits key contextual details that would help readers evaluate the results at a glance. The full manuscript already describes the HG2.9K dataset construction (Section 3), PLLM baseline re-implementation (Section 4), error categorization (Section 5), and reports the 10-run average with variance (Section 5). To directly address the concern, we will revise the abstract to include concise statements on dataset scale and origin, the multi-run averaging procedure, and a brief note on the performance margin's consistency. This change will make the headline claim more self-contained while preserving the abstract's focus on the core contribution.

revision: yes

Referee: [Error Pattern Knowledge Base] Error Pattern Knowledge Base (system description): the 200+ curated import-to-package mappings are load-bearing for the intra-session memory and cascade results, yet the manuscript supplies no information on their provenance, construction process, or independence from the HG2.9K distribution. Without explicit confirmation that the KB was built on held-out or external data, the performance delta cannot be confidently attributed to the proposed architecture rather than leakage or overfitting.

Authors: We acknowledge that the current manuscript does not provide sufficient detail on the Error Pattern Knowledge Base, which is necessary to substantiate that performance gains stem from the architecture rather than data overlap. The KB was constructed independently from external public sources, including Python standard library documentation, PyPI package metadata, and aggregated import-error patterns drawn from open community resources and forums, with curation completed prior to any exposure to the HG2.9K snippets. No test-set examples were used in its development. We will add a dedicated paragraph in the system description section that explicitly details the provenance, manual curation process, and verification steps confirming independence from the evaluation distribution.

revision: yes

Circularity Check

0 steps flagged

No circularity; empirical performance report with no derivations or self-referential claims

The paper describes an agentic system and reports direct empirical success rates (2503/2890 on HG2.9K) without any equations, derivations, fitted parameters, or mathematical predictions. The listed components (Self-Evolving Memory, Error Pattern Knowledge Base, etc.) are architectural elements whose performance is measured, not quantities derived from themselves by construction. No self-citation chains, ansatzes, or uniqueness theorems appear as load-bearing steps. Concerns about KB curation or generalization are validity issues, not circularity per the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 4 invented entities

The central claim rests on the unverified effectiveness of four invented components and the representativeness of the HG2.9K benchmark; no free parameters or mathematical axioms are stated.

axioms (1)

domain assumption: Python dependency resolution failures can be systematically categorized and addressed via memory, curated mappings, semantic analysis, and heuristics before LLM fallback.
Implicit foundation for the confidence cascade design.

invented entities (4)

Self-Evolving Memory · no independent evidence
purpose: Accumulates reusable resolution patterns via tips and shortcuts
New component introduced to store intra-session knowledge.

Error Pattern Knowledge Base · no independent evidence
purpose: Provides 200+ curated import-to-package mappings
Curated resource for common errors.

Semantic Import Analyzer · no independent evidence
purpose: Analyzes imports semantically to aid resolution
New analysis module.

Python 2 heuristic detector · no independent evidence
purpose: Resolves the largest failure category
Specialized heuristic for legacy code.
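
To give a sense of what a Python 2 heuristic could involve, here is a minimal sketch. It is not the paper's detector: `looks_like_python2` and both pattern lists are hypothetical, and the rule (Python 3 parse failure plus a Python 2 syntax idiom, or use of a Python 2-only name) is one plausible design among many.

```python
import ast
import re

# Statement forms that are valid in Python 2 but are syntax errors in Python 3.
PY2_SYNTAX = [
    re.compile(r"^\s*print\s+[^(\s=]", re.MULTILINE),   # print "text"
    re.compile(r"^\s*raise\s+[\w.]+\s*,", re.MULTILINE),  # raise Error, "msg"
]

# Names that parse fine in Python 3 but only exist in Python 2.
PY2_NAMES = [
    re.compile(r"\b(raw_input|xrange)\s*\("),
    re.compile(r"^\s*(import|from)\s+(urllib2|ConfigParser|cPickle)\b", re.MULTILINE),
]


def parses_as_python3(source: str) -> bool:
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def looks_like_python2(source: str) -> bool:
    """Heuristic guess; may misfire on strings or comments that mention these idioms."""
    if any(p.search(source) for p in PY2_NAMES):
        return True
    if not parses_as_python3(source):
        return any(p.search(source) for p in PY2_SYNTAX)
    return False


legacy_print = looks_like_python2('print "hello"\n')
legacy_module = looks_like_python2("import urllib2\n")
modern = looks_like_python2('print("hello")\n')
```

A detector like this only flags the snippet; choosing a Python 2 interpreter or rewriting the code would be a separate step.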

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1] A. Bartlett, C. Liem, and A. Panichella. “The Last Dependency Crusade: Solving Python Dependency Conflicts with LLMs.” In Proc. of the IEEE/ACM Automated Software Engineering Workshop (ASEW), pp. 66–73, 2025.

[2] E. Horton and C. Parnin. “DockerizeMe: Automatic Inference of Environment Dependencies for Python Code Snippets.” In Proc. of the ACM/IEEE International Conference on Software Engineering (ICSE), pp. 328–338, 2019.

[3] Y. Jia, J. Han, J. Cao, Y. Zhou, and B. Xu. “An Empirical Study of Dependency Conflicts in the Python Ecosystem.” IEEE Trans. Softw. Eng., vol. 50, no. 8, pp. 2125–2140, 2024.

[4] Z. Wang, H. Xu, J. Wang, X. Zhang, M. Yan, J. Zhang, F. Huang, and H. Ji. “Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks.” In Proc. of the NeurIPS Workshop on Scaling Environments for Agents (SEA), 2025.

[5] N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao. “Reflexion: Language Agents with Verbal Reinforcement Learning.” In Proc. of the Conference on Neural Information Processing Systems (NeurIPS), 2023.

[6] Google DeepMind. “Gemma 2: Improving Open Language Models at a Practical Size.” Tech. Rep., Google DeepMind, 2024.

[7] V. Kravcenko. “pipreqs: Generate pip requirements.txt based on imports,” 2015. [Online]. Available: https://github.com/bndr/pipreqs

[8] M. Alhanahnah, Y. Boshmaf, and B. Baudry. “DepsRAG: Towards Managing Software Dependencies using LLMs.” In Proc. of the NeurIPS 2024 Workshop, 2024.

[9] H. Ye, W. Chen, W. Dou, G. Wu, and J. Wei. “Knowledge-Based Environment Dependency Inference for Python Programs.” In Proc. of the ACM/IEEE International Conference on Software Engineering (ICSE), pp. 1245–1256, 2022.

[10] W. Cheng, W. Hu, and X. Ma. “ReadPyE: Revisiting Knowledge-Based Inference of Python Runtime Environments.” IEEE Trans. Softw. Eng., vol. 50, no. 2, pp. 258–279, 2024.

[11] X. Jia, Y. Zhou, Y. Hussain, and W. Yang. “An Empirical Study on Python Library Dependency and Conflict Issues.” In Proc. of the IEEE International Conference on Software Quality, Reliability and Security (QRS), 2024.

[12] J. Du, Y. Liu, H. Guo, et al. “DependEval: Benchmarking LLMs for Repository Dependency Understanding.” In Proc. of Findings of ACL, 2025.

[13] E. Horton and C. Parnin. “V2: Fast Detection of Configuration Drift in Python.” In Proc. of the IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 814–819, 2019.

[14] R. Hu, C. Peng, X. Wang, J. Xu, and C. Gao. “Repo2Run: Automated Building Executable Environment for Code Repository at Scale.” In Proc. of the Conference on Neural Information Processing Systems (NeurIPS), 2025.
