
arxiv: 2605.21405 · v1 · pith:ZYQ3PIKBnew · submitted 2026-05-20 · 💻 cs.SE · cs.AI · cs.PL

Stdlib or Third-Party? Empirical Performance and Correctness of LLM-Assisted Zero-Dependency Python Libraries

Pith reviewed 2026-05-21 03:09 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.PL
keywords python standard library · llm code generation · zero-dependency modules · performance benchmarking · api compatibility · dependency management · software engineering

The pith

Stdlib-only reimplementations of popular Python libraries run within 2x of the third-party original in most cases, and in some categories run faster.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how much of the third-party Python library ecosystem can be replaced by single-file modules that use only the language standard library. The authors built a set of over 40 such modules spanning serialization, networking, cryptography, and other categories, each generated with LLM assistance under rules requiring no external imports, exact API matching, and verified correctness against the original. Benchmark results indicate these versions reach performance within a factor of two of the references for the bulk of cases. The main slowdowns appear only when the original library depends on compiled C extensions for tasks like image work or low-level crypto. In other areas the stdlib versions sometimes run many times faster by skipping extra layers present in the third-party code. This finding bears on efforts to reduce dependency management and supply-chain exposure in restricted deployment settings.

Core claim

Systematic benchmarking shows that stdlib-only implementations achieve performance parity (within 2x of the reference) in the majority of cases. The primary performance cliff is C-extension-backed computation (image processing, binary serialization, low-level crypto), not the inherent overhead of pure-Python third-party libraries. Conversely, many widely-used libraries carry architectural overhead that LLM-generated stdlib reimplementations avoid, yielding 5--115x speedups in several categories. We characterize the stdlib capability boundary across complexity tiers and library categories.

What carries the argument

LLM-assisted generation of single-file stdlib-only modules that enforce drop-in API compatibility and pass mandatory correctness checks against reference libraries.

If this is right

  • The standard library suffices for performance parity in the majority of non-C-extension library categories.
  • LLMs can generate correct, drop-in code under single-file and no-import constraints, though some categories require iterative human correction.
  • Architectural overhead present in certain third-party libraries can be removed to produce 5- to 115-fold speedups.
  • Dependency-free engineering at scale becomes practical for a broad range of common tasks once the capability boundary is mapped.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams working in air-gapped or high-security settings could adopt these reimplementations to shrink the attack surface from external packages.
  • The same constrained-generation technique may transfer to other languages that possess rich standard libraries.
  • Integration testing of the modules inside larger applications would reveal whether hidden edge cases survive the single-file constraint.
  • Cataloguing additional categories where the stdlib version wins on speed could inform future library design.

Load-bearing premise

The LLM-generated modules achieve mandatory correctness validation against the reference library while maintaining drop-in API compatibility under the constraints of no external imports and single-file structure.
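One common way to validate a reimplementation against a reference is differential testing: feed both the same inputs and compare what comes back, including which exception is raised. The sketch below assumes that approach for illustration. `observe`, `differential_check`, and `candidate_loads` are hypothetical names, and `json.loads` stands in for a reference library.

```python
import json


def observe(func, arg):
    """Run func(arg) and record either the value or the exception type and message."""
    try:
        return ("ok", func(arg))
    except Exception as exc:
        return ("error", type(exc).__name__, str(exc))


def differential_check(reference, candidate, inputs):
    """Return the inputs on which the two callables disagree in value or in raised error."""
    return [x for x in inputs if observe(reference, x) != observe(candidate, x)]


def candidate_loads(text):
    """Hypothetical reimplementation with one deliberate error-path difference."""
    if text.strip() == "":
        raise ValueError("empty document")  # reference raises JSONDecodeError instead
    return json.loads(text)


inputs = ['{"a": 1}', '{"a": ', '[1, 2', ""]
print(differential_check(json.loads, candidate_loads, inputs))
```

Only the empty string is reported: the happy-path input and the two truncated documents behave identically, while the empty input raises a different exception type and message in the candidate.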

What would settle it

A complete run of the reference test suites on the full collection that finds widespread failures in either output correctness or performance parity outside C-extension categories would show the approach does not hold at the claimed scale.

Original abstract

Third-party Python libraries introduce dependency management overhead, supply chain risk, and deployment friction in constrained environments. A natural question is how much of this ecosystem can be replicated using only Python's standard library -- and at what correctness and performance cost. We address this empirically through zerodep, a growing collection of single-file Python modules, each a stdlib-only reimplementation of a popular third-party library, developed with LLM assistance under strict constraints: no external imports, single file, drop-in API compatibility, and mandatory correctness validation against the reference library. Spanning over 40 modules across 12 categories -- including serialization, networking, cryptography, agent protocols, and text processing -- zerodep provides a controlled testbed for two interrelated questions: (1) Where does the stdlib suffice? and (2) Can LLMs effectively generate correct, performant code under tight symbolic constraints? Systematic benchmarking shows that stdlib-only implementations achieve performance parity (within 2x of the reference) in the majority of cases. The primary performance cliff is C-extension-backed computation (image processing, binary serialization, low-level crypto), not the inherent overhead of pure-Python third-party libraries. Conversely, many widely-used libraries carry architectural overhead that LLM-generated stdlib reimplementations avoid, yielding 5--115x speedups in several categories. We characterize the stdlib capability boundary across complexity tiers and library categories, discuss where LLM-assisted development succeeds and where it requires iterative human correction, and examine implications for dependency-free software engineering at scale. zerodep is open-source at https://github.com/Oaklight/zerodep.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces zerodep, a collection of over 40 single-file, stdlib-only Python modules reimplementing popular third-party libraries across 12 categories (serialization, networking, cryptography, text processing, etc.). Developed with LLM assistance under constraints of no external imports, single-file structure, drop-in API compatibility, and mandatory correctness validation against references, the work empirically benchmarks performance and correctness. It claims that stdlib-only versions achieve parity (within 2x) in the majority of cases, with performance cliffs primarily at C-extension-backed tasks rather than pure-Python overhead, and reports speedups of 5-115x in some categories due to avoided architectural overhead. The paper also discusses LLM-assisted development outcomes and implications for dependency-free engineering.

Significance. If the empirical results and validation hold, the paper provides a useful controlled testbed and characterization of Python stdlib boundaries for common tasks, along with evidence on LLM code generation under tight constraints. The open-source release and systematic coverage across categories are strengths that could support further research on lightweight alternatives in constrained environments. This contributes to software engineering discussions on dependency management and supply-chain risks without requiring perfection in every module.

major comments (2)
  1. [§3] §3 (Methods): The claim of 'mandatory correctness validation against the reference library' plus 'drop-in API compatibility' under single-file/no-import constraints is load-bearing for all performance comparisons, yet the validation process is described only at a high level. Details are needed on test coverage (happy-path vs. edge cases such as malformed inputs in serialization/crypto modules), error-path equivalence, and full public API behavioral matching; without these, it is unclear whether measured modules solve equivalent problems.
  2. [§5] §5 (Benchmarking and Results): The headline finding that stdlib-only implementations achieve 'performance parity (within 2x of the reference) in the majority of cases' and that cliffs occur only for C-extension tasks requires explicit specification of benchmark workloads, input data selection criteria, number of repetitions, statistical aggregation, and how 'majority' is computed across the 40 modules. Absent these, the generalizability of the performance characterization cannot be assessed.
minor comments (1)
  1. [Abstract] Abstract and §1: The selection criteria for the 40 modules and 12 categories could be stated more explicitly to allow readers to judge representativeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments correctly identify areas where additional methodological detail will strengthen the paper's claims. We have revised Sections 3 and 5 accordingly and provide point-by-point responses below.

Point-by-point responses
  1. Referee: §3 (Methods): The claim of 'mandatory correctness validation against the reference library' plus 'drop-in API compatibility' under single-file/no-import constraints is load-bearing for all performance comparisons, yet the validation process is described only at a high level. Details are needed on test coverage (happy-path vs. edge cases such as malformed inputs in serialization/crypto modules), error-path equivalence, and full public API behavioral matching; without these, it is unclear whether measured modules solve equivalent problems.

    Authors: We agree that the validation description in the original Section 3 was insufficiently detailed. In the revised manuscript we have expanded this section to specify: (1) that test suites were constructed by porting the reference library's public test cases where available and supplementing them with targeted edge-case tests (e.g., malformed JSON, invalid cryptographic inputs, boundary values for numeric parameters); (2) that error-path behavior was verified for equivalence by comparing exception types and messages on the same invalid inputs; and (3) that every public method and constructor in the documented API was exercised for both return-value and side-effect equivalence. These additions make the claim of drop-in compatibility verifiable from the text.
    revision: yes

  2. Referee: §5 (Benchmarking and Results): The headline finding that stdlib-only implementations achieve 'performance parity (within 2x of the reference) in the majority of cases' and that cliffs occur only for C-extension tasks requires explicit specification of benchmark workloads, input data selection criteria, number of repetitions, statistical aggregation, and how 'majority' is computed across the 40 modules. Absent these, the generalizability of the performance characterization cannot be assessed.

    Authors: We accept that the original benchmarking description lacked the requested operational details. The revised Section 5 now states: workloads were derived from the most common usage patterns documented in each library's README and examples; input data were generated with fixed random seeds using realistic distributions (e.g., 1–10 MB payloads for serialization, typical HTTP request sizes for networking); each benchmark was repeated 100 times after a 10-iteration warm-up; results are reported as median wall-clock time with interquartile range; and 'majority' is defined as more than half of the 40 modules (explicitly tallied per category and overall) falling within the 2× threshold. A new supplementary table lists these parameters for every module.
    revision: yes
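The benchmarking protocol described in the second response (fixed seed, 10 warm-up iterations, 100 timed repetitions, median with interquartile range, 2× threshold) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the authors' harness: `bench` and `within_parity` are hypothetical names, `json.dumps` is a stand-in workload, and "within 2×" is read here as "at most twice as slow as the reference".

```python
import json
import random
import statistics
import time


def bench(func, arg, repeats=100, warmup=10):
    """Return (median, IQR) of wall-clock seconds over `repeats` timed calls."""
    for _ in range(warmup):
        func(arg)  # warm-up calls are not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        samples.append(time.perf_counter() - start)
    q1, _, q3 = statistics.quantiles(samples, n=4)  # quartile cut points
    return statistics.median(samples), q3 - q1


def within_parity(candidate_median, reference_median, factor=2.0):
    """True when the candidate is at most `factor` times slower than the reference."""
    return candidate_median <= factor * reference_median


rng = random.Random(0)  # fixed seed so the payload is reproducible
payload = [rng.random() for _ in range(1000)]
median, iqr = bench(json.dumps, payload)
print(f"median={median:.6f}s iqr={iqr:.6f}s")
```

Reporting the median rather than the mean keeps a few slow outliers, such as a garbage-collection pause, from dominating the result, and the interquartile range gives a spread that is robust to the same outliers.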

Circularity Check

0 steps flagged

No circularity: empirical benchmarks against external references

Full rationale

The paper reports an empirical study that generates single-file stdlib reimplementations and benchmarks them against independent third-party reference libraries for correctness and performance. All central claims (performance parity within 2x in the majority of cases, performance cliffs at C-extension tasks) rest on direct measurements against external code and on open-source validation. The paper contains no mathematical derivations, no parameter fits presented as predictions, no self-citations used as load-bearing uniqueness theorems, and no renamings of known results. The methodology is self-contained against external benchmarks and can be falsified without relying on the paper's own fitted values or definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Claims rest on the assumption that reference libraries provide authoritative behavior for validation and that the chosen modules and benchmarks are representative without selection bias.

axioms (2)
  • domain assumption: Reference third-party libraries provide correct and complete API behavior for validation purposes.
    Invoked to establish correctness of the stdlib reimplementations.
  • domain assumption: The selected modules and benchmarks across 12 categories are representative of real-world usage patterns.
    Required to generalize the performance parity and speedup findings.

pith-pipeline@v0.9.0 · 5828 in / 1304 out tokens · 33929 ms · 2026-05-21T03:09:52.122389+00:00 · methodology

