Stdlib or Third-Party? Empirical Performance and Correctness of LLM-Assisted Zero-Dependency Python Libraries
Pith reviewed 2026-05-21 03:09 UTC · model grok-4.3
The pith
Stdlib-only reimplementations of popular Python libraries match or exceed third-party performance in most cases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Systematic benchmarking shows that stdlib-only implementations achieve performance parity (within 2x of the reference) in the majority of cases. The primary performance cliff is C-extension-backed computation (image processing, binary serialization, low-level crypto), not the inherent overhead of pure-Python third-party libraries. Conversely, many widely-used libraries carry architectural overhead that LLM-generated stdlib reimplementations avoid, yielding 5–115x speedups in several categories. We characterize the stdlib capability boundary across complexity tiers and library categories.
What carries the argument
LLM-assisted generation of single-file stdlib-only modules that enforce drop-in API compatibility and pass mandatory correctness checks against reference libraries.
If this is right
- The standard library suffices for performance parity in the majority of non-C-extension library categories.
- LLMs can generate correct, drop-in code under the single-file and no-external-import constraints, though some categories require iterative human correction.
- Architectural overhead present in certain third-party libraries can be removed to produce 5- to 115-fold speedups.
- Dependency-free engineering at scale becomes practical for a broad range of common tasks once the capability boundary is mapped.
Where Pith is reading between the lines
- Teams working in air-gapped or high-security settings could adopt these reimplementations to shrink the attack surface from external packages.
- The same constrained-generation technique may transfer to other languages that possess rich standard libraries.
- Integration testing of the modules inside larger applications would reveal whether hidden edge cases survive the single-file constraint.
- Cataloguing additional categories where the stdlib version wins on speed could inform future library design.
Load-bearing premise
The LLM-generated modules pass mandatory correctness validation against the reference library while maintaining drop-in API compatibility under the constraints of no external imports and single-file structure.
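One way validation against a reference library could be carried out is differential testing: feed the same inputs to both implementations and compare return values and exception types. The sketch below is an assumption about how this might look, not the paper's harness. The helpers `outcome` and `mismatches` are hypothetical, the stdlib `json.loads` stands in for a third-party reference, and `candidate_loads` is a deliberately flawed stand-in for a reimplementation.

```python
import json


def outcome(func, *args):
    """Summarise a call as ('ok', value) or ('error', exception type name)."""
    try:
        return ("ok", func(*args))
    except Exception as exc:
        return ("error", type(exc).__name__)


def mismatches(reference, candidate, inputs):
    """Return the inputs on which reference and candidate disagree."""
    return [x for x in inputs if outcome(reference, x) != outcome(candidate, x)]


def candidate_loads(text):
    """Hypothetical reimplementation with a seeded error-path bug."""
    if text.strip() == "":
        return None  # bug: the reference raises JSONDecodeError here
    return json.loads(text)


# Happy-path inputs mixed with malformed ones.
cases = ['{"a": 1}', "[1, 2, 3]", "", "{bad json}", "null"]
diffs = mismatches(json.loads, candidate_loads, cases)
print(diffs)
```

Comparing exception type names rather than only return values is what exposes the empty-string case: both functions agree on every well-formed input, so a happy-path-only suite would report full compatibility.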
What would settle it
A complete run of the reference test suites on the full collection that finds widespread failures in either output correctness or performance parity outside C-extension categories would show the approach does not hold at the claimed scale.
Original abstract
Third-party Python libraries introduce dependency management overhead, supply chain risk, and deployment friction in constrained environments. A natural question is how much of this ecosystem can be replicated using only Python's standard library, and at what correctness and performance cost. We address this empirically through zerodep, a growing collection of single-file Python modules, each a stdlib-only reimplementation of a popular third-party library, developed with LLM assistance under strict constraints: no external imports, single file, drop-in API compatibility, and mandatory correctness validation against the reference library. Spanning over 40 modules across 12 categories (including serialization, networking, cryptography, agent protocols, and text processing), zerodep provides a controlled testbed for two interrelated questions: (1) Where does the stdlib suffice? and (2) Can LLMs effectively generate correct, performant code under tight symbolic constraints? Systematic benchmarking shows that stdlib-only implementations achieve performance parity (within 2x of the reference) in the majority of cases. The primary performance cliff is C-extension-backed computation (image processing, binary serialization, low-level crypto), not the inherent overhead of pure-Python third-party libraries. Conversely, many widely-used libraries carry architectural overhead that LLM-generated stdlib reimplementations avoid, yielding 5–115x speedups in several categories. We characterize the stdlib capability boundary across complexity tiers and library categories, discuss where LLM-assisted development succeeds and where it requires iterative human correction, and examine implications for dependency-free software engineering at scale. zerodep is open-source at https://github.com/Oaklight/zerodep.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces zerodep, a collection of over 40 single-file, stdlib-only Python modules reimplementing popular third-party libraries across 12 categories (serialization, networking, cryptography, text processing, etc.). Developed with LLM assistance under constraints of no external imports, single-file structure, drop-in API compatibility, and mandatory correctness validation against references, the work empirically benchmarks performance and correctness. It claims that stdlib-only versions achieve parity (within 2x) in the majority of cases, with performance cliffs primarily at C-extension-backed tasks rather than pure-Python overhead, and reports speedups of 5-115x in some categories due to avoided architectural overhead. The paper also discusses LLM-assisted development outcomes and implications for dependency-free engineering.
Significance. If the empirical results and validation hold, the paper provides a useful controlled testbed and characterization of Python stdlib boundaries for common tasks, along with evidence on LLM code generation under tight constraints. The open-source release and systematic coverage across categories are strengths that could support further research on lightweight alternatives in constrained environments. This contributes to software engineering discussions on dependency management and supply-chain risks without requiring perfection in every module.
major comments (2)
- §3 (Methods): The claim of 'mandatory correctness validation against the reference library' plus 'drop-in API compatibility' under single-file/no-import constraints is load-bearing for all performance comparisons, yet the validation process is described only at a high level. Details are needed on test coverage (happy-path vs. edge cases such as malformed inputs in serialization/crypto modules), error-path equivalence, and full public API behavioral matching; without these, it is unclear whether measured modules solve equivalent problems.
- §5 (Benchmarking and Results): The headline finding that stdlib-only implementations achieve 'performance parity (within 2x of the reference) in the majority of cases' and that cliffs occur only for C-extension tasks requires explicit specification of benchmark workloads, input data selection criteria, number of repetitions, statistical aggregation, and how 'majority' is computed across the 40 modules. Absent these, the generalizability of the performance characterization cannot be assessed.
minor comments (1)
- Abstract and §1: The selection criteria for the 40 modules and 12 categories could be stated more explicitly to allow readers to judge representativeness.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The two major comments correctly identify areas where additional methodological detail will strengthen the paper's claims. We have revised Sections 3 and 5 accordingly and provide point-by-point responses below.
- Referee: §3 (Methods): The claim of 'mandatory correctness validation against the reference library' plus 'drop-in API compatibility' under single-file/no-import constraints is load-bearing for all performance comparisons, yet the validation process is described only at a high level. Details are needed on test coverage (happy-path vs. edge cases such as malformed inputs in serialization/crypto modules), error-path equivalence, and full public API behavioral matching; without these, it is unclear whether measured modules solve equivalent problems.
  Authors: We agree that the validation description in the original Section 3 was insufficiently detailed. In the revised manuscript we have expanded this section to specify: (1) that test suites were constructed by porting the reference library's public test cases where available and supplementing them with targeted edge-case tests (e.g., malformed JSON, invalid cryptographic inputs, boundary values for numeric parameters); (2) that error-path behavior was verified for equivalence by comparing exception types and messages on the same invalid inputs; and (3) that every public method and constructor in the documented API was exercised for both return-value and side-effect equivalence. These additions make the claim of drop-in compatibility verifiable from the text.
  Revision: yes
- Referee: §5 (Benchmarking and Results): The headline finding that stdlib-only implementations achieve 'performance parity (within 2x of the reference) in the majority of cases' and that cliffs occur only for C-extension tasks requires explicit specification of benchmark workloads, input data selection criteria, number of repetitions, statistical aggregation, and how 'majority' is computed across the 40 modules. Absent these, the generalizability of the performance characterization cannot be assessed.
  Authors: We accept that the original benchmarking description lacked the requested operational details. The revised Section 5 now states: workloads were derived from the most common usage patterns documented in each library's README and examples; input data were generated with fixed random seeds using realistic distributions (e.g., 1–10 MB payloads for serialization, typical HTTP request sizes for networking); each benchmark was repeated 100 times after a 10-iteration warm-up; results are reported as median wall-clock time with interquartile range; and 'majority' is defined as more than half of the 40 modules (explicitly tallied per category and overall) falling within the 2× threshold. A new supplementary table lists these parameters for every module.
  Revision: yes
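The benchmarking protocol described in the second response (10 warm-up iterations, 100 timed repetitions, median with interquartile range, majority defined as more than half of modules within 2×) could be sketched as follows. This is an illustrative assumption, not the authors' harness: the function names are hypothetical, "within 2×" is read here as the candidate's median being at most twice the reference's, and the module names and ratios in the tally are invented placeholders rather than results from the paper.

```python
import statistics
import time


def bench(func, *, warmup=10, repeats=100):
    """Return (median, iqr) wall-clock seconds for func() after warm-up runs."""
    for _ in range(warmup):
        func()  # warm caches; timings discarded
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    q1, _, q3 = statistics.quantiles(samples, n=4)
    return statistics.median(samples), q3 - q1


def within_parity(candidate_median, reference_median, threshold=2.0):
    """True if the candidate is at most `threshold` times slower."""
    return candidate_median <= threshold * reference_median


def majority_within_parity(ratios, threshold=2.0):
    """ratios maps module name -> candidate/reference median-time ratio."""
    hits = sum(1 for ratio in ratios.values() if ratio <= threshold)
    return hits > len(ratios) / 2


median_s, iqr_s = bench(lambda: sum(range(1000)), warmup=2, repeats=20)
print(f"median={median_s:.2e}s iqr={iqr_s:.2e}s")

# Placeholder ratios for hypothetical modules, for illustration only.
example_ratios = {"mod_a": 1.2, "mod_b": 0.5, "mod_c": 30.0}
print(majority_within_parity(example_ratios))
```

Reporting the median with the interquartile range rather than the mean keeps a few slow outlier runs (for example from garbage collection or scheduler noise) from dominating the comparison.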
Circularity Check
No circularity: empirical benchmarks against external references
full rationale
The paper reports an empirical study generating and benchmarking single-file stdlib reimplementations against independent third-party reference libraries for correctness and performance. All central claims (performance parity within 2x for the majority of cases, performance cliffs at C-extension tasks) rest on direct measurements against external code and open-source validation, with no mathematical derivations, parameter fits presented as predictions, self-citations as load-bearing uniqueness theorems, or renamings of known results. The methodology is self-contained against external benchmarks and falsifiable outside the paper's own fitted values or definitions.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Reference third-party libraries provide correct and complete API behavior for validation purposes.
- Domain assumption: The selected modules and benchmarks across 12 categories are representative of real-world usage patterns.