arxiv: 2603.16011 · v2 · pith:URXZU7UMnew · submitted 2026-03-16 · 💻 cs.SE · cs.AI · cs.CL

FormulaCode: Evaluating Agentic Optimization on Large Codebases

Pith reviewed 2026-05-21 10:38 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL
keywords LLM agents · code optimization · benchmark · repository-scale · multi-objective evaluation · scientific Python · GitHub repositories · performance workloads

The pith

FormulaCode shows that frontier LLM agents still struggle to optimize entire real-world codebases under multiple performance goals at once.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FormulaCode to test LLM coding agents on large code repositories rather than small synthetic problems. It collects 957 performance bottlenecks from actual scientific Python projects on GitHub, supplies expert-written fixes for each, and pairs them with hundreds of community test workloads that measure both correctness and speed. This setup lets researchers score agents on realistic, multi-objective optimization instead of binary pass-fail checks. Evaluations on current top agents indicate they rarely succeed at making the required improvements across the full set of tasks. If the benchmark holds, it points to a concrete gap between what agents can do on toy problems and what is needed for production-scale code improvement.

Core claim

FormulaCode comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and on average 264.6 community-maintained performance workloads per task, and evaluations on this benchmark reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents.

What carries the argument

The FormulaCode benchmark of 957 mined bottlenecks, each supplied with expert patches and hundreds of community workloads for multi-objective scoring.

If this is right

  • Agent designs must incorporate mechanisms that handle repository-wide changes while tracking multiple performance metrics simultaneously.
  • Future benchmarks should move away from single-objective or synthetic tasks toward real mined workloads with expert reference solutions.
  • Progress on agentic code optimization will require new methods for balancing correctness constraints against speed and resource goals at scale.
  • Developers relying on agents for performance work will continue to need substantial human oversight on large codebases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be extended to other languages or application domains by applying the same mining and workload collection process.
  • Success on FormulaCode tasks might serve as a training signal for fine-tuning agents specifically on performance-tuning behavior.
  • If agents improve here, the same evaluation approach could be reused to measure gains in related areas such as security hardening or memory reduction.

Load-bearing premise

The 957 mined bottlenecks, expert-authored patches, and average 264.6 community workloads per task accurately capture holistic optimization challenges under realistic correctness and performance constraints.

What would settle it

A frontier LLM agent that produces changes matching or exceeding the expert patches on correctness and performance metrics for a large majority of the 957 tasks across their respective community workloads would falsify the central claim.

Figures

Figures reproduced from arXiv: 2603.16011 by Akanksha Sarkar, Atharva Sehgal, Ishaan Mantripragada, James Hou, Jennifer J. Sun, Swarat Chaudhuri, Yisong Yue.

Figure 1: FORMULACODE is a continuously updating benchmark for evaluating the holistic ability of agents to optimize large codebases. Each task in FORMULACODE comprises a problem description of a performance regression from GitHub, an environment containing a baseline repository snapshot, and multiple expert-written crowdsourced performance workloads, along with the tools to execute them. An agent’s performance impr…
Figure 2: Overview of FORMULACODE construction pipeline. FORMULACODE follows a four stage pipeline to identify real-world performance optimization tasks. (1) Scrape compliant repositories (§A.1.1). (2) Apply rule-based and LLM-based filters to identify candidate performance improvement pull requests (§A.1.2). (3) Construct reproducible Docker environments for each candidate (§A.1.3). (4) Validate each candidate for …
Figure 3: Showing stratified advantage across hierarchy lev…
Figure 4: Cost-Performance tradeoff of agent-model con…
Figure 5: Multi-workload tradeoff performance of agent…
Figure 6: Distribution of tasks across repositories in…
Figure 7: Prompt template used by the LLM-based performance intent classifier described in…
Figure 8: Example of performance intent classification for a real pull request
Figure 9: Prompt template used by the LLM-based classifier for assigning each performance task an optimization category
Figure 10: Prompt structure for the docker build agent (…
Figure 11: Example application of the optimization type and difficulty classifier (Figure…
Figure 12: Structured DSPy prompt used to judge whether a pull request is primarily intended to improve runtime or product…
Figure 13: Structured DSPy prompt used to extract the underlying problem and resolution context from a GitHub pull request.
Figure 14: Overview of the pipeline for Docker environment synthesis. The system reuses chronologically adjacent build…
Figure 15: Timeline of FORMULACODE tasks organized by the date the expert-patch was merged till November, 2025. Each box represents the number of expert-patch tasks merged during a particular month/year. FORMULACODE is updated on the 31st of each month, and our most recent task is from 2025-11-21. The dataset grows by 20.25 tasks per month on average, facilitating contamination analyses for performance-optimization …
Figure 16: modin_project-modin_2: Modin’s AutoSwitchBackend feature, enabled by default, triggered an expensive type conversion even when all inputs shared the same backend. The agent solution (openhands:claude-sonnet-4) identified and fixed a real bug in the caching logic, but this was not on the performance-critical path, resulting in a −0.1265 advantage compared to the human expert’s systemic fix that disabled Au…
Figure 17: optuna_optuna_6: Optuna’s _hypervolume.WFG class used a naive recursive algorithm for hypervolume computation that had a O(N^3) runtime for the common 3D case, when a O(N^2) approach was possible. Both the human and the agent identified and implemented the faster algorithm. However, the human’s solution used fully vectorized numpy operations, while the best agent (terminus-2:gpt-5) used a Python-level swe…
Figure 18: optuna_optuna_1: The original implementation of Optuna’s non-dominated sorting in multi-objective optimization cases emerged as a performance bottleneck when scaling to large number of trials (∼10000 trials). Both the best agent (terminus-2:gpt-5) and the human expert correctly identified the issue. The agent’s solution focused on optimizing the inner ranking function with a specialized O(n log n) Fenwi…
Figure 19: networkx_networkx_4: NetworkX’s connected_components and weakly_connected_components passed the total graph node count n to _plain_bfs without accounting for already-discovered nodes, missing an early-termination optimization. For disconnected graphs with large components explored last, this caused dramatic slowdowns—up to 367× for adversarial cases with n=1000. Both the best agent (openhands:gpt-5) and t…
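
The early-termination idea described in the Figure 19 caption can be sketched in plain Python. This is not NetworkX's code, nor the patch from the task: the dictionary-of-sets graph and the function names `plain_bfs` and `connected_components` are assumptions made for illustration. The point is that each BFS is told how many nodes are still undiscovered, so it can stop as soon as it has found them all.

```python
def plain_bfs(adj, n_remaining, source):
    """Return the set of nodes reachable from `source`.

    Stops early once `n_remaining` nodes have been found, because no
    undiscovered node can be left to visit after that point.
    """
    seen = {source}
    frontier = [source]
    while frontier:
        next_frontier = []
        for v in frontier:
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    next_frontier.append(w)
            if len(seen) == n_remaining:
                return seen
        frontier = next_frontier
    return seen


def connected_components(adj):
    """Yield components of an undirected graph given as {node: set(neighbours)}."""
    seen = set()
    n = len(adj)
    for v in adj:
        if v not in seen:
            # Pass the number of nodes not yet assigned to a component,
            # rather than the total n, so the BFS can terminate early.
            component = plain_bfs(adj, n - len(seen), v)
            seen.update(component)
            yield component


# Illustrative graph (made up): a path, an edge, and an isolated node.
graph = {0: {1}, 1: {0, 2}, 2: {1}, 3: {4}, 4: {3}, 5: set()}
components = list(connected_components(graph))
```

Passing the total `n` instead of `n - len(seen)` would still give correct components; it would only prevent the early return from firing for every component after the first, which is the missed optimization the caption describes.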
Figure 20: pybamm_team-pybamm_1: PyBaMM’s ProcessedVariable sensitivity computation in IDAKLUSolver used an incremental per-timestep concatenation operation, creating a quadratic memory allocation overhead. Both the best agent (openhands:gpt-5) and the expert identified that, instead of each loop iteration building a progressively larger matrix by concatenating to the existing result, it would be more efficient to f…
Figure 21: shapely_shapely_1: The deprecate_positional decorator in Shapely called inspect.signature and sig.bind_partial on every decorated function invocation, causing a 300–1000% performance regression. Users reported significant Polygon creation slowdowns. The best agent (terminus-2:claude-sonnet-4) and the human expert converged on nearly identical core strategies. Both implemented a caching layer to move signa…
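
The caching strategy described in the Figure 21 caption can be illustrated with a small decorator. This is a sketch under assumptions, not Shapely's `deprecate_positional` or either patch: the decorator name `warn_positional`, its argument, and the warning text are hypothetical. It resolves parameter positions with `inspect.signature` once, when the function is decorated, so each call performs only a cheap length comparison.

```python
import functools
import inspect
import warnings


def warn_positional(names):
    """Warn when any parameter in `names` is passed positionally (hypothetical helper)."""

    def decorator(func):
        # Introspection happens once, at decoration time, not on every call.
        params = list(inspect.signature(func).parameters)
        # Smallest positional index among the deprecated parameters.
        first_deprecated = min(params.index(name) for name in names)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Hot path: a single integer comparison.
            if len(args) > first_deprecated:
                warnings.warn(
                    f"positional use of {names} is deprecated",
                    DeprecationWarning,
                    stacklevel=2,
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@warn_positional(["scale"])
def make_box(width, height, scale=1.0):
    return (width * scale, height * scale)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    quiet = make_box(2, 3, scale=2.0)  # keyword use: no warning
    n_quiet = len(caught)
    noisy = make_box(2, 3, 2.0)        # positional use: one warning
    n_noisy = len(caught)
```

The design choice is to trade a little work at import time for a constant-time check per call, which matters when the decorated function sits on a hot path such as geometry construction.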
Figure 22: Illustration of Hierarchical Grouping of Pandas Workloads. By construction, each workload in…
Figure 23: Visual intuition for Agent Advantage (Adv_agent; §2). Each cross (✗) represents an individual workload using the expert-derived speedup (speedup_expert) and the agent-derived speedup (speedup_agent). The identity function line represents equal advantage (i.e., speedup_expert = speedup_agent). Then, the agent advantage is the mean weighted deviation from the equal advantage line. The plot also showcases four op…
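
One way to read the Figure 23 caption is as a weighted mean of per-workload gaps between agent and expert speedups. The sketch below follows only that caption wording; the paper's exact definition in §2 (the weights, any log scaling, the sign convention) is not reproduced here, so the function name `agent_advantage`, the uniform default weights, and the sample numbers are all assumptions.

```python
def agent_advantage(speedup_agent, speedup_expert, weights=None):
    """Weighted mean of (agent speedup - expert speedup) across workloads."""
    if len(speedup_agent) != len(speedup_expert):
        raise ValueError("need one agent and one expert speedup per workload")
    if weights is None:
        # Assumed default: every workload counts equally.
        weights = [1.0] * len(speedup_agent)
    total = sum(weights)
    gaps = (w * (a - e) for w, a, e in zip(weights, speedup_agent, speedup_expert))
    return sum(gaps) / total


# Made-up speedups for three workloads (1.0 means no change from baseline).
agent = [1.10, 0.95, 1.50]
expert = [1.30, 1.00, 1.40]
adv = agent_advantage(agent, expert)
```

Under this reading, a value of zero means the agent sits on the identity line on average, and a negative value means it trailed the expert.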
Figure 24: Visualization of advantage for Terminus 2 Agents. Refer to Figure…
Figure 25: Example task in FORMULACODE for Qiskit/qiskit (PR: https://github.com/Qiskit/qiskit/pull/14782). The prompt presents a complete optimization task, including the performance goal, the benchmarking and profiling tools (Pytest and ASV), a structured optimization workflow, and concrete repository context with motivating performance observations. The “Relevant Issues” section contains GitHub issues that are d…
Figure 26: Example task in FORMULACODE for shapely/shapely (PR: https://github.com/shapely/shapely/pull/2283)
Figure 27: Example task in FORMULACODE for pandas-dev/pandas (PR: https://github.com/pandas-dev/pandas/pull/59608)
Original abstract

Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to assess holistic optimization behavior. We introduce FormulaCode, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics. FormulaCode comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and, on average, 264.6 community-maintained performance workloads per task, enabling the holistic ability of LLM agents to optimize codebases under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents. Project website at: https://formula-code.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FormulaCode, a benchmark for evaluating LLM coding agents on repository-scale optimization. It comprises 957 performance bottlenecks mined from scientific Python GitHub repositories, each paired with expert-authored patches and an average of 264.6 community-maintained performance workloads per task. The work evaluates frontier agents under multi-objective correctness and performance constraints and concludes that such optimization remains a major challenge.

Significance. If the benchmark tasks prove representative, this provides a valuable advance over synthetic or single-objective code benchmarks by grounding evaluation in real repositories, expert patches, and community workloads. The multi-objective framing and scale could help identify concrete limitations in current agentic systems and motivate more robust optimization techniques.

major comments (2)
  1. [§3] Benchmark Construction: The description of how the 957 bottlenecks were mined, how expert patches were validated (correctness tests plus measured speedup), and how the workloads (264.6 per task on average) were selected lacks explicit criteria, filtering rules, or statistical justification for representativeness. This is load-bearing for the central claim: the reported agent failures demonstrate a general challenge only if the tasks are authentic proxies rather than a sample biased toward easily detectable or low-impact cases.
  2. [§5] Evaluations: The manuscript does not report the number of independent runs, error bars, statistical significance tests, or the precise aggregation method for multi-objective scores (e.g., how correctness, speedup, and other metrics are combined). Without these, it is difficult to verify that the data robustly support the conclusion that repository-scale optimization is a major challenge for frontier agents.
minor comments (2)
  1. [Introduction] The abstract and introduction should include a brief comparison table or explicit discussion of how FormulaCode differs from related benchmarks such as SWE-bench in terms of objectives and scale.
  2. [Results] Figure captions and axis labels in the results section could be clarified to indicate whether reported metrics are averages across workloads or per-bottleneck aggregates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. The comments highlight important aspects of benchmark construction and evaluation reporting that we will address in revision. We respond to each major comment below.

Point-by-point responses
  1. Referee: [§3] Benchmark Construction: The description of how the 957 bottlenecks were mined, how expert patches were validated (correctness tests plus measured speedup), and how the workloads (264.6 per task on average) were selected lacks explicit criteria, filtering rules, or statistical justification for representativeness. This is load-bearing for the central claim: the reported agent failures demonstrate a general challenge only if the tasks are authentic proxies rather than a sample biased toward easily detectable or low-impact cases.

    Authors: We agree that the current description in §3 would benefit from greater explicitness. In the revised manuscript we will add a new subsection that specifies the exact mining criteria (performance bottlenecks identified via profiling with measurable speedup potential and at least one community workload), the filtering rules applied (e.g., exclusion of tasks without reproducible correctness tests or with <5% potential improvement), the patch validation protocol (unit-test passage plus measured wall-clock speedup on held-out workloads), and statistical justification for representativeness (distribution of repository sizes, scientific domains, and performance-impact quantiles across the 957 tasks). These additions will clarify that the tasks constitute authentic proxies rather than a biased subset. revision: yes

  2. Referee: [§5] Evaluations: The manuscript does not report the number of independent runs, error bars, statistical significance tests, or the precise aggregation method for multi-objective scores (e.g., how correctness, speedup, and other metrics are combined). Without these, it is difficult to verify that the data robustly support the conclusion that repository-scale optimization is a major challenge for frontier agents.

    Authors: We acknowledge that these methodological details were omitted. In the revised §5 we will report the exact number of independent runs per agent-task pair, include error bars on all figures and tables, describe the statistical tests performed (e.g., Wilcoxon signed-rank tests with Bonferroni correction for pairwise agent comparisons), and provide the precise multi-objective aggregation formula (a weighted combination of correctness rate and geometric-mean speedup, with sensitivity analysis to alternative weightings). These changes will allow readers to assess the robustness of the reported challenge. revision: yes
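
The aggregation that the simulated rebuttal promises, a weighted combination of correctness rate and geometric-mean speedup, could look like the following. The rebuttal gives no formula, so the weight `alpha`, the function name `combined_score`, and the sample values are assumptions chosen only to make the idea concrete.

```python
from statistics import geometric_mean


def combined_score(passed, speedups, alpha=0.5):
    """Weighted combination of correctness rate and geometric-mean speedup.

    passed:   list of booleans, one per correctness test
    speedups: list of positive ratios baseline_time / patched_time
    alpha:    weight on correctness (hypothetical); 1 - alpha goes to speed
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    correctness_rate = sum(passed) / len(passed)
    return alpha * correctness_rate + (1.0 - alpha) * geometric_mean(speedups)


# Made-up inputs: three of four tests pass; speedups multiply to 1.0.
score = combined_score([True, True, False, True], [2.0, 0.5, 1.0], alpha=0.5)

# A simple sensitivity sweep over alternative weightings.
sweep = {a: combined_score([True, True, False, True], [2.0, 0.5, 1.0], alpha=a)
         for a in (0.0, 0.5, 1.0)}
```

The geometric mean is used for ratios because a 2× speedup and a 2× slowdown cancel to 1.0, which an arithmetic mean would not give.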

Circularity Check

0 steps flagged

No circularity: benchmark paper with external grounding and no derivations or self-referential predictions

full rationale

The paper introduces FormulaCode, a benchmark of 957 bottlenecks mined from GitHub scientific Python repositories, paired with expert patches and community workloads (average 264.6 per task). The central claim—that repository-scale multi-objective optimization remains a major challenge—is an empirical observation from agent evaluations on these externally sourced tasks. No equations, fitted parameters, predictions, uniqueness theorems, or self-citations are described that would reduce any result to the inputs by construction. The work is self-contained as a benchmark introduction relying on independent GitHub data and community contributions rather than internal definitions or load-bearing self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the representativeness of the mined tasks and workloads as proxies for real optimization challenges; this is a domain assumption not independently validated in the abstract.

axioms (1)
  • domain assumption: Mined performance bottlenecks from GitHub scientific Python repositories, paired with expert patches and community workloads, constitute a valid test of holistic agentic optimization.
    Invoked to support the claim that evaluations reveal a major challenge for frontier agents.

pith-pipeline@v0.9.0 · 5701 in / 1149 out tokens · 41119 ms · 2026-05-21T10:38:57.992259+00:00 · methodology
