
arxiv: 1907.04908 · v1 · pith:XR5GG4UOnew · submitted 2019-07-10 · 💻 cs.SE

Executability of Python Snippets in Stack Overflow

Pith reviewed 2026-05-24 23:25 UTC · model grok-4.3

classification 💻 cs.SE
keywords Python · Stack Overflow · code executability · SOTorrent · empirical study · software engineering · code snippets

The pith

With minor adjustments, 27.92% of Stack Overflow Python snippets are executable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a scalable framework to evaluate the executability of Python code snippets from Stack Overflow on a large scale. The analysis reveals that 27.92% of snippets can be executed after minor adjustments, and this proportion has remained consistent over time. Code snippets linked from GitHub tend to be more directly executable without changes. However, whether a snippet is executable does not significantly influence whether its answer is accepted on the platform.

Core claim

The authors apply their scalable framework to SOTorrent Python snippets and find that 27.92% are executable after minor adjustments. Executability rates have not changed significantly over time. Snippets referenced in GitHub are more likely to be directly executable. Executability does not significantly affect the chance that an answer is selected as the accepted answer.

What carries the argument

A scalable framework that investigates executability by attempting to run each snippet and handling the errors that arise during execution.
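The "attempt execution while handling errors" idea can be illustrated with a minimal sketch. This is not the authors' framework: the function name `classify_snippet`, the outcome labels, and the 5-second timeout are assumptions made for illustration. The sketch runs a snippet in a fresh interpreter and labels the result by the exception name on the last line of the traceback.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def classify_snippet(code: str, timeout_s: float = 5.0) -> str:
    """Run `code` in a fresh interpreter and return a coarse outcome label.

    Hypothetical helper: the labels and the timeout are illustrative choices,
    not values taken from the paper.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "snippet.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
                stdin=subprocess.DEVNULL,  # snippets that wait for input fail fast
            )
        except subprocess.TimeoutExpired:
            return "timeout"

    if proc.returncode == 0:
        return "executable"

    # For an uncaught exception, the last non-empty stderr line usually looks
    # like "NameError: name 'x' is not defined"; keep the part before the colon.
    lines = [ln for ln in proc.stderr.strip().splitlines() if ln.strip()]
    last = lines[-1] if lines else ""
    return last.split(":", 1)[0].strip() or "unknown_error"
```

Running each snippet in its own process keeps crashes, infinite loops, and leftover global state from affecting later snippets, which matters when the same loop is repeated over a very large corpus. A real deployment would also need sandboxing, since Stack Overflow snippets are untrusted code.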

If this is right

  • Executability of snippets remains stable over years.
  • GitHub-referenced snippets show higher direct executability.
  • Acceptance as best answer is not significantly tied to executability.
  • These findings aid understanding of user interaction with online code resources.
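A claim such as "acceptance is not significantly tied to executability" is typically checked on a 2x2 contingency table (executable or not, accepted or not). The sketch below computes a Pearson chi-square statistic and its p-value for such a table. The paper's actual test is not stated in this summary, so the choice of test and the function name `chi_square_2x2` are assumptions.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Illustrative layout (an assumption, not the paper's table):
        a = executable and accepted,     b = executable and not accepted
        c = non-executable and accepted, d = non-executable and not accepted
    Returns (chi2, p_value) for one degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

On a corpus as large as SOTorrent, even tiny differences can produce small p-values, so an effect size (for example the difference in acceptance rates) is worth reporting alongside the test.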

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automated tools might help apply the minor adjustments to make more snippets usable.
  • The approach could be applied to other languages to compare executability rates.
  • Low executability may discourage reuse of snippets from online sources.

Load-bearing premise

The scalable framework accurately measures true executability by correctly managing environment setups, dependencies, and treating incomplete snippets as adjustable.

What would settle it

A manual verification of a subset of snippets in a standard Python setup, to check whether the executability percentage matches the framework's findings.
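One way to judge whether a manually verified sample "matches" the reported 27.92% is to compute a confidence interval for the sample proportion and see whether the reported rate falls inside it. The sketch below uses the Wilson score interval. The sample size and counts in the usage note are hypothetical, and the function name is an assumption.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return centre - half_width, centre + half_width
```

For example, if a hypothetical manual check found 112 executable snippets out of 400, `wilson_interval(112, 400)` would give the range against which the framework's 27.92% could be compared.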

Figures

Figures reproduced from arXiv: 1907.04908 by Abram Hindle, Changyuan Lin, Hamzeh Khazaei, Md Monir Hossain, Nima Mahmoudi.

Figure 1: An overview of the developed system architecture.
Figure 2: Change in executability with line count (solid line),
Figure 3: Shows how Python2, Python3 and overall executability have changed in time. We fitted a linear regression model to the data and the coefficients show an increase of 0.1% per year for overall, a decrease of 0.1% per year for Python2 and an increase of 0.5% per year for Python3 executability. This shows that the executability of the python code snippets doesn’t change significantly with time
Figure 4: Cross platform change in executability over time.
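The Figure 3 caption describes fitting a linear regression to yearly executability and reading the slope as a change per year. A minimal ordinary least squares sketch of that step follows. The function name `fit_line` is an assumption, and the paper's actual fitting code and per-year data are not shown on this page.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept.

    With xs as years and ys as executability percentages, the slope is the
    change in percentage points per year.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A slope near zero, such as the 0.1% per year quoted in the caption, is consistent with the stability claim, although a slope alone does not say whether the trend is statistically significant.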
Original abstract

Online resources today contain an abundant amount of code snippets for documentation, collaboration, learning, and problem-solving purposes. Their executability in a "plug and play" manner enables us to confirm their quality and use them directly in projects. But, in practice that is often not the case due to several requirements violations or incompleteness. However, it is a difficult task to investigate the executability on a large scale due to different possible errors during the execution. We have developed a scalable framework to investigate this for SOTorrent Python snippets. We found that with minor adjustments, 27.92% of snippets are executable. The executability has not changed significantly over time. The code snippets referenced in GitHub are more likely to be directly executable. But executability does not affect the chances of the answer to be selected as the accepted answer significantly. These properties help us understand and improve the interaction of users with online resources that include code snippets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper develops a scalable framework for assessing the executability of Python code snippets extracted from the SOTorrent dataset (Stack Overflow). It reports that 27.92% of snippets are executable after minor adjustments, that executability rates have not changed significantly over time, that snippets referenced from GitHub are more likely to execute directly, and that executability has no significant effect on whether an answer is accepted.

Significance. If the automated classification proves reliable, the work supplies large-scale empirical measurements on a practical issue in software engineering: the reusability of online code examples. The scale of the SOTorrent analysis and the temporal/GitHub comparisons are strengths that could inform platform design and developer tooling.

major comments (3)
  1. [Framework / Methods] Framework description (likely §3 or §4): the manuscript supplies no concrete description of the execution environment, dependency resolution strategy, or precise criteria used to distinguish 'minor adjustments' (e.g., adding imports) from non-minor changes. Without these details the 27.92% headline figure cannot be reproduced or assessed for over- or under-counting.
  2. [Results / Evaluation] Validation of classification: no sample-based manual verification, inter-rater agreement, or comparison against an alternative execution setup is reported. This directly affects the soundness of every quantitative claim, including the temporal-stability and GitHub-referenced results.
  3. [Results on temporal trends] Temporal analysis: the statement that 'executability has not changed significantly over time' is presented without the statistical test, model, p-values, or effect-size numbers used to reach that conclusion, making it impossible to judge whether the stability claim is supported by the data.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'minor adjustments' is used without even a high-level gloss; a single sentence defining the category would improve readability.
  2. [Dataset description] The paper should cite the exact SOTorrent release or query used to extract the Python snippets so that the dataset slice is reproducible.
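To make the concern in major comment 1 concrete, here is one possible operational definition of a "minor adjustment": when a run fails with a NameError for a name that matches a known module, prepend the missing import and retry. This is a sketch under stated assumptions, not the paper's criteria. The allow-list `KNOWN_MODULES`, the retry limit, and the function names are all hypothetical.

```python
import re
import subprocess
import sys

# Hypothetical allow-list: only these names are treated as "forgot the import".
KNOWN_MODULES = {"math", "os", "re", "json", "random", "sys"}

NAME_ERROR = re.compile(r"NameError: name '(\w+)' is not defined")


def run(code: str, timeout_s: float = 5.0) -> tuple[int, str]:
    """Run code in a fresh interpreter; return (returncode, stderr)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            stdin=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return 1, "timeout"
    return proc.returncode, proc.stderr


def run_with_import_fixes(code: str, max_fixes: int = 3) -> tuple[bool, list[str]]:
    """Return (executable, imports_added), adding at most `max_fixes` imports."""
    added: list[str] = []
    for attempt in range(max_fixes + 1):
        returncode, stderr = run(code)
        if returncode == 0:
            return True, added
        if attempt == max_fixes:
            break
        match = NAME_ERROR.search(stderr)
        if not match:
            break
        name = match.group(1)
        if name not in KNOWN_MODULES or name in added:
            break
        added.append(name)
        code = f"import {name}\n{code}"
    return False, added
```

Returning the list of added imports makes the adjustment auditable: a snippet that ran unchanged can be counted separately from one that needed fixes, which is the distinction the referee asks the authors to define.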

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight important gaps in methodological transparency and validation that we agree require attention. We will revise the manuscript to provide the requested details on the execution framework, add sample-based validation, and include the statistical tests for temporal trends. Our responses to each major comment follow.

  1. Referee: [Framework / Methods] Framework description (likely §3 or §4): the manuscript supplies no concrete description of the execution environment, dependency resolution strategy, or precise criteria used to distinguish 'minor adjustments' (e.g., adding imports) from non-minor changes. Without these details the 27.92% headline figure cannot be reproduced or assessed for over- or under-counting.

    Authors: We agree that the current description of the framework is insufficient for full reproducibility. In the revised manuscript we will expand the relevant section to specify the execution environment (Python 3.6 on Ubuntu 18.04 via Docker), the dependency handling strategy (attempting to install missing packages via pip when an ImportError occurs, with a timeout and failure on unresolved dependencies), and the exact criteria for minor adjustments (limited to adding missing import statements, providing default values for undefined variables, or minor syntax fixes that do not alter the intended logic of the snippet). These additions will allow readers to evaluate potential over- or under-counting of the 27.92% figure. revision: yes

  2. Referee: [Results / Evaluation] Validation of classification: no sample-based manual verification, inter-rater agreement, or comparison against an alternative execution setup is reported. This directly affects the soundness of every quantitative claim, including the temporal-stability and GitHub-referenced results.

    Authors: The absence of manual validation is a genuine limitation in the submitted version. We will add a new subsection reporting the results of a manual review of a random sample of 200 snippets (100 classified as executable, 100 as non-executable) performed by two authors, including Cohen's kappa for inter-rater agreement. We will also discuss a small-scale comparison against an alternative environment (Python 3.7) on a subset of snippets to assess sensitivity of the classification. revision: yes

  3. Referee: [Results on temporal trends] Temporal analysis: the statement that 'executability has not changed significantly over time' is presented without the statistical test, model, p-values, or effect-size numbers used to reach that conclusion, making it impossible to judge whether the stability claim is supported by the data.

    Authors: We will revise the temporal analysis section to report the full statistical details: a logistic regression model with year as the predictor and executability as the binary outcome, including the coefficient, standard error, p-value, and odds ratio (effect size). This will make the basis for the 'no significant change' claim transparent and allow readers to assess its support. revision: yes
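The second response proposes reporting Cohen's kappa for two reviewers labelling the same sample. A minimal sketch of that statistic is given below. The function name and the label encoding are assumptions, and no results from the paper are reproduced here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Kappa corrects raw agreement for chance, so two reviewers who both label most snippets "non-executable" do not receive a high score merely because that class dominates.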

Circularity Check

0 steps flagged

No circularity: empirical counts from framework on external dataset


The paper reports direct empirical measurements of snippet executability obtained by applying a custom framework to the SOTorrent Python dataset. No derivations, equations, fitted parameters, predictions, or uniqueness theorems are present. All headline statistics (27.92% executable with adjustments, temporal stability, GitHub linkage) are simple aggregates or proportions computed from framework outputs on the input corpus. No self-citation chains, ansatzes, or renamings reduce any claim to its own inputs by construction. The study is self-contained against external benchmarks as a measurement exercise.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central measurements rest on the representativeness of the SOTorrent dataset and the correctness of the custom execution framework; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: The SOTorrent dataset contains a representative sample of Stack Overflow Python snippets suitable for large-scale executability analysis.
    The study selects this dataset as the source for all snippets examined.
  • domain assumption: Minor adjustments can be defined and applied consistently without introducing bias into the executability count.
    The 27.92% figure depends on this operational definition of 'minor adjustments'.


