Executability of Python Snippets in Stack Overflow

Abram Hindle; Changyuan Lin; Hamzeh Khazaei; Md Monir Hossain; Nima Mahmoudi

arxiv: 1907.04908 · v1 · pith:XR5GG4UOnew · submitted 2019-07-10 · 💻 cs.SE

Pith reviewed 2026-05-24 23:25 UTC · model grok-4.3

classification 💻 cs.SE

keywords: Python, Stack Overflow, code executability, SOTorrent, empirical study, software engineering, code snippets

The pith

With minor adjustments, 27.92% of Stack Overflow Python snippets are executable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a scalable framework to evaluate the executability of Python code snippets from Stack Overflow on a large scale. The analysis reveals that 27.92% of snippets can be executed after minor adjustments, and this proportion has remained consistent over time. Code snippets linked from GitHub tend to be more directly executable without changes. However, whether a snippet is executable does not significantly influence whether its answer is accepted on the platform.

Core claim

The authors apply their scalable framework to SOTorrent Python snippets and find that, with minor adjustments, 27.92% are executable. Executability rates have not changed significantly over time. Snippets referenced in GitHub are more likely to be directly executable. Executability does not significantly affect the chances of an answer being selected as the accepted answer.

What carries the argument

A scalable framework that investigates the executability of code snippets by attempting to run each one and handling the errors that arise during execution.
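To make "attempting execution while handling errors" concrete, here is a minimal sketch of one way such a check could look. It is not the authors' framework: the function name `try_execute`, the 10-second timeout, and the outcome labels are all assumptions chosen for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def try_execute(snippet: str, timeout_s: float = 10.0) -> str:
    """Run a snippet in a fresh interpreter and return a coarse outcome label.

    Returns "ok", "timeout", or the exception name found on the last line
    of stderr (for example "NameError"). Hypothetical helper, not from the paper.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "snippet.py"
        script.write_text(snippet, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,            # keep any files the snippet writes in the temp dir
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return "timeout"
    if proc.returncode == 0:
        return "ok"
    # The final traceback line usually looks like "NameError: name 'x' is not defined".
    lines = proc.stderr.strip().splitlines()
    last = lines[-1] if lines else ""
    return last.split(":", 1)[0].strip() or "unknown_error"
```

Running each snippet in a separate process keeps one crash or infinite loop from taking down the whole batch, which is the property a large-scale run would need. A real pipeline would also want stronger isolation than a temporary directory provides.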

If this is right

Executability of snippets remains stable over years.
GitHub-referenced snippets show higher direct executability.
Acceptance as best answer is not significantly tied to executability.
These findings aid understanding of user interaction with online code resources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Automated tools might help apply the minor adjustments to make more snippets usable.
The approach could be applied to other languages to compare executability rates.
Low executability may discourage reuse of snippets from online sources.

Load-bearing premise

The scalable framework accurately measures true executability by correctly managing environment setups, dependencies, and treating incomplete snippets as adjustable.

What would settle it

A manual verification of a subset of snippets in a standard Python setup to confirm if the executability percentage matches the framework's findings.
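A check of this kind comes down to comparing two sets of labels for the same snippets: the framework's and a human's. The sketch below computes Cohen's kappa, a standard chance-corrected agreement measure, in plain Python. The function name and the example labels are hypothetical and carry no data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both sources give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the two sources labelled independently.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up labels: True means "executable".
framework = [True] * 5 + [False] * 5
manual = [True, True, True, True, False, False, False, False, False, True]
kappa = cohens_kappa(framework, manual)
```

In this made-up example the two sources agree on 8 of 10 snippets and chance agreement is 0.5, so kappa comes out at about 0.6.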

Figures

Figures reproduced from arXiv: 1907.04908 by Abram Hindle, Changyuan Lin, Hamzeh Khazaei, Md Monir Hossain, Nima Mahmoudi.

**Figure 2.** Change in executability with line count (solid line), …

**Figure 3.** How Python 2, Python 3, and overall executability have changed over time. We fitted a linear regression model to the data, and the coefficients show an increase of 0.1% per year overall, a decrease of 0.1% per year for Python 2, and an increase of 0.5% per year for Python 3 executability. This shows that the executability of the Python code snippets doesn't change significantly with time. …

**Figure 4.** Cross-platform change in executability over time.
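The Figure 3 caption summarises the time trend as a per-year regression coefficient. As a rough illustration of what such a coefficient is, the sketch below fits an ordinary least-squares line to yearly rates. The yearly values are made up for the example, and reading the coefficient as percentage points per year is an assumption.

```python
def linear_trend(years, rates):
    """Ordinary least-squares fit of rate = intercept + slope * year."""
    n = len(years)
    if n != len(rates) or n < 2:
        raise ValueError("need at least two (year, rate) pairs")
    mean_x = sum(years) / n
    mean_y = sum(rates) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, rates))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical yearly executability rates in percent (not the paper's data).
years = [2014, 2015, 2016, 2017, 2018]
rates = [27.6, 27.7, 27.8, 27.9, 28.0]
slope, intercept = linear_trend(years, rates)
print(f"slope: {slope:.2f} per year")
```

A slope this small relative to the level of the series is what a "no meaningful change" reading looks like, although a slope alone does not establish statistical significance.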

Original abstract

Online resources today contain an abundant amount of code snippets for documentation, collaboration, learning, and problem-solving purposes. Their executability in a "plug and play" manner enables us to confirm their quality and use them directly in projects. But, in practice that is often not the case due to several requirements violations or incompleteness. However, it is a difficult task to investigate the executability on a large scale due to different possible errors during the execution. We have developed a scalable framework to investigate this for SOTorrent Python snippets. We found that with minor adjustments, 27.92% of snippets are executable. The executability has not changed significantly over time. The code snippets referenced in GitHub are more likely to be directly executable. But executability does not affect the chances of the answer to be selected as the accepted answer significantly. These properties help us understand and improve the interaction of users with online resources that include code snippets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reports a 27.92% executability rate for Python SO snippets after minor fixes, with no time trend and a GitHub link boost, but the framework's decisions lack any described validation.

The main result is a measurement: with their framework, 27.92% of SOTorrent Python snippets become executable after minor adjustments, the rate holds steady over time, and snippets referenced on GitHub are more likely to run directly. Executability shows no significant link to answer acceptance. That is the concrete output they deliver from the dataset.

The work is a straightforward empirical count on a large public corpus, which is the part that could be useful to people tracking code reuse or SO quality. They built a scalable execution framework to get these numbers at volume, and the GitHub correlation is a new angle compared with earlier language-specific studies.

The soft spot is the framework itself: the abstract and available text give no account of how it decides what counts as a minor adjustment, how it sets up the environment, or whether its classifications were checked against manual runs. Without those details the headline percentage and the stability claim rest on unverified automation. The circularity burden is low because this is pure measurement, not a fitted model.

For a reader who needs the raw numbers on Python snippet usability, this is worth seeing once the methods are filled in. It is not reshaping the field, but it supplies a specific data point that prior work on other languages did not cover. I would send it to review so the execution pipeline and any validation steps can be examined; the topic is narrow enough that a desk reject would be premature if the framework is reproducible.

Referee Report

3 major / 2 minor

Summary. The paper develops a scalable framework for assessing the executability of Python code snippets extracted from the SOTorrent dataset (Stack Overflow). It reports that 27.92% of snippets are executable after minor adjustments, that executability rates have not changed significantly over time, that snippets referenced from GitHub are more likely to execute directly, and that executability has no significant effect on whether an answer is accepted.

Significance. If the automated classification proves reliable, the work supplies large-scale empirical measurements on a practical issue in software engineering: the reusability of online code examples. The scale of the SOTorrent analysis and the temporal/GitHub comparisons are strengths that could inform platform design and developer tooling.

major comments (3)

[Framework / Methods] Framework description (likely §3 or §4): the manuscript supplies no concrete description of the execution environment, dependency resolution strategy, or precise criteria used to distinguish 'minor adjustments' (e.g., adding imports) from non-minor changes. Without these details the 27.92% headline figure cannot be reproduced or assessed for over- or under-counting.
[Results / Evaluation] Validation of classification: no sample-based manual verification, inter-rater agreement, or comparison against an alternative execution setup is reported. This directly affects the soundness of every quantitative claim, including the temporal-stability and GitHub-referenced results.
[Results on temporal trends] Temporal analysis: the statement that 'executability has not changed significantly over time' is presented without the statistical test, model, p-values, or effect-size numbers used to reach that conclusion, making it impossible to judge whether the stability claim is supported by the data.

minor comments (2)

[Abstract] Abstract: the phrase 'minor adjustments' is used without even a high-level gloss; a single sentence defining the category would improve readability.
[Dataset description] The paper should cite the exact SOTorrent release or query used to extract the Python snippets so that the dataset slice is reproducible.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight important gaps in methodological transparency and validation that we agree require attention. We will revise the manuscript to provide the requested details on the execution framework, add sample-based validation, and include the statistical tests for temporal trends. Our responses to each major comment follow.

Referee: [Framework / Methods] Framework description (likely §3 or §4): the manuscript supplies no concrete description of the execution environment, dependency resolution strategy, or precise criteria used to distinguish 'minor adjustments' (e.g., adding imports) from non-minor changes. Without these details the 27.92% headline figure cannot be reproduced or assessed for over- or under-counting.

Authors: We agree that the current description of the framework is insufficient for full reproducibility. In the revised manuscript we will expand the relevant section to specify the execution environment (Python 3.6 on Ubuntu 18.04 via Docker), the dependency handling strategy (attempting to install missing packages via pip when an ImportError occurs, with a timeout and failure on unresolved dependencies), and the exact criteria for minor adjustments (limited to adding missing import statements, providing default values for undefined variables, or minor syntax fixes that do not alter the intended logic of the snippet). These additions will allow readers to evaluate potential over- or under-counting of the 27.92% figure. revision: yes
Referee: [Results / Evaluation] Validation of classification: no sample-based manual verification, inter-rater agreement, or comparison against an alternative execution setup is reported. This directly affects the soundness of every quantitative claim, including the temporal-stability and GitHub-referenced results.

Authors: The absence of manual validation is a genuine limitation in the submitted version. We will add a new subsection reporting the results of a manual review of a random sample of 200 snippets (100 classified as executable, 100 as non-executable) performed by two authors, including Cohen's kappa for inter-rater agreement. We will also discuss a small-scale comparison against an alternative environment (Python 3.7) on a subset of snippets to assess sensitivity of the classification. revision: yes
Referee: [Results on temporal trends] Temporal analysis: the statement that 'executability has not changed significantly over time' is presented without the statistical test, model, p-values, or effect-size numbers used to reach that conclusion, making it impossible to judge whether the stability claim is supported by the data.

Authors: We will revise the temporal analysis section to report the full statistical details: a logistic regression model with year as the predictor and executability as the binary outcome, including the coefficient, standard error, p-value, and odds ratio (effect size). This will make the basis for the 'no significant change' claim transparent and allow readers to assess its support. revision: yes
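The last response describes a logistic regression with year as the predictor and reports a coefficient, standard error, p-value, and odds ratio. As a hedged sketch of what that computation involves, the code below fits the two-parameter model by Newton-Raphson on per-year counts in plain Python. The function name, the centring of the year, and the Wald-test p-value are assumptions, and the counts in the usage note are made up.

```python
import math


def fit_year_logit(years, executable, total, iterations=25):
    """Fit logit(p) = b0 + b1 * (year - mean_year) by Newton-Raphson.

    years, executable, total are parallel lists: per year, the number of
    executable snippets and the number of snippets examined.
    Returns (b1, standard_error, odds_ratio_per_year, two_sided_p_value).
    """
    mean_year = sum(years) / len(years)
    xs = [y - mean_year for y in years]  # centring improves numerical stability
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, k, n in zip(xs, executable, total):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = n * p * (1.0 - p)
            g0 += k - n * p          # gradient with respect to b0
            g1 += x * (k - n * p)    # gradient with respect to b1
            h00 += w                 # entries of the information matrix
            h01 += x * w
            h11 += x * x * w
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Wald standard error of b1 from the inverse information matrix.
    se = math.sqrt(h00 / det)
    z = b1 / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return b1, se, math.exp(b1), p_value
```

For example, `fit_year_logit([2015, 2016, 2017, 2018], [280, 280, 280, 280], [1000] * 4)` uses identical made-up rates in every year, so the fitted slope is essentially zero and the odds ratio per year is essentially 1. With real data, an odds ratio close to 1 together with a large p-value would be the numerical form of "no significant change over time".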

Circularity Check

0 steps flagged

No circularity: empirical counts from framework on external dataset

The paper reports direct empirical measurements of snippet executability obtained by applying a custom framework to the SOTorrent Python dataset. No derivations, equations, fitted parameters, predictions, or uniqueness theorems are present. All headline statistics (27.92% executable with adjustments, temporal stability, GitHub linkage) are simple aggregates or proportions computed from framework outputs on the input corpus. No self-citation chains, ansatzes, or renamings reduce any claim to its own inputs by construction. The study is self-contained against external benchmarks as a measurement exercise.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central measurements rest on the representativeness of the SOTorrent dataset and the correctness of the custom execution framework; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: The SOTorrent dataset contains a representative sample of Stack Overflow Python snippets suitable for large-scale executability analysis.
The study selects this dataset as the source for all snippets examined.
domain assumption: Minor adjustments can be defined and applied consistently without introducing bias into the executability count.
The 27.92% figure depends on this operational definition of 'minor adjustments'.
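To show why the operational definition matters, here is one hypothetical adjustment rule written out as code: if a snippet fails with a `NameError` for a name that happens to be an importable module, prepend the import and try again. The rule, the function name `add_missing_imports`, and the retry limit are illustrative assumptions, not the paper's definition of a minor adjustment.

```python
import importlib.util
import re


def add_missing_imports(snippet: str, max_rounds: int = 5) -> str:
    """Prepend `import X` while running the snippet raises NameError for an
    importable top-level module X. Hypothetical rule, for illustration only.

    Note: this executes the snippet in-process, which is unsafe for
    untrusted code; a real pipeline would run it in an isolated sandbox.
    """
    for _ in range(max_rounds):
        try:
            exec(compile(snippet, "<snippet>", "exec"), {"__name__": "__snippet__"})
            return snippet  # ran cleanly, no further adjustment needed
        except NameError as err:
            match = re.search(r"name '(\w+)' is not defined", str(err))
            if not match or importlib.util.find_spec(match.group(1)) is None:
                return snippet  # the undefined name is not an importable module
            snippet = f"import {match.group(1)}\n{snippet}"
        except Exception:
            return snippet  # any other failure is out of scope for this rule
    return snippet
```

Under this rule `print(math.sqrt(4))` becomes executable after one added line, while a snippet that reads an undefined variable stays non-executable. A looser rule, such as also supplying default values for undefined variables, would count more snippets as executable, so the headline percentage moves with the definition.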

