
arxiv: 1906.11456 · v1 · pith:QQNTKRAWnew · submitted 2019-06-27 · 💻 cs.SE

Enhancing Python Compiler Error Messages via Stack Overflow

Pith reviewed 2026-05-25 14:55 UTC · model grok-4.3

classification 💻 cs.SE
keywords compiler error messages · Stack Overflow · Python · IDE plugin · user study · error message enhancement · Pycee

The pith

Stack Overflow threads can be automatically mined and summarized to enhance Python compiler error messages inside an IDE.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the idea that discussions on Stack Overflow about Python errors contain usable information that can be collected and shown to programmers without them leaving their editor. The authors built Pycee, a Sublime Text plugin that queries Stack Overflow for each error and displays a custom summary. In a think-aloud study, 16 programmers completed tasks with Pycee and most said it was helpful, preferring it to a version that pulled from official Python documentation because it gave concrete fixes and code examples. The work shows that crowd-sourced Q&A content can be reused automatically to make error messages more actionable.

Core claim

Pycee automatically queries Stack Overflow to provide customised and summarised information about Python compiler errors within the Sublime Text IDE. When evaluated in a user study, the majority of the 16 participants agreed that Pycee was helpful, and they generally preferred it to a baseline using official Python documentation due to its concrete suggestions for fixes and example code.

What carries the argument

Pycee, an IDE plugin that automatically queries Stack Overflow and repackages relevant thread content as enhanced error messages.
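
To make the querying step concrete, here is a minimal sketch of how an error message might be turned into a Stack Overflow search. It is an illustration under stated assumptions, not Pycee's actual implementation: the helper names `parse_error` and `build_query_url` are hypothetical, and the choice of the public Stack Exchange API `/search/advanced` endpoint and its parameters is an assumption, since this summary does not say how Pycee builds its queries.

```python
import re
from urllib.parse import urlencode

# Assumption: the public Stack Exchange API search endpoint. The summary above
# does not say which endpoint or parameters Pycee itself uses.
API_URL = "https://api.stackexchange.com/2.3/search/advanced"


def parse_error(traceback_text):
    """Return (error_type, message) from the last non-empty traceback line."""
    lines = [ln.strip() for ln in traceback_text.splitlines() if ln.strip()]
    if not lines:
        return None
    match = re.match(
        r"^([A-Za-z_][\w.]*(?:Error|Exception|Warning)):?\s*(.*)$", lines[-1]
    )
    if match is None:
        return None
    return match.group(1), match.group(2)


def build_query_url(error_type, message):
    """Build a search URL for a Python error (hypothetical helper)."""
    # Drop quoted identifiers such as 'foo' so the query is not tied to one
    # program's variable names.
    generic = re.sub(r"'[^']*'", "", message)
    generic = re.sub(r"\s+", " ", generic).strip()
    params = {
        "order": "desc",
        "sort": "relevance",
        "q": f"{error_type} {generic}".strip(),
        "tagged": "python",
        "site": "stackoverflow",
    }
    return f"{API_URL}?{urlencode(params)}"
```

Stripping quoted identifiers is a deliberate simplification: it generalises messages such as `name 'foo' is not defined`, but it would also discard informative tokens in messages that quote type names, so a real tool would likely need a more careful rule.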

If this is right

  • Programmers receive fix suggestions and examples directly in the editor instead of searching separately.
  • Official documentation is no longer the only source for improving error messages.
  • Time spent resolving common Python errors can decrease for users of the enhanced messages.
  • The same reuse of online Q&A content becomes feasible for other programming tasks beyond error messages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on languages other than Python where Stack Overflow has dense error discussions.
  • If the summarization step is made more robust, the same pipeline might apply to runtime errors or warnings.
  • Integration into other editors would let the benefit reach programmers who do not use Sublime Text.
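
The summarization step mentioned above can be pictured with a small frequency-based extractive summarizer. This is a sketch of one simple, well-known family of techniques, not a description of what Pycee does; the stop-word list, the naive sentence splitter, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not taken from the paper).
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "in", "it", "you",
              "and", "or", "that", "this", "for", "be", "can", "not", "i"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lower-cased words of a sentence with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def summarize(text, max_sentences=2):
    """Keep the highest-scoring sentences, returned in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: (-score(sentences[i]), i))
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

Returning the selected sentences in their original order keeps the summary readable, which matters when the text is shown inline next to an error message.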

Load-bearing premise

Stack Overflow threads contain accurate, relevant, and summarizable information about Python errors that improves programmer understanding without introducing new confusion or incorrect advice.
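
One way a tool could guard against this premise failing is to filter answers before summarising them. The sketch below is a hypothetical heuristic, not Pycee's documented behaviour: it assumes each answer is a dict carrying the `is_accepted`, `score`, and `answer_id` fields found on Stack Exchange API answer objects, and the `min_score` threshold and accepted-first ordering are illustrative choices.

```python
def pick_answer(answers, min_score=1):
    """Return the most trustworthy-looking answer dict, or None.

    Each answer is expected to carry `is_accepted`, `score` and `answer_id`,
    mirroring the answer objects returned by the Stack Exchange API.
    """
    # Discard answers the community has not voted up (threshold is an assumption).
    candidates = [a for a in answers if a.get("score", 0) >= min_score]
    if not candidates:
        return None
    # Prefer accepted answers, then fall back to the community score.
    return max(
        candidates,
        key=lambda a: (bool(a.get("is_accepted")), a.get("score", 0)),
    )
```

Returning `None` when nothing passes the threshold lets the caller fall back to the plain compiler message rather than display low-quality advice. Votes and acceptance are only proxies for correctness, so a filter like this reduces the risk named in the premise without removing it.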

What would settle it

A controlled study in which programmers using the Stack Overflow summaries take longer to fix errors or introduce more new bugs than those using only the official documentation would falsify the central claim.
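
As an illustration of how such a comparison might be analysed, the following sketch runs a two-sided permutation test on time-to-fix measurements from two groups. The function name, the choice of test, and the default resample count are assumptions; the paper as summarised here reports no timing data, so any numbers passed in would come from a future study.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of mean time-to-fix."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_resamples):
        # Reassign participants to conditions at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_resamples + 1)
```

A permutation test makes no normality assumption, which suits the small samples typical of think-aloud studies. A small p-value indicates only that the groups differ; which condition was slower still has to be read from the group means.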

Figures

Figures reproduced from arXiv: 1906.11456 by Christoph Treude, Emillie Thiselton.

Figure 1: Screenshot of PYCEE. The first few lines on white background show the original compiler error message produced by Python, the additional lines show the enhanced error message produced by PYCEE. The message provided by PYCEE is a summary of Stack Overflow answer 2395167. Note that in the screenshot, the offending line has already been corrected.
… by adding related verbs and syntax from other programming lang…

Figure 2: Participant experience in years (log scale).

Figure 3: Perceived helpfulness of PYCEE variants.
… formal (3), e.g., P3 noted “This is not clear at all, I want plain English” after encountering an IndentationError. There was a general feeling among participants that a tool for enhancing compiler error messages should focus on common errors, as stated by P3: “You should use common (basic) errors as test cases when testing this plugin”. When using PYCEE, participants…

Figure 5: User satisfaction for PYCEE variants.
… answers by referring to the style of the enhanced error messages (4) and the presence of code examples (4). For example, P4 explained: “I liked the code examples and full sentences in normal English, written for humans”. On the other hand, one of the disadvantages of PYCEE is that it relies on information from Stack Overflow which may or may not be correct. Several parti…
Original abstract

Background: Compilers tend to produce cryptic and uninformative error messages, leaving programmers confused and requiring them to spend precious time to resolve the underlying error. To find help, programmers often take to online question-and-answer forums such as Stack Overflow to start discussion threads about the errors they encountered.

Aims: We conjecture that information from Stack Overflow threads which discuss compiler errors can be automatically collected and repackaged to provide programmers with enhanced compiler error messages, thus saving programmers' time and energy.

Method: We present Pycee, a plugin integrated with the popular Sublime Text IDE to provide enhanced compiler error messages for the Python programming language. Pycee automatically queries Stack Overflow to provide customised and summarised information within the IDE. We evaluated two Pycee variants through a think-aloud user study during which 16 programmers completed Python programming tasks while using Pycee.

Results: The majority of participants agreed that Pycee was helpful while completing the study tasks. When compared to a baseline relying on the official Python documentation to enhance compiler error messages, participants generally preferred Pycee in terms of helpfulness, citing concrete suggestions for fixes and example code as major benefits.

Conclusions: Our results confirm that data from online sources such as Stack Overflow can be successfully used to automatically enhance compiler error messages. Our work opens up venues for future work to further enhance compiler error messages as well as to automatically reuse content from Stack Overflow for other aspects of programming.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Pycee, a Sublime Text IDE plugin that automatically queries Stack Overflow to retrieve, summarize, and display customized information alongside Python compiler error messages. Two variants are evaluated in a think-aloud study with 16 participants who completed programming tasks; results show that the majority found Pycee helpful and generally preferred it to a baseline that enhanced messages using official Python documentation, primarily due to concrete fix suggestions and example code. The authors conclude that Stack Overflow data can be successfully reused to enhance compiler error messages.

Significance. If the central claim holds, the work demonstrates a practical approach to repurposing online Q&A content for IDE tooling, supported by an empirical user study with direct baseline comparison. This provides qualitative evidence of user preference and opens directions for similar applications to other languages or programming activities. The inclusion of a controlled think-aloud protocol with participant feedback is a positive aspect of the evaluation design.

major comments (2)
  1. [Results / User Study] The claim that Stack Overflow data can be 'successfully used' to enhance error messages is supported only by subjective reports of helpfulness and preference from 16 participants. No objective metrics (task completion rates, time-to-fix, or pre/post understanding scores) are reported, so the evidence does not directly address whether the SO-derived content improves error resolution or merely appears appealing.
  2. [Method] The baseline condition uses 'official Python documentation to enhance compiler error messages,' yet the paper provides no description of how this baseline was implemented or how its content was selected and presented, preventing assessment of whether the observed preference is attributable to SO content specifically or to differences in summarization style.
minor comments (1)
  1. [Abstract] The abstract states that 'two Pycee variants' were evaluated but does not indicate what distinguishes the variants or which results apply to each.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below, indicating planned revisions where appropriate.

  1. Referee: [Results / User Study] The claim that Stack Overflow data can be 'successfully used' to enhance error messages is supported only by subjective reports of helpfulness and preference from 16 participants. No objective metrics (task completion rates, time-to-fix, or pre/post understanding scores) are reported, so the evidence does not directly address whether the SO-derived content improves error resolution or merely appears appealing.

    Authors: We agree that the evaluation relies on subjective participant reports from a think-aloud study with 16 programmers rather than objective measures such as task completion time or error resolution accuracy. The study design prioritised qualitative insights into perceived helpfulness and preference under realistic conditions, which aligns with the goal of assessing tool usability. However, the concluding claim that Stack Overflow data can be 'successfully used' is stronger than the subjective evidence warrants. We will revise the Conclusions section to state that the results provide evidence of user preference for the SO-enhanced messages, without claiming objective improvements in error resolution. revision: partial

  2. Referee: [Method] The baseline condition uses 'official Python documentation to enhance compiler error messages,' yet the paper provides no description of how this baseline was implemented or how its content was selected and presented, preventing assessment of whether the observed preference is attributable to SO content specifically or to differences in summarization style.

    Authors: We accept this criticism. The baseline was created by selecting relevant excerpts from the official Python documentation for each encountered error and formatting them similarly to the Pycee output. We will expand the Method section with a full description of baseline content selection, summarisation approach, and presentation format to enable clearer comparison. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation with external user feedback


The paper describes an empirical tool (Pycee) that queries Stack Overflow and a think-aloud user study with 16 participants comparing it to official documentation. No equations, fitted parameters, predictions, or derivations appear in the abstract or described method. The central claim rests on participant preference ratings, which constitute external feedback rather than any self-referential reduction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that community-generated Stack Overflow content is suitable for automated summarization and display in error messages. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Stack Overflow threads contain accurate and relevant information about Python compiler errors that can be automatically retrieved and summarized to help programmers.
    This premise underpins both the tool construction and the claim of successful enhancement; it is tested indirectly via user preference but not independently verified.

pith-pipeline@v0.9.0 · 5782 in / 1226 out tokens · 34007 ms · 2026-05-25T14:55:57.713700+00:00 · methodology

