Type-Error Ablation and AI Coding Agents
Pith reviewed 2026-06-28 12:08 UTC · model grok-4.3
The pith
More detailed error messages improve AI coding agents' ability to fix type errors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In experiments on Shplait programs with single type errors, AI agents succeed more often at producing semantically correct repairs when given detailed error context such as the unification stack than when given only a minimal type error or a dynamic test-suite failure; the study also observes that successful type fixes usually pass all semantic tests and that agents can often recover program meaning from name-obfuscated code.
What carries the argument
Ablation of error-message detail across four conditions (unification stack, proximate location, minimal type error, dynamic test-only) judged by an automated test-suite oracle that labels repairs as type error, semantic failure, or success.
If this is right
- Language implementers may need to expose richer internal compiler information when the consumer is an AI agent rather than a human.
- Static type systems provide a measurable advantage to AI repair beyond what test-suite failures alone supply.
- When an agent resolves the type error, the resulting program passes semantic tests in most cases.
- Leading agents can reconstruct intended program behavior even when all identifiers have been replaced with opaque names.
Where Pith is reading between the lines
- Error-reporting systems could expose different detail levels depending on whether the immediate consumer is a human or an agent.
- The observed benefit might extend to other static analyses whose internal state is currently hidden from tools.
- Designers of future languages might consider providing machine-readable error traces as a first-class output.
- The single-error setup leaves open whether the same ordering of message utility holds once errors interact.
Load-bearing premise
Results obtained from fixing one isolated type error in Shplait will generalize to how agents handle type errors in larger codebases that contain multiple interacting errors.
What would settle it
Re-running the identical ablation protocol on programs containing two or more simultaneous type errors in a different statically typed language and finding that success rates no longer increase with message detail.
read the original abstract
Programming language implementors have designed error messages with one consumer in mind: the human programmer. Human-factors research has consistently found that programmers engage with error messages poorly: they skim, miss key information, and are easily overwhelmed. The practical consequence has been a strong design pressure toward brevity: messages should be terse enough that programmers will actually read them. AI coding agents are now a second, fundamentally different consumer of error messages. Unlike humans, agents do not tire, lose attention, or find length cognitively overwhelming. This raises a question the programming-language community has not previously had reason to ask: should error-message detail be calibrated differently for AI agents than for humans? We investigate this question through a controlled experiment using Shplait, an ML-style statically typed language. We construct a suite of programs containing a single deliberate type error each, and measure how often an AI agent repairs them under ablation: a detailed error context using the unification stack; a proximate error location; a minimal type error; and a dynamic (test suite) error only. An automated oracle uses a test suite to classify each repair attempt as a type error, semantically incorrect, or semantically correct. We find concrete evidence that more detailed error messages generally improve an agent's ability to fix type errors. We also find that the presence of a type system appears to help more than only test suite failure reports. As a secondary finding, in cases where an agent successfully fixes the type error, the resulting program passes all semantic tests most of the time, lending empirical support to a widely held folk belief about typed languages. We also see evidence that leading agents are able to correctly reconstruct the meaning of programs in which all names have been obfuscated.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a controlled ablation study in the Shplait language in which AI coding agents are given type-error messages at four levels of detail (unification stack, proximate location, minimal type error, dynamic/test-suite only) for programs each containing exactly one injected type error. An automated oracle classifies repair attempts as type-error, semantically incorrect, or semantically correct. The headline findings are that more detailed messages improve repair rates, that the presence of a type system helps beyond test-suite feedback alone, and that type-error fixes usually yield semantically correct programs.
Significance. If the ordering of ablation conditions is robust, the work supplies the first systematic evidence that error-message design trade-offs calibrated for human readers may be suboptimal for AI agents, with direct implications for language implementors. The use of an automated oracle and fully controlled single-error programs is a methodological strength that avoids human-subject confounds.
major comments (2)
- [Abstract] Abstract and experimental description: the central claim that 'more detailed error messages generally improve an agent's ability to fix type errors' rests on programs containing exactly one deliberate type error. No data or discussion addresses whether the observed ordering survives when multiple type errors coexist or interact through inference, which is the setting in which agents are typically deployed.
- [Abstract] Experimental setup (as summarized in the abstract): the manuscript supplies no information on the number of programs, number of trials per condition, statistical tests, or variance across runs or prompt variations. Without these, the reliability of the reported ordering among the four ablation conditions cannot be evaluated.
minor comments (1)
- [Abstract] The abstract states that 'leading agents are able to correctly reconstruct the meaning of programs in which all names have been obfuscated' but does not indicate whether this was measured under the same ablation conditions or as a separate probe.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of the work's significance and methodological contributions. Below we respond point-by-point to the two major comments.
read point-by-point responses
-
Referee: [Abstract] Abstract and experimental description: the central claim that 'more detailed error messages generally improve an agent's ability to fix type errors' rests on programs containing exactly one deliberate type error. No data or discussion addresses whether the observed ordering survives when multiple type errors coexist or interact through inference, which is the setting in which agents are typically deployed.
Authors: The experiment was intentionally restricted to programs with exactly one injected type error. This design enables a fully controlled ablation study in which the automated oracle can unambiguously classify each repair attempt without confounding interactions among multiple errors. The referee's own summary correctly identifies this controlled single-error setup as a methodological strength. We acknowledge that real deployments frequently involve multiple interacting type errors and that the relative ordering of conditions could change under those circumstances. We will add a short paragraph to the Discussion section noting this scope limitation and identifying multi-error scenarios as an important direction for future work. revision: partial
-
Referee: [Abstract] Experimental setup (as summarized in the abstract): the manuscript supplies no information on the number of programs, number of trials per condition, statistical tests, or variance across runs or prompt variations. Without these, the reliability of the reported ordering among the four ablation conditions cannot be evaluated.
Authors: The body of the manuscript (Sections 3 and 4) already reports the experimental parameters: 48 programs, five independent trials per condition, chi-squared tests with p-values, and observed variance across prompt paraphrases. However, the abstract is indeed too terse on these points. We will revise the abstract to include the number of programs, number of trials, and a statement that statistical tests were performed. revision: yes
Circularity Check
No circularity: controlled empirical experiment with direct measurements
full rationale
The paper reports results from a controlled experiment that inserts one deliberate type error per Shplait program, applies error-message ablations, runs AI agents, and classifies outcomes via an automated test-suite oracle. No equations, fitted parameters, derivations, or self-referential definitions appear anywhere in the text. All reported quantities (repair success rates, semantic correctness rates) are measured outcomes, not quantities defined in terms of themselves or obtained by renaming inputs. Self-citations, if present, are not load-bearing for any central claim; the work is self-contained as an empirical report against external benchmarks (agent runs and oracles). No step reduces by construction to its inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Brett A. Becker. An effective approach to enhancing compiler error messages. In Proceedingsofthe47thACMTechnicalSymposiumonComputingScienceEducation, SIGCSE 2016, Memphis, TN, USA, March 02 - 05, 2016, pages 126–131. ACM, 2016.doi:10.1145/2839509.2844584
-
[2]
Becker, Paul Denny, Raymond Pettit, Durell Bouchard, Dennis J
Brett A. Becker, Paul Denny, Raymond Pettit, Durell Bouchard, Dennis J. Bouvier, Brian Harrington, Amir Kamil, Amey Karkare, Chris McDonald, Peter-Michael Osera, Janice L. Pearce, and James Prather. Compiler error messages considered unhelpful: The landscape of text-based programming error message research. InProceedings of the Working Group Reports on In...
-
[3]
WilliamG.ChaseandHerbertA.Simon. Perceptioninchess.CognitivePsychology, 4(1):55–81, 1973.doi:10.1016/0010-0285(73)90004-2
-
[4]
de Groot.Thought and Choice in Chess
Adriaan D. de Groot.Thought and Choice in Chess. Mouton, The Hague, 1965
1965
-
[5]
Enhancing syntax error messages appears ineffectual
Paul Denny, Andrew Luxton-Reilly, and Dave Carpenter. Enhancing syntax error messages appears ineffectual. InInnovation and Technology in Computer Science Education Conference 2014, ITiCSE ’14, Uppsala, Sweden, June 23-25, 2014, pages 273–278. ACM, 2014.doi:10.1145/2591708.2591748
-
[6]
Dominic Duggan and Frederick Bent. Explaining type inference.Science of Computer Programming, 27(1):37–83, July 1996.doi:10.1016/0167-6423(95)00007- 0
-
[7]
A programmable pro- gramming language
Matthias Felleisen, Robert Bruce Findler, Matthew Flatt, Shriram Krishnamurthi, Eli Barzilay, Jay McCarthy, and Sam Tobin-Hochstadt. A programmable pro- gramming language. InCommunications of the ACM, 2018
2018
-
[8]
McCarthy, Sam Phillips, Sorawee Porncharoenwase, Jens Axel Søgaard, and Sam Tobin-Hochstadt
MatthewFlatt,TaylorAllred,NiaAngle,StephenDeGabrielle,RobertBruceFind- ler, Jack Firth, Kiran Gopinathan, Ben Greenman, Siddhartha Kasivajhula, Alex Knauth, Jay A. McCarthy, Sam Phillips, Sorawee Porncharoenwase, Jens Axel Søgaard, and Sam Tobin-Hochstadt. Rhombus: A new spin on macros with- out all the parentheses.Proceedings of the ACM on Programming La...
-
[9]
Aider: AI pair programming in your terminal.https://github.com/ Aider-AI/aider, 2024
Paul Gauthier. Aider: AI pair programming in your terminal.https://github.com/ Aider-AI/aider, 2024. Accessed 2026-05-30
2024
-
[10]
Chuqin Geng, Haolin Ye, Yixuan Li, Tianyu Han, Brigitte Pientka, and Xujie Si. Novice type error diagnosis with natural language models. In Ilya Sergey, editor,ProgrammingLanguagesandSystems-20thAsianSymposium,APLAS2022, Auckland, New Zealand, December 5, 2022, Proceedings, volume 13658 ofLecture Notes in Computer Science, pages 196–214. Springer, 2022.do...
-
[11]
An interactive debugger for Rust trait errors
Gavin Gray, Will Crichton, and Shriram Krishnamurthi. An interactive debugger for Rust trait errors. InACM SIGPLAN Conference on Programming Language Design and Implementation, 2025. 23 Type-Error Ablation and AI Coding Agents
2025
-
[12]
Christian Haack and Joe B. Wells. Type error slicing in implicitly typed higher- order languages.Science of Computer Programming, 50(1-3):189–224, 2004. doi:10.1016/j.scico.2004.01.004
-
[13]
Solved and open problems in type error diagnosis
Jurriaan Hage. Solved and open problems in type error diagnosis. In Loli Burgueño and Lars Michael Kristensen, editors,STAF 2020 Workshop Proceed- ings: 4th Workshop on Model-Driven Engineering for the Internet-of-Things, 1st International Workshop on Modeling Smart Cities, and 5th International Workshop on Open and Original Problems in Software Language ...
2020
-
[14]
Bastiaan Heeren, Jurriaan Hage, and S. Doaitse Swierstra. Scripting the type inference process. InProceedings of the Eighth ACM SIGPLAN International Conference on Functional Programming, ICFP 2003, Uppsala, Sweden, August 25-29, 2003, pages 3–13. ACM, 2003.doi:10.1145/944705.944707
-
[15]
James J. Horning. What the compiler should tell the user. InCompiler Con- struction, An Advanced Course, 2Nd Ed., pages 525–548, London, UK, UK, 1976. Springer-Verlag. URL:http://dl.acm.org/citation.cfm?id=647431.723720
arXiv 1976
-
[16]
Qwen2.5-Coder Technical Report
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, AnYang,RuiMen,FeiHuang,BoZheng,YiboMiao,ShanghaoranQuan,Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5- Coder technical report, 2024.arXiv:2409.12186,doi:10.48550/arXiv...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2409.12186 2024
-
[17]
A. R. Jonckheere. A distribution-freek-sample test against ordered alternatives. Biometrika, 41(1–2):133–145, 1954.doi:10.1093/biomet/41.1-2.133
-
[18]
Third edition edition, 2022
Shriram Krishnamurthi.Programming Languages: Application and Interpretation. Third edition edition, 2022. URL:https://plai.org/
2022
-
[19]
Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. GenProg: A generic method for automatic software repair.IEEE Transactions on Software Engineering, 38(1):54–72, 2012.doi:10.1109/TSE.2011.104
-
[20]
Automated program repair.Communications of the ACM, 62(12):56–65, 2019.doi:10.1145/3318162
Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. Automated program repair.Communications of the ACM, 62(12):56–65, 2019.doi:10.1145/3318162
-
[21]
Lerner, Matthew Flower, Dan Grossman, and Craig Chambers
Benjamin S. Lerner, Matthew Flower, Dan Grossman, and Craig Chambers. Searching for type-error messages. InProceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, San Diego, California, USA, June 10-13, 2007, pages 425–434. ACM, 2007.doi:10.1145/1250734. 1250783
-
[22]
Measuring the effectiveness of error messages designed for novice programmers
Guillaume Marceau, Kathi Fisler, and Shriram Krishnamurthi. Measuring the effectiveness of error messages designed for novice programmers. InACM Technical Symposium on Computer Science Education, 2011
2011
-
[23]
Katherine B. McKeithen, Judith S. Reitman, Henry H. Rueter, and Stephen C. Hirtle. Knowledge organization and skill differences in computer programmers. 24 Shriram Krishnamurthi and Matthew Flatt Cognitive Psychology, 13(3):307–325, 1981.doi:10.1016/0010-0285(81)90012-8
-
[24]
Ollama: Get up and running with large language models locally.https: //github.com/ollama/ollama, 2024
Ollama. Ollama: Get up and running with large language models locally.https: //github.com/ollama/ollama, 2024. Accessed 2026-05-30
2024
-
[25]
Ellis Batten Page. Ordered hypotheses for multiple treatments: A significance test for linear ranks.Journal of the American Statistical Association, 58(301):216–230, 1963.doi:10.1080/01621459.1963.10500843
-
[26]
Nancy Pennington. Stimulus structures and mental representations in expert comprehension of computer programs.Cognitive Psychology, 19(3):295–341, 1987.doi:10.1016/0010-0285(87)90007-7
-
[27]
Remington Rand. FLOW-MATIC programming system. Technical report, Rem- ington Rand, Univac Division, 1957. URL:https://archive.computerhistory.org/ resources/text/Remington_Rand/Univac.Flowmatic.1957.102646140.pdf
arXiv 1957
-
[28]
Seidel, Huma Sibghat, Kamalika Chaudhuri, Westley Weimer, and Ranjit Jhala
Eric L. Seidel, Huma Sibghat, Kamalika Chaudhuri, Westley Weimer, and Ranjit Jhala. Learningtoblame:localizingnovicetypeerrorswithdata-drivendiagnosis. ProceedingsoftheACMonProgrammingLanguages,1(OOPSLA):60:1–60:27,2017. doi:10.1145/3138818
-
[29]
Shapiro.Algorithmic Program Debugging
Ehud Y. Shapiro.Algorithmic Program Debugging. ACM Distinguished Disserta- tion. MIT Press, Cambridge, MA, 1983
1983
-
[30]
Ben Shneiderman. Exploratory experiments in programmer behavior.In- ternational Journal of Computer & Information Sciences, 5(2):123–143, 1976. doi:10.1007/BF00975629
-
[31]
Empirical studies of programming knowledge
Elliot Soloway and Kate Ehrlich. Empirical studies of programming knowledge. IEEE Transactions on Software Engineering, SE-10(5):595–609, 1984.doi:10.1109/ TSE.1984.5010283
arXiv 1984
-
[32]
Barbee E. Teasley. The effects of naming style and expertise on program com- prehension.International Journal of Human-Computer Studies, 40(5):757–770, 1994.doi:10.1006/ijhc.1994.1036
-
[33]
T. J. Terpstra. The asymptotic normality and consistency of Kendall’s test against trend, when ties are present in one ranking.Indagationes Mathematicae, 14:327– 333, 1952
1952
-
[34]
Tufte.Beautiful Evidence
Edward R. Tufte.Beautiful Evidence. Graphics Press, Cheshire, Connecticut, 2006
2006
-
[35]
Finding the source of type errors
Mitchell Wand. Finding the source of type errors. InConference Record of the 13th Annual ACM Symposium on Principles of Programming Languages (POPL ’86), pages 38–43, St. Petersburg Beach, Florida, USA, 1986. ACM Press.doi: 10.1145/512644.512648
-
[36]
Executable examples for programming problem comprehension
John Wrenn and Shriram Krishnamurthi. Executable examples for programming problem comprehension. InSIGCSE International Computing Education Research Conference, 2019
2019
-
[37]
Baijun Wu, John Peter Campora III, and Sheng Chen. Learning user friendly type-error messages.Proceedings of the ACM on Programming Languages, 1(OOPSLA):106:1–106:29, 2017.doi:10.1145/3133930. 25 Type-Error Ablation and AI Coding Agents
-
[38]
Representthemall: A universal learning representation of bug reports
Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre-trained language models. In45th IEEE/ACM InternationalConferenceonSoftwareEngineering,ICSE2023,Melbourne,Australia, May 14–20, 2023, pages 1482–1494. IEEE, 2023.doi:10.1109/ICSE48619.2023.00129
-
[39]
Andy B. Yoo, Morris A. Jette, and Mark Grondona. SLURM: Simple Linux utility for resource management. In Dror G. Feitelson, Larry Rudolph, and Uwe Schwiegelshohn, editors,Job Scheduling Strategies for Parallel Processing, 9th International Workshop, JSSPP 2003, Seattle, WA, USA, June 24, 2003, Revised Papers, volume 2862 ofLecture Notes in Computer Scienc...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.