Recognition: unknown
Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation
Pith reviewed 2026-05-10 04:17 UTC · model grok-4.3
The pith
Co-locating tests with implementation code produces measurably better AI-generated code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the Foundation Model era, test syntax structure is a software design concern: co-locating tests with implementation code produces measurably better AI-generated code. A large-scale comparison of inline (doctests) versus separated (#[test]) syntax on d-ary heap implementations shows inline tests achieving 100% preservation and 92-100% correctness across models, while separated tests expose model-tier differences and independence between preservation and correctness. Attention analysis on seven open architectures reveals 2.8-4.4× stronger focus on co-located test markers in five of the seven models, validated by knockout and steering experiments, and the pattern holds in a non-transformer gated-linear RNN.
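To make the contrast concrete, here is a minimal Python sketch of the inline condition: a doctest co-located with the function it exercises. The function and examples are illustrative only, not the paper's d-ary heap prompt; the separated condition in the study uses Rust #[test] blocks, which place tests in a syntactically distinct region away from the implementation.

```python
def heap_parent(i: int, d: int) -> int:
    """Return the parent index of node i in a d-ary heap.

    The test lives inline with the implementation (the "co-located"
    condition); a doctest runner executes the examples in this
    docstring directly.

    >>> heap_parent(7, 2)
    3
    >>> heap_parent(9, 3)
    2
    """
    return (i - 1) // d


if __name__ == "__main__":
    import doctest
    doctest.testmod()  # runs the examples embedded in the docstrings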
What carries the argument
The co-location of test syntax markers directly with implementation code, which draws stronger attention weights in foundation models during generation.
If this is right
- Inline test syntax delivers 100% preservation and 92-100% correctness across all tested models.
- Separated test syntax produces large correctness gaps (0-100%) that track model capability tiers.
- Preservation and correctness become independent under separated syntax but remain coupled under inline syntax.
- Inline test markers receive 2.8-4.4 times stronger attention in five of seven examined architectures.
- The co-location benefit appears in a gated-linear RNN as well as transformers, indicating robustness beyond current dominant architectures.
Where Pith is reading between the lines
- Language designers could prioritize inline-test-friendly syntax to improve future AI coding performance.
- Prompt techniques that simulate co-location without code changes might capture some of the same gains (a speculative sketch follows this list).
- The effect could be tested on other tasks such as refactoring or security review to check generality.
- Teams using AI assistants might adopt inline tests as a lightweight practice when the target language supports them.
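On the second point above, a prompt wrapper could re-present separated test cases as if they were inline examples, without touching the code under test. This is a speculative sketch, not a technique from the paper; simulate_colocation and the example strings are invented for illustration.

```python
def simulate_colocation(implementation: str, test_cases: list[tuple[str, str]]) -> str:
    """Build a prompt that presents separated test cases as if they were
    inline doctest examples, without changing the code under test.

    test_cases is a list of (expression, expected_output) pairs that
    would normally live in a separate test file.
    """
    inline_examples = "\n".join(
        f">>> {expr}\n{expected}" for expr, expected in test_cases
    )
    return (
        "Complete the following function. Keep the embedded examples "
        "exactly as written and make sure they pass:\n\n"
        f"{implementation}\n\n"
        "Embedded examples (treat these as part of the docstring):\n"
        f"{inline_examples}\n"
    )


# Hypothetical usage: separated tests re-presented next to the code.
prompt = simulate_colocation(
    "def heap_parent(i, d):\n    ...",
    [("heap_parent(7, 2)", "3"), ("heap_parent(9, 3)", "2")],
)
```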
Load-bearing premise
Observed differences in preservation and correctness stem from test syntax co-location rather than prompt length, token distribution, or model training biases.
What would settle it
Equalizing prompt lengths and token distributions for inline and separated test versions of the same task, then checking whether the large gaps in preservation and correctness disappear.
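A minimal sketch of that check, assuming the Hugging Face transformers tokenizer API; the checkpoint name and prompt file paths are placeholders, not artifacts from the paper.

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; any tokenizer matching the studied models works here.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")

def token_count(prompt: str) -> int:
    """Return the token count of a prompt under the chosen tokenizer."""
    return len(tokenizer.encode(prompt))

inline_prompt = open("prompts/inline_doctest.txt").read()      # hypothetical paths
separated_prompt = open("prompts/separated_tests.txt").read()

n_inline, n_sep = token_count(inline_prompt), token_count(separated_prompt)
print(f"inline: {n_inline} tokens, separated: {n_sep} tokens")
# If the correctness gap persists after padding or trimming prompts to equal
# token counts, prompt length alone cannot explain the effect.
```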
read the original abstract
AI coding assistants increasingly generate code alongside tests. How developers structure test code, whether inline with the implementation or in separate blocks, has traditionally been a matter of testing philosophy. We investigate whether this choice affects AI code generation quality. We conduct a large-scale empirical study (830+ generated files, 12 models, 3 providers) using SEGA, a three-dimensional evaluation framework measuring Determinism, Preservation, and Correctness. Comparing inline test syntax (Python doctests) against separated test syntax (Rust #[test] blocks) on a d-ary heap implementation, we find that: (1) inline tests yield near-perfect preservation (100%) and correctness (92-100%) across all models; (2) separated tests expose stark model-tier gaps (0-100% correctness) and independence between preservation and correctness; (3) model behavior evolves across generations, and notably one model breaks the test suppression pattern of its three predecessors; (4) mechanistic analysis on 7 open-source architectures (6 transformers and a gated-linear Recurrent Neural Network (RNN)) reveals inline test markers receive 2.8-4.4× stronger attention in 5/7 models, with causal validation via knockout and steering experiments on the 4 code-specialized transformers and RWKV-6; the co-location mechanism extends to a non-transformer architecture, suggesting the design recommendation is robust to future architectural shifts. In the Foundation Model era, test syntax structure is a software design concern: co-locating tests with implementation code produces measurably better AI-generated code. This arxiv long version includes appendices that further qualify the effect as bounded by both model capability and programming language.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in the foundation model era, test syntax structure is a software design concern: co-locating tests with implementation code (inline syntax such as Python doctests) produces measurably better AI-generated code than separated test syntax (e.g., Rust #[test] blocks). This is supported by a large-scale empirical study using the SEGA framework on 830+ generated files across 12 models from 3 providers for a d-ary heap task, showing near-perfect preservation (100%) and high correctness (92-100%) for inline tests versus stark model-tier gaps and independence of preservation and correctness for separated tests. The claim is further backed by mechanistic attention analysis (2.8-4.4× stronger attention to inline markers in 5/7 models) with causal validation via knockout and steering experiments on 4 code-specialized transformers and RWKV-6, extending to a non-transformer architecture.
Significance. If the central empirical result holds after isolating the operative variable, the work is significant for identifying a previously under-appreciated software design lever that affects foundation model code generation quality. It is grounded in a large-scale design (830+ files, 12 models) with attention to determinism/preservation/correctness via SEGA, plus mechanistic causal interventions (knockout and steering) across transformer and non-transformer architectures, and explicit qualification that the effect is bounded by model capability and language. This combination of scale, multi-architecture validation, and falsifiable design recommendation strengthens its potential impact on AI-assisted development practices.
major comments (2)
- [Abstract, Results, and Appendices] Abstract and central empirical comparison: the study pits Python doctests (inline, co-located) against Rust #[test] blocks (separated) on the identical d-ary heap task and attributes differences in preservation (near-100% inline) and correctness (92-100% inline vs 0-100% separated) to test syntax co-location. This design confounds syntax structure with language-specific factors including tokenization, prompt length distributions, and pre-training data exposure; no within-language controls (e.g., separate-test Python or inline-test Rust variants) are reported to isolate the claimed mechanism, which is load-bearing for the conclusion that 'test syntax structure is a software design concern'. One possible within-language control is sketched after these comments.
- [Mechanistic Analysis and Appendices] Mechanistic analysis section: the attention, knockout, and steering results (2.8-4.4x stronger attention to inline markers; causal validation on 4 transformers + RWKV-6) are performed across models but inherit the same Python-vs-Rust language variable as the main experiments. The appendices note language bounds yet the main claims do not provide language-matched controls or quantify how much of the attention differential survives within-language comparison, weakening the robustness argument for future architectures.
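To make the requested control concrete, the same Python task could be posed with tests in a separate pytest-style module, holding language, tokenizer, and pre-training exposure fixed. The layout below is an illustration of that control, not a condition the paper reports.

```python
# heap.py: implementation with no co-located tests (hypothetical layout)
def heap_parent(i: int, d: int) -> int:
    """Return the parent index of node i in a d-ary heap."""
    return (i - 1) // d


# test_heap.py: separated pytest-style tests; in a real project these would
# import heap_parent from heap.py, shown in one file here so the sketch runs
# as written.
def test_parent_binary():
    assert heap_parent(7, 2) == 3


def test_parent_ternary():
    assert heap_parent(9, 3) == 2
```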
minor comments (2)
- [Methods] Methods section: provide explicit details on prompt construction, token counts per condition, and any data exclusion rules or statistical controls for model-specific biases to allow readers to assess residual confounds.
- [Results] Figure and table clarity: ensure that preservation/correctness plots clearly label language and syntax conditions side-by-side and include confidence intervals or per-model breakdowns to support the 'model-tier gaps' claim.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report. We address each major comment point by point below, providing the strongest honest defense of the manuscript while agreeing to clarifications where the design leaves room for qualification.
read point-by-point responses
- Referee: [Abstract, Results, and Appendices] Abstract and central empirical comparison: the study pits Python doctests (inline, co-located) against Rust #[test] blocks (separated) on the identical d-ary heap task and attributes differences in preservation (near-100% inline) and correctness (92-100% inline vs 0-100% separated) to test syntax co-location. This design confounds syntax structure with language-specific factors including tokenization, prompt length distributions, and pre-training data exposure; no within-language controls (e.g., separate-test Python or inline-test Rust variants) are reported to isolate the claimed mechanism, which is load-bearing for the conclusion that 'test syntax structure is a software design concern'.
Authors: We acknowledge the cross-language comparison introduces potential confounds such as tokenization and pre-training exposure. However, the design uses the idiomatic, standard syntax for each language to ensure the comparison reflects actual developer practice rather than artificial constructs. Python doctests are the canonical inline mechanism, while Rust #[test] is the standard separated form; non-idiomatic variants (e.g., separate Python tests or inline Rust) would not test the syntax structures as they are used. The SEGA framework holds the task constant and directly measures preservation of the test syntax itself, yielding near-100% preservation for inline across all models. The appendices already bound results by language. We will partially revise the abstract, results, and discussion to more explicitly frame the comparison as between these representative syntax structures and to note that full within-language isolation is left for future work. The core empirical claims and data remain unchanged. revision: partial
- Referee: [Mechanistic Analysis and Appendices] Mechanistic analysis section: the attention, knockout, and steering results (2.8-4.4x stronger attention to inline markers; causal validation on 4 transformers + RWKV-6) are performed across models but inherit the same Python-vs-Rust language variable as the main experiments. The appendices note language bounds yet the main claims do not provide language-matched controls or quantify how much of the attention differential survives within-language comparison, weakening the robustness argument for future architectures.
Authors: The mechanistic analysis targets attention specifically to the test markers (doctest prompts vs. #[test] attributes) rather than general language features. These markers receive 2.8-4.4x stronger attention in 5/7 models, with knockout and steering providing causal evidence of their role in generation quality. The RWKV-6 result further supports extension beyond transformers. While we agree language-matched controls would strengthen isolation, they would require non-standard syntax variants not representative of the co-location mechanism under study. The appendices already qualify language bounds. We will partially revise the mechanistic section to add explicit quantification of marker-specific attention differentials and expanded caveats on generalizability, without altering the reported attention ratios, causal results, or main conclusions. revision: partial
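For readers who want to reproduce the flavor of this analysis, a rough sketch of measuring marker-specific attention with an open checkpoint and the Hugging Face API follows; the checkpoint name, marker locator, and aggregation are illustrative choices, not the paper's protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-Coder-1.5B"  # placeholder open checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = (
    'def heap_parent(i, d):\n'
    '    """\n'
    '    >>> heap_parent(7, 2)\n'
    '    3\n'
    '    """\n'
    '    return (i - 1) // d\n'
)

inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Crude locator for the doctest marker: token positions whose decoded text
# contains ">" (only the ">>>" prompt does in this snippet).
ids = inputs["input_ids"][0].tolist()
marker_pos = [i for i, t in enumerate(ids) if ">" in tok.decode([t])]

# Attention mass that all positions place on the marker tokens, averaged over
# layers and heads; comparing against the mass on an equally sized set of
# non-marker tokens would yield a ratio like the one the paper reports.
attn = torch.stack(out.attentions)  # (layers, batch, heads, seq, seq)
mass_on_markers = attn[..., marker_pos].mean().item()
print(f"mean attention on marker tokens: {mass_on_markers:.4f}")
```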
Circularity Check
No circularity: empirical measurement study with independent experimental results
full rationale
The paper reports results from a large-scale empirical study (830+ files, 12 models) comparing inline test syntax (Python doctests) vs. separated syntax (Rust #[test]) on a d-ary heap task, using SEGA metrics for determinism, preservation, and correctness. Mechanistic attention analysis and causal knockout/steering experiments are performed on 7 architectures. No derivations, equations, or first-principles claims exist that reduce to fitted parameters or self-referential inputs. No self-citations are load-bearing for the central claim; results are directly measured and validated experimentally. The work is self-contained, does not lean on external benchmarks for its central claim, and nothing in it reduces to its own conclusions by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The SEGA three-dimensional framework (Determinism, Preservation, Correctness) accurately captures the quality of AI-generated code (one plausible reading of the three dimensions is sketched after this list).
- domain assumption Stronger attention to inline test markers indicates a causal mechanism for improved code generation.
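The paper's exact SEGA formulas are not reproduced on this page, so the following is only one plausible reading of the three dimensions, with every function name and example value invented for illustration.

```python
def determinism(outputs: list[str]) -> float:
    """Fraction of repeated generations that match the modal output exactly."""
    if not outputs:
        return 0.0
    modal = max(set(outputs), key=outputs.count)
    return outputs.count(modal) / len(outputs)


def preservation(generated: str, test_markers: list[str]) -> float:
    """Fraction of the prompt's test markers that survive verbatim in the generation."""
    if not test_markers:
        return 1.0
    kept = sum(1 for m in test_markers if m in generated)
    return kept / len(test_markers)


def correctness(passed: int, total: int) -> float:
    """Fraction of the task's reference tests that the generated code passes."""
    return passed / total if total else 0.0


# Hypothetical usage on one model/condition pair.
score = {
    "determinism": determinism(["out_a", "out_a", "out_b"]),
    "preservation": preservation("def f():\n    '>>> f()'", [">>> f()"]),
    "correctness": correctness(11, 12),
}
```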