pith. machine review for the scientific record.

arxiv: 2604.19826 · v1 · submitted 2026-04-20 · 💻 cs.SE · cs.AI · cs.LG

Recognition: unknown

Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:17 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG
keywords AI code generation · test syntax structure · co-located tests · foundation models · code preservation · attention mechanisms · empirical evaluation · Python doctests

The pith

Co-locating tests with implementation code produces measurably better AI-generated code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether test syntax structure affects the quality of code generated by foundation models. It compares inline test syntax, such as Python doctests placed directly in the code, against separated syntax like Rust #[test] blocks on the same d-ary heap task. Across 12 models and hundreds of generations, inline tests consistently yield near-perfect preservation of the test structure and high correctness rates, while separated tests show wide gaps between models and decoupling of preservation from correctness. Mechanistic analysis finds that inline test markers receive substantially stronger attention in most architectures, with causal experiments confirming the effect. This positions test syntax choice as a practical software design decision when using AI coding tools.
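
To make the contrast concrete: a minimal Python sketch of the two styles for a toy helper (the paper's actual task is a full d-ary heap implementation; this function and its tests are illustrative only, not the study's prompts).

# Inline style: the tests live in the docstring as doctests, so the model sees
# implementation and expected behavior in one co-located block.
def heap_parent(i: int, d: int) -> int:
    """Return the parent index of node i in a d-ary heap.

    >>> heap_parent(4, 2)
    1
    >>> heap_parent(9, 3)
    2
    """
    return (i - 1) // d

# Separated style: the same checks sit in a standalone test function, analogous
# to Rust tests kept apart from the implementation in #[test] blocks.
def test_heap_parent():
    assert heap_parent(4, 2) == 1
    assert heap_parent(9, 3) == 2

if __name__ == "__main__":
    import doctest
    doctest.testmod()    # runs the inline doctests
    test_heap_parent()   # runs the separated test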

Core claim

In the Foundation Model era, test syntax structure is a software design concern: co-locating tests with implementation code produces measurably better AI-generated code. A large-scale comparison of inline (doctests) versus separated (#[test]) syntax on d-ary heap implementations shows inline tests achieving 100% preservation and 92-100% correctness across models, while separated tests expose model-tier differences and independence between preservation and correctness. Attention analysis on seven open architectures reveals 2.8-4.4× stronger focus on co-located test markers, validated by knockout and steering experiments, and the pattern holds in a non-transformer gated-linear RNN.

What carries the argument

The co-location of test syntax markers directly with implementation code, which draws stronger attention weights in foundation models during generation.

If this is right

  • Inline test syntax delivers 100% preservation and 92-100% correctness across all tested models.
  • Separated test syntax produces large correctness gaps (0-100%) that track model capability tiers.
  • Preservation and correctness become independent under separated syntax but remain coupled under inline syntax.
  • Inline test markers receive 2.8-4.4× stronger attention in five of seven examined architectures (a measurement sketch follows this list).
  • The co-location benefit appears in a gated-linear RNN as well as transformers, indicating robustness beyond current dominant architectures.
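
A minimal sketch of how the marker-attention comparison could be probed on an open checkpoint, assuming a HuggingFace model loaded with eager attention and a fast tokenizer; the model name, the two toy prompts, and the marker strings below are placeholders, not the paper's materials.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any small open code model would serve for the sketch.
MODEL = "Qwen/Qwen2.5-Coder-1.5B"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, attn_implementation="eager")
model.eval()

def marker_attention_share(prompt: str, marker: str) -> float:
    """Average attention mass (over layers, heads, and query positions)
    that lands on the tokens covering `marker` inside `prompt`."""
    enc = tok(prompt, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    start = prompt.index(marker)
    end = start + len(marker)
    marker_tokens = [i for i, (a, b) in enumerate(offsets) if a < end and b > start]
    with torch.no_grad():
        out = model(**enc, output_attentions=True)
    # out.attentions: one (batch, heads, query, key) tensor per layer.
    att = torch.stack(out.attentions).mean(dim=(0, 1, 2, 3))  # distribution over key positions
    return att[marker_tokens].sum().item()

# Hypothetical inline vs. separated prompts for the same tiny function.
inline = 'def square(x):\n    """\n    >>> square(2)\n    4\n    """\n    return x * x\n'
separated = 'def square(x):\n    return x * x\n\ndef test_square():\n    assert square(2) == 4\n'
ratio = marker_attention_share(inline, ">>>") / marker_attention_share(separated, "test_square")
print(f"inline/separated marker attention share ratio: {ratio:.2f}")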

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Language designers could prioritize inline-test-friendly syntax to improve future AI coding performance.
  • Prompt techniques that simulate co-location without code changes might capture some of the same gains (a toy prompt-construction sketch follows this list).
  • The effect could be tested on other tasks such as refactoring or security review to check generality.
  • Teams using AI assistants might adopt inline tests as a lightweight practice when the target language supports them.
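
As a rough illustration of the prompt-technique idea above: a prompt can present separated tests as if they were co-located without changing the code that is actually committed. The helper below is a hypothetical sketch, not a technique described in the paper.

def build_prompts(impl: str, examples: list[tuple[str, str]]) -> dict[str, str]:
    """Render the same task two ways: examples injected into the docstring
    (simulated co-location) versus examples appended as a separate test block."""
    header, body = impl.split("\n", 1)
    doctest_block = "\n".join(f"    >>> {call}\n    {expected}" for call, expected in examples)
    inline = f'{header}\n    """\n{doctest_block}\n    """\n{body}'
    assertions = "\n".join(f"assert {call} == {expected}" for call, expected in examples)
    separated = f"{impl}\n\n# tests, kept apart from the implementation\n{assertions}\n"
    return {"inline": inline, "separated": separated}

impl = "def heap_parent(i, d):\n    return (i - 1) // d"
prompts = build_prompts(impl, [("heap_parent(4, 2)", "1"), ("heap_parent(9, 3)", "2")])
print(prompts["inline"])
print(prompts["separated"])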

Load-bearing premise

Observed differences in preservation and correctness stem from test syntax co-location rather than prompt length, token distribution, or model training biases.

What would settle it

Equalizing prompt lengths and token distributions for inline and separated test versions of the same task, then checking whether the large gaps in preservation and correctness disappear.
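
One concrete starting point for that check, sketched under the assumption that the two prompt files exist locally; the file names and tokenizer checkpoint are placeholders, and each studied model's own tokenizer would need the same treatment.

from transformers import AutoTokenizer

# Placeholder tokenizer checkpoint for the sketch.
tok = AutoTokenizer.from_pretrained("bigcode/starcoder2-3b")

def token_count(text: str) -> int:
    return len(tok(text, add_special_tokens=False)["input_ids"])

inline_prompt = open("prompt_inline_doctest.py").read()      # hypothetical file
separated_prompt = open("prompt_separated_test.rs").read()   # hypothetical file

n_inline, n_sep = token_count(inline_prompt), token_count(separated_prompt)
print(f"inline: {n_inline} tokens · separated: {n_sep} tokens · ratio {n_inline / n_sep:.2f}")
# If the counts differ materially, pad or trim the shorter prompt (e.g. with
# neutral comments) until they match, then re-run the generation experiments.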

read the original abstract

AI coding assistants increasingly generate code alongside tests. How developers structure test code, whether inline with the implementation or in separate blocks, has traditionally been a matter of testing philosophy. We investigate whether this choice affects AI code generation quality. We conduct a large-scale empirical study (830+ generated files, 12 models, 3 providers) using SEGA, a three-dimensional evaluation framework measuring Determinism, Preservation, and Correctness. Comparing inline test syntax (Python doctests) against separated test syntax (Rust #[test] blocks) on a d-ary heap implementation, we find that: (1) inline tests yield near-perfect preservation (100%) and correctness (92-100%) across all models; (2) separated tests expose stark model-tier gaps (0-100% correctness) and independence between preservation and correctness; (3) model behavior evolves across generations, and notably one model breaks the test suppression pattern of its three predecessors; (4) mechanistic analysis on 7 open-source architectures (6 transformers and a gated-linear Recurrent Neural Network (RNN)) reveals inline test markers receive 2.8-4.4× stronger attention in 5/7 models, with causal validation via knockout and steering experiments on the 4 code-specialized transformers and RWKV-6; the co-location mechanism extends to a non-transformer architecture, suggesting the design recommendation is robust to future architectural shifts. In the Foundation Model era, test syntax structure is a software design concern: co-locating tests with implementation code produces measurably better AI-generated code. This arxiv long version includes appendices that further qualify the effect as bounded by both model capability and programming language.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that in the foundation model era, test syntax structure is a software design concern: co-locating tests with implementation code (inline syntax such as Python doctests) produces measurably better AI-generated code than separated test syntax (e.g., Rust #[test] blocks). This is supported by a large-scale empirical study using the SEGA framework on 830+ generated files across 12 models from 3 providers for a d-ary heap task, showing near-perfect preservation (100%) and high correctness (92-100%) for inline tests versus stark model-tier gaps and independence of preservation and correctness for separated tests. The claim is further backed by mechanistic attention analysis (2.8-4.4× stronger attention to inline markers in 5/7 models) with causal validation via knockout and steering experiments on 4 code-specialized transformers and RWKV-6, extending to a non-transformer architecture.

Significance. If the central empirical result holds after isolating the operative variable, the work is significant for identifying a previously under-appreciated software design lever that affects foundation model code generation quality. It is grounded in a large-scale design (830+ files, 12 models) with attention to determinism/preservation/correctness via SEGA, plus mechanistic causal interventions (knockout and steering) across transformer and non-transformer architectures, and explicit qualification that the effect is bounded by model capability and language. This combination of scale, multi-architecture validation, and falsifiable design recommendation strengthens its potential impact on AI-assisted development practices.

major comments (2)
  1. [Abstract, Results, and Appendices] Abstract and central empirical comparison: the study pits Python doctests (inline, co-located) against Rust #[test] blocks (separated) on the identical d-ary heap task and attributes differences in preservation (near-100% inline) and correctness (92-100% inline vs 0-100% separated) to test syntax co-location. This design confounds syntax structure with language-specific factors including tokenization, prompt length distributions, and pre-training data exposure; no within-language controls (e.g., separate-test Python or inline-test Rust variants) are reported to isolate the claimed mechanism, which is load-bearing for the conclusion that 'test syntax structure is a software design concern'.
  2. [Mechanistic Analysis and Appendices] Mechanistic analysis section: the attention, knockout, and steering results (2.8-4.4x stronger attention to inline markers; causal validation on 4 transformers + RWKV-6) are performed across models but inherit the same Python-vs-Rust language variable as the main experiments. The appendices note language bounds yet the main claims do not provide language-matched controls or quantify how much of the attention differential survives within-language comparison, weakening the robustness argument for future architectures.
minor comments (2)
  1. [Methods] Methods section: provide explicit details on prompt construction, token counts per condition, and any data exclusion rules or statistical controls for model-specific biases to allow readers to assess residual confounds.
  2. [Results] Figure and table clarity: ensure that preservation/correctness plots clearly label language and syntax conditions side-by-side and include confidence intervals or per-model breakdowns to support the 'model-tier gaps' claim (a small confidence-interval sketch follows this list).
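
On the confidence-interval request: a standard Wilson score interval per model and condition would be enough to support or soften the 'model-tier gaps' reading. The tallies below are hypothetical, not figures from the paper.

from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# Hypothetical per-model tallies: (correct generations, total generations).
per_model = {"model-a": (46, 50), "model-b": (0, 50), "model-c": (50, 50)}
for name, (k, n) in per_model.items():
    lo, hi = wilson_ci(k, n)
    print(f"{name}: {k}/{n} correct, 95% CI [{lo:.2f}, {hi:.2f}]")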

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. We address each major comment point by point below, providing the strongest honest defense of the manuscript while agreeing to clarifications where the design leaves room for qualification.

read point-by-point responses
  1. Referee: [Abstract, Results, and Appendices] Abstract and central empirical comparison: the study pits Python doctests (inline, co-located) against Rust #[test] blocks (separated) on the identical d-ary heap task and attributes differences in preservation (near-100% inline) and correctness (92-100% inline vs 0-100% separated) to test syntax co-location. This design confounds syntax structure with language-specific factors including tokenization, prompt length distributions, and pre-training data exposure; no within-language controls (e.g., separate-test Python or inline-test Rust variants) are reported to isolate the claimed mechanism, which is load-bearing for the conclusion that 'test syntax structure is a software design concern'.

    Authors: We acknowledge the cross-language comparison introduces potential confounds such as tokenization and pre-training exposure. However, the design uses the idiomatic, standard syntax for each language to ensure the comparison reflects actual developer practice rather than artificial constructs. Python doctests are the canonical inline mechanism, while Rust #[test] is the standard separated form; non-idiomatic variants (e.g., separate Python tests or inline Rust) would not test the syntax structures as they are used. The SEGA framework holds the task constant and directly measures preservation of the test syntax itself, yielding near-100% preservation for inline across all models. The appendices already bound results by language. We will partially revise the abstract, results, and discussion to more explicitly frame the comparison as between these representative syntax structures and to note that full within-language isolation is left for future work. The core empirical claims and data remain unchanged. revision: partial

  2. Referee: [Mechanistic Analysis and Appendices] Mechanistic analysis section: the attention, knockout, and steering results (2.8-4.4x stronger attention to inline markers; causal validation on 4 transformers + RWKV-6) are performed across models but inherit the same Python-vs-Rust language variable as the main experiments. The appendices note language bounds yet the main claims do not provide language-matched controls or quantify how much of the attention differential survives within-language comparison, weakening the robustness argument for future architectures.

    Authors: The mechanistic analysis targets attention specifically to the test markers (doctest prompts vs. #[test] attributes) rather than general language features. These markers receive 2.8-4.4x stronger attention in 5/7 models, with knockout and steering providing causal evidence of their role in generation quality. The RWKV-6 result further supports extension beyond transformers. While we agree language-matched controls would strengthen isolation, they would require non-standard syntax variants not representative of the co-location mechanism under study. The appendices already qualify language bounds. We will partially revise the mechanistic section to add explicit quantification of marker-specific attention differentials and expanded caveats on generalizability, without altering the reported attention ratios, causal results, or main conclusions. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical measurement study with independent experimental results

full rationale

The paper reports results from a large-scale empirical study (830+ files, 12 models) comparing inline test syntax (Python doctests) vs. separated syntax (Rust #[test]) on a d-ary heap task, using SEGA metrics for determinism, preservation, and correctness. Mechanistic attention analysis and causal knockout/steering experiments are performed on 7 architectures. No derivations, equations, or first-principles claims exist that reduce to fitted parameters or self-referential inputs. No self-citations are load-bearing for the central claim; results are directly measured and validated experimentally. The work is self-contained against external benchmarks with no reduction by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the SEGA framework validly measures code quality and that attention differences causally explain generation outcomes; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption The SEGA three-dimensional framework (Determinism, Preservation, Correctness) accurately captures the quality of AI-generated code.
    Used as the primary evaluation metric for all 830+ generated files.
  • domain assumption Stronger attention to inline test markers indicates a causal mechanism for improved code generation.
    Supported by knockout and steering experiments but remains an interpretive link.

pith-pipeline@v0.9.0 · 5610 in / 1417 out tokens · 48126 ms · 2026-05-10T04:17:03.110971+00:00 · methodology

discussion (0)


    Shuyan Zhou, Uri Alon, Frank F. Xu, Zhiruo Wang, Zhengbao Jiang, and Graham Neubig. 2023. DocPrompting: Generating Code by Retrieving the Docs. InPro- ceedings of the 11th International Conference on Learning Representations (ICLR). arXiv:2207.05987. 10 Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation AI...