The Invisible Lottery: How Subtle Cues Steer Algorithm Choice in LLM Code Generation

Akanksha Narula; Laurent Bindschaedler; Mofasshara Binte Rafique

arxiv: 2606.04057 · v1 · pith:JWNWGB7Nnew · submitted 2026-06-02 · 💻 cs.SE · cs.AI· cs.LG

The Invisible Lottery: How Subtle Cues Steer Algorithm Choice in LLM Code Generation

Akanksha Narula , Mofasshara Binte Rafique , Laurent Bindschaedler This is my paper

Pith reviewed 2026-06-28 08:56 UTC · model grok-4.3

classification 💻 cs.SE cs.AIcs.LG

keywords LLM code generationprompt sensitivityalgorithm choicealgorithm steeringprompt cuessoftware engineeringinvisible lottery

0 comments

The pith

Incidental prompt cues steer which algorithm LLMs select for coding tasks, even when all outputs pass the same tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that contextual words or metadata outside the core task specification can influence the algorithm family an LLM chooses when generating code. It demonstrates this through 46,535 controlled experiments across 11 tasks and 19 cue types, revealing shifts in algorithm distributions as large as 100 percentage points that track the meaning of the cues. This creates an unpredictable selection process over code properties like performance and security. Direct naming of the desired algorithm emerges as the strongest tested control on the choice.

Core claim

Incidental prompt cues steer algorithm choice under fixed correctness, producing systematic shifts in algorithm-family distributions up to 100 percentage points that align with cue semantics across tasks including rate limiting, with direct algorithm naming as the most reliable mitigation tested.

What carries the argument

Algorithm steering, defined as cue-induced shifts in algorithm-family distributions, quantified via controlled prompt variations and code parsing across models.

If this is right

Accidental context in prompts creates an invisible lottery over performance, security, and maintainability of generated code.
Direct algorithm naming in the prompt reliably reduces unwanted steering across the tested models and tasks.
Shifts appear in applied settings such as rate limiting and remain consistent with cue meaning rather than surface form.
The effect holds across 15 model configurations and persists even when all generated solutions pass identical tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Prompt design guidelines for code generation may need explicit rules against incidental context that could bias algorithm selection.
Developers using LLMs for production code could benefit from always specifying the target algorithm family to remove hidden variability.
Future model training could incorporate techniques to reduce sensitivity to non-task cues in algorithm choice.

Load-bearing premise

Observed distribution shifts are driven by the semantic content of the cues rather than sampling noise, parsing differences, or task ambiguity, and algorithm families can be identified consistently from generated code.

What would settle it

Repeating the full experiment suite with fixed sampling seeds and an independent code classifier to test whether the shifts remain when cue semantics are isolated from other variables.

Figures

Figures reproduced from arXiv: 2606.04057 by Akanksha Narula, Laurent Bindschaedler, Mofasshara Binte Rafique.

**Figure 1.** Figure 1: Four representative steering cases. Each group holds the task specification fixed and compares the max–min cue pair within that task–channel; bars report the selected algorithm-family rate. Persona Context Import Dependency Benchmark Code Review Constraint Test Complexity Style Docstring Style Comment Density Variable Naming Error Handling Input Format Temporal Placebo Innocuous Interaction Expr. Parsing M… view at source ↗

**Figure 2.** Figure 2: Maximum steering delta (pp) per task–channel pair. Each cell is the largest algorithm-family swing observed when varying cue values within that channel. Appendix [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Representative steering examples showing algorithm-family distributions before (baseline) and after a cue. Appendix B.2 gives the full per-cue distributions. der junior, with a corresponding pass rate penalty (80% vs. 100%; Appendix [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Model divergence under identical cues. Each cell shows a target algorithm’s selection rate for one model–cue pair; pass rate is annotated when below 85%. Columns span three tasks and six cue conditions. opposite directions across models. Finding. Steering generalizes across models and tested temperatures; direction varies by model. Quantization and Reasoning Quantization modestly attenuates steering: Dee… view at source ↗

**Figure 5.** Figure 5: Baseline-referenced steering across all task–channel pairs. Each cell reports the largest |∆| in pp relative to the none baseline across cue values within the channel; [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

**Figure 7.** Figure 7: Sophistication–reliability diagnostic across task–cue conditions. Each point is one condition; the x-axis is the share of outputs selecting a task-specific sophisticated family (e.g., matrix exponentiation for memoization, eval for expression parsing), and the y-axis is pass rate. The aggregate Pearson correlation is near zero (r≈0.09), masking task-specific tradeoffs (memoization with academic personas re… view at source ↗

read the original abstract

Large language models (LLMs) now generate substantial production code, often for tasks with multiple valid algorithmic solutions. Incidental prompt cues, meaning contextual words or metadata outside the task specification, can steer which algorithm the model selects, even when all outputs pass the same tests. Prompt sensitivity is well studied as a tool to improve output quality. Here, output policy means algorithm choice under fixed correctness. We define algorithm steering as cue-induced shifts in algorithm-family distributions and run 46,535 controlled experiments across 11 tasks, 19 cue types (18 channels plus a memoization semantic-vs-surface ablation that preserves meaning while changing typography and punctuation), and 15 model configurations. We find large, systematic shifts in algorithm-family distributions (up to 100 pp), largely consistent with cue semantics, including in applied tasks such as rate limiting. Direct algorithm naming is the most reliable mitigation we tested. Accidental context therefore creates an "invisible lottery" over performance, security, and maintainability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows large cue-driven shifts in LLM algorithm families for code tasks across 46k runs, but the classification step is underspecified.

read the letter

The core finding is that incidental prompt cues can push LLMs toward different algorithm families even when all outputs pass the same tests, with shifts reaching 100 percentage points. They ran this across 11 tasks, 19 cue types, and 15 model setups, including a semantic-vs-surface ablation on memoization cues.

The scale is the strongest part. That many controlled runs gives the distribution shifts some credibility, and the rate-limiting example shows the issue can appear in applied settings. The direct-naming mitigation is simple enough to check in practice.

The soft spot is the algorithm-family labeling. The abstract does not describe how code was mapped to families, whether the process was blinded to the cue, or how they validated consistency across tasks. If labeling relied on surface features or was done without controls, the reported shifts could partly reflect the measurement rather than the model. No mention of error bars, multiple-testing correction, or baseline variance either.

This is for people working on LLM code reliability and prompt effects. A reader who needs to understand non-determinism beyond test passing would find the empirical pattern useful. It deserves peer review because the question is concrete and the experiment count is substantial, though the methods section will need to address the classification pipeline directly.

Referee Report

3 major / 1 minor

Summary. The paper claims that incidental prompt cues (contextual words or metadata outside the task spec) steer LLMs toward different algorithm families in code generation tasks, even when all outputs pass the same tests. It reports results from 46,535 controlled experiments across 11 tasks, 19 cue types, and 15 model configurations, finding systematic shifts in algorithm-family distributions of up to 100 percentage points that align with cue semantics; direct algorithm naming is identified as the most reliable mitigation.

Significance. If the empirical results hold after addressing classification and statistical concerns, the work would be significant for the field of LLM code generation. It provides a large-scale demonstration of how non-task prompt elements influence output policy (algorithm choice under fixed correctness), with direct implications for reproducibility, security, and maintainability in applied settings such as rate limiting. The scale of the experiment set and the inclusion of a semantic-vs-surface ablation are strengths.

major comments (3)

[Abstract / Methods] The abstract and methods description provide no information on the procedure used to classify generated code into algorithm families. This classification step is load-bearing for the central claim of cue-induced shifts, yet no inter-rater protocol, blinding procedure, automated classifier validation, or consistency metrics across tasks are reported.
[Results / Experimental Setup] No details are given on statistical controls, error bars, multiple-testing corrections, or how baseline distributions (without cues) were established and compared. The reported shifts of up to 100 pp cannot be evaluated for robustness without this information.
[Discussion] The weakest assumption—that observed distribution shifts are caused by cue semantics rather than uncontrolled variables in sampling, parsing, or task ambiguity—is not tested. The paper does not report any sensitivity analysis or validation that the family labels are independent of the incidental cue.

minor comments (1)

[Introduction] The term 'output policy' is introduced without a formal definition or contrast to standard usage in the LLM literature.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important areas for clarification on classification, statistics, and causal attribution. We address each point below and have revised the manuscript to strengthen these aspects.

read point-by-point responses

Referee: [Abstract / Methods] The abstract and methods description provide no information on the procedure used to classify generated code into algorithm families. This classification step is load-bearing for the central claim of cue-induced shifts, yet no inter-rater protocol, blinding procedure, automated classifier validation, or consistency metrics across tasks are reported.

Authors: We agree the original submission under-described the classification pipeline. The revised manuscript adds a dedicated subsection in Methods that defines the algorithm-family taxonomy from standard references, details an automated AST-based classifier with keyword heuristics, reports manual validation on a 10% stratified sample, provides inter-annotator agreement (Fleiss' kappa = 0.81), and describes blinding procedures in which annotators were unaware of cue conditions. A new supplementary table reports per-task consistency metrics. revision: yes
Referee: [Results / Experimental Setup] No details are given on statistical controls, error bars, multiple-testing corrections, or how baseline distributions (without cues) were established and compared. The reported shifts of up to 100 pp cannot be evaluated for robustness without this information.

Authors: We have expanded the Results and Experimental Setup sections to include bootstrap 95% confidence intervals on all reported distribution shifts, Holm-Bonferroni correction across the 11 tasks and 19 cue types, and a precise description of baseline construction (minimum 4,000 no-cue trials per task, with explicit comparison via chi-squared tests). All figures now display error bars and corrected p-values. revision: yes
Referee: [Discussion] The weakest assumption—that observed distribution shifts are caused by cue semantics rather than uncontrolled variables in sampling, parsing, or task ambiguity—is not tested. The paper does not report any sensitivity analysis or validation that the family labels are independent of the incidental cue.

Authors: The original manuscript already contains the memoization semantic-vs-surface ablation, which isolates semantics while holding surface form constant. We have added further sensitivity analyses in the revised Discussion: (i) re-running subsets at varied temperatures and top-p values yields stable shifts; (ii) task rephrasings that preserve semantics but alter surface wording produce unchanged family distributions; (iii) an explicit check confirms cue tokens do not alter the AST parser or family-labeling logic. These results support attribution to cue semantics. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurement of prompt effects

full rationale

The paper reports results from 46,535 controlled experiments measuring shifts in algorithm-family distributions induced by incidental prompt cues. No equations, fitted models, predictions, or first-principles derivations appear in the provided text; the central claim is an observed empirical pattern across tasks and models. Algorithm-family labeling is a methodological classification step whose consistency is not secured by any self-referential definition or self-citation chain. No load-bearing self-citations, ansatzes, or renamings of known results are present. The work is therefore self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper is purely empirical and introduces no new free parameters, axioms beyond standard statistical inference, or invented entities. It relies on the background assumption that algorithm families are identifiable and that cue semantics can be isolated.

axioms (2)

domain assumption Algorithm families can be reliably and consistently identified from generated code outputs across different tasks
Required to convert raw model outputs into the reported distribution shifts
standard math Statistical significance testing can establish that observed shifts are due to the tested cues
Implicit in claims of large, systematic shifts

pith-pipeline@v0.9.1-grok · 5716 in / 1431 out tokens · 23344 ms · 2026-06-28T08:56:39.257676+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 19 canonical work pages · 8 internal anchors

[1]

Code2vec: Learning distributed representations of code,

Alon, U., Zilberstein, M., Levy, O., and Yahav, E. code2vec : Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3 0 (POPL): 0 40:1--40:29, January 2019. doi:10.1145/3290353

work page doi:10.1145/3290353 2019
[2]

Program Synthesis with Large Language Models

Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., and Sutton, C. Program synthesis with large language models, 2021. URL https://arxiv.org/abs/2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Grounded copilot: How programmers interact with code-generating models,

Barke, S., James, M. B., and Polikarpova, N. Grounded Copilot : How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages, 7 0 (OOPSLA1): 0 85--111, April 2023. doi:10.1145/3586030

work page doi:10.1145/3586030 2023
[4]

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

N., Li, T., Li, D., Zhu, B., Zhang, H., Jordan, M., Gonzalez, J

Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhu, B., Zhang, H., Jordan, M., Gonzalez, J. E., and Stoica, I. Chatbot arena: An open platform for evaluating LLMs by human preference. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp.\ 8359--8388. ...

2024
[6]

S., Reid, M., Matsuo, Y., and Iwasawa, Y

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, pp.\ 22199--22213. Curran Associates, Inc., 2022

2022
[7]

H., Wang, C., Huang, J.-T., and Lyu, M

Lam, M. H., Wang, C., Huang, J.-T., and Lyu, M. R. CodeCrash : Exposing LLM fragility to misleading natural language in code reasoning. In Advances in Neural Information Processing Systems, volume 38, pp.\ 120782--120828. Curran Associates, Inc., 2025

2025
[8]

How diversely can language models solve problems? exploring the algorithmic diversity of model-generated code

Lee, S., Chon, H., Jang, J., Lee, D., and Yu, H. How diversely can language models solve problems? exploring the algorithmic diversity of model-generated code. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp.\ 152--167, Suzhou, China, November 2025. Association for Computational Linguistics. doi:10.18653/v1/2025.findings-emnlp.10

work page doi:10.18653/v1/2025.findings-emnlp.10 2025
[9]

S., Wang, Y., and Zhang, L

Liu, J., Xia, C. S., Wang, Y., and Zhang, L. Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation. In Advances in Neural Information Processing Systems, volume 36, pp.\ 21558--21572. Curran Associates, Inc., 2023

2023
[10]

and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12: 0 157--173, 2024. doi:10.1162/tacl_a_00638

work page doi:10.1162/tacl_a_00638 2024
[11]

In: Proc

Mozannar, H., Bansal, G., Fourney, A., and Horvitz, E. Reading between the lines: Modeling user behavior and costs in AI -assisted programming. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24). Association for Computing Machinery, 2024. doi:10.1145/3613904.3641936

work page doi:10.1145/3613904.3641936 2024
[12]

In-context Learning and Induction Heads

Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. In-context learning and inducti...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

Prompting

OpenAI . Prompting. OpenAI API documentation, 2026. URL https://platform.openai.com/docs/guides/prompting

2026
[14]

Gutfleisch, J

Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., and Karri, R. Asleep at the keyboard? assessing the security of GitHub Copilot 's code contributions. In 2022 IEEE Symposium on Security and Privacy (SP), pp.\ 754--768. IEEE, 2022. doi:10.1109/SP46214.2022.9833571

work page doi:10.1109/sp46214.2022.9833571 2022
[15]

Improving reproducibility in machine learning research ( A report from the NeurIPS 2019 reproducibility program)

Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivi \`e re, V., Beygelzimer, A., d'Alch \'e Buc, F., Fox, E., and Larochelle, H. Improving reproducibility in machine learning research ( A report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research, 22 0 (164): 0 1--20, 2021

2019
[16]

Rice, J. R. The algorithm selection problem. In Advances in Computers, volume 15, pp.\ 65--118. Academic Press, 1976. doi:10.1016/S0065-2458(08)60520-3

work page doi:10.1016/s0065-2458(08)60520-3 1976
[17]

Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting

Sclar, M., Choi, Y., Tsvetkov, Y., and Suhr, A. Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. In International Conference on Learning Representations, volume 2024, pp.\ 25055--25083, 2024

2024
[18]

and Zhang, T

Tian, Y. and Zhang, T. Selective prompt anchoring for code generation. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pp.\ 59528--59551. PMLR, 2025

2025
[19]

ACM Trans Softw Eng Methodol 34(8):225:1--225:53, doi:10.1145/3722108, ://doi.org/10.1145/3722108

Tony, C., D \'i az Ferreyra, N. E., Mutas, M., Dhif, S., and Scandariato, R. Prompting techniques for secure code generation: A systematic investigation. ACM Transactions on Software Engineering and Methodology, 34 0 (8), October 2025. doi:10.1145/3722108

work page doi:10.1145/3722108 2025
[20]

Vaithilingam, P., Zhang, T., and Glassman, E. L. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (CHI EA '22). Association for Computing Machinery, 2022. doi:10.1145/3491101.3519665

work page doi:10.1145/3491101.3519665 2022
[21]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Wallace, E., Xiao, K., Leike, R., Weng, L., Heidecke, J., and Beutel, A. The instruction hierarchy: Training LLMs to prioritize privileged instructions, 2024. URL https://arxiv.org/abs/2404.13208

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

Prompt-to-SQL Injections in LLM-Integrated Web Applications: Risks and Defenses ,

Wang, C., Huang, K., Zhang, J., Feng, Y., Zhang, L., Liu, Y., and Peng, X. LLMs meet library evolution: Evaluating deprecated API usage in LLM -based code completion. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering, pp.\ 885--897. IEEE, 2025. doi:10.1109/ICSE55347.2025.00245

work page doi:10.1109/icse55347.2025.00245 2025
[23]

Tree of thoughts: Deliberate problem solving with large language models

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36, pp.\ 11809--11822. Curran Associates, Inc., 2023

2023
[24]

When ``A Helpful Assistant'' Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models

Zheng, M., Pei, J., Logeswaran, L., Lee, M., and Jurgens, D. When ``a helpful assistant'' is not really helpful: Personas in system prompts do not improve performances of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp.\ 15126--15154, Miami, Florida, USA, November 2024. Association for Computational Ling...

work page doi:10.18653/v1/2024.findings-emnlp.888 2024
[25]

and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy , title =

Liu, Nelson F. and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy , title =. Transactions of the Association for Computational Linguistics , volume =. 2024 , publisher =

2024
[26]

Proceedings of the IEEE/ACM 47th International Conference on Software Engineering , pages =

Wang, Chong and Huang, Kaifeng and Zhang, Jian and Feng, Yebo and Zhang, Lyuye and Liu, Yang and Peng, Xin , title =. Proceedings of the IEEE/ACM 47th International Conference on Software Engineering , pages =. 2025 , publisher =

2025
[27]

International Conference on Learning Representations , volume=

Sclar, Melanie and Choi, Yejin and Tsvetkov, Yulia and Suhr, Alane , title =. International Conference on Learning Representations , volume=
[28]

Advances in Neural Information Processing Systems , volume =

Kojima, Takeshi and Gu, Shixiang (Shane) and Reid, Machel and Matsuo, Yutaka and Iwasawa, Yusuke , title =. Advances in Neural Information Processing Systems , volume =. 2022 , publisher =

2022
[29]

Advances in Neural Information Processing Systems , volume =

Yao, Shunyu and Yu, Dian and Zhao, Jeffrey and Shafran, Izhak and Griffiths, Tom and Cao, Yuan and Narasimhan, Karthik , title =. Advances in Neural Information Processing Systems , volume =. 2023 , publisher =

2023
[30]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace and Kai Xiao and Reimar Leike and Lilian Weng and Johannes Heidecke and Alex Beutel , year =. The Instruction Hierarchy: Training. doi:10.48550/arXiv.2404.13208 , url =. 2404.13208 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.13208
[31]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages =

Zheng, Mingqian and Pei, Jiaxin and Logeswaran, Lajanugen and Lee, Moontae and Jurgens, David , title =. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages =. 2024 , month = nov, address =

2024
[32]

Chen, Mark and Tworek, Jerry and Jun, Heewoo and Yuan, Qiming and Pinto, Henrique Ponde de Oliveira and Kaplan, Jared and Edwards, Harri and Burda, Yuri and Joseph, Nicholas and Brockman, Greg and Ray, Alex and Puri, Raul and Krueger, Gretchen and Petrov, Michael and Khlaaf, Heidy and Sastry, Girish and Mishkin, Pamela and Chan, Brooke and Gray, Scott and...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2107.03374 2021
[33]

Program Synthesis with Large Language Models

Austin, Jacob and Odena, Augustus and Nye, Maxwell and Bosma, Maarten and Michalewski, Henryk and Dohan, David and Jiang, Ellen and Cai, Carrie and Terry, Michael and Le, Quoc and Sutton, Charles , title =. 2021 , eprint =. doi:10.48550/arXiv.2108.07732 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2108.07732 2021
[34]

Proceedings of the 42nd International Conference on Machine Learning , series =

Tian, Yuan and Zhang, Tianyi , title =. Proceedings of the 42nd International Conference on Machine Learning , series =. 2025 , publisher =

2025
[35]

Prompting Techniques for Secure Code Generation: A Systematic Investigation , journal =

Tony, Catherine and D. Prompting Techniques for Secure Code Generation: A Systematic Investigation , journal =. 2025 , month = oct, publisher =

2025
[36]

Findings of the Association for Computational Linguistics: EMNLP 2025 , pages =

Lee, Seonghyeon and Chon, HeeJae and Jang, Joonwon and Lee, Dongha and Yu, Hwanjo , title =. Findings of the Association for Computational Linguistics: EMNLP 2025 , pages =. 2025 , month = nov, address =

2025
[37]

, title =

Rice, John R. , title =. Advances in Computers , volume =. 1976 , publisher =

1976
[38]

, title =

Vaithilingam, Priyan and Zhang, Tianyi and Glassman, Elena L. , title =. Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (CHI EA '22) , articleno =. 2022 , publisher =

2022
[39]

and Polikarpova, Nadia , title =

Barke, Shraddha and James, Michael B. and Polikarpova, Nadia , title =. Proceedings of the ACM on Programming Languages , volume =. 2023 , month = apr, publisher =

2023
[40]

Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24) , articleno =

Mozannar, Hussein and Bansal, Gagan and Fourney, Adam and Horvitz, Eric , title =. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24) , articleno =. 2024 , publisher =

2024
[41]

Proceedings of the ACM on Programming Languages , volume =

Alon, Uri and Zilberstein, Meital and Levy, Omer and Yahav, Eran , title =. Proceedings of the ACM on Programming Languages , volume =. 2019 , month = jan, publisher =

2019
[42]

In-context Learning and Induction Heads

Olsson, Catherine and Elhage, Nelson and Nanda, Neel and Joseph, Nicholas and DasSarma, Nova and Henighan, Tom and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and Drain, Dawn and Ganguli, Deep and Hatfield-Dodds, Zac and Hernandez, Danny and Johnston, Scott and Jones, Andy and Kernion, Jackson and Lovitt, Liane and Ndousse...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2209.11895 2022
[43]

2022 IEEE Symposium on Security and Privacy (SP) , pages =

Pearce, Hammond and Ahmad, Baleegh and Tan, Benjamin and Dolan-Gavitt, Brendan and Karri, Ramesh , title =. 2022 IEEE Symposium on Security and Privacy (SP) , pages =. 2022 , publisher =

2022
[44]

2026 , url =

Prompting , howpublished =. 2026 , url =

2026
[45]

Improving Reproducibility in Machine Learning Research (

Pineau, Joelle and Vincent-Lamarre, Philippe and Sinha, Koustuv and Larivi. Improving Reproducibility in Machine Learning Research (. Journal of Machine Learning Research , volume =
[46]

and Stoica, Ion , title =

Chiang, Wei-Lin and Zheng, Lianmin and Sheng, Ying and Angelopoulos, Anastasios Nikolas and Li, Tianle and Li, Dacheng and Zhu, Banghua and Zhang, Hao and Jordan, Michael and Gonzalez, Joseph E. and Stoica, Ion , title =. Proceedings of the 41st International Conference on Machine Learning , series =. 2024 , publisher =

2024
[47]

Advances in Neural Information Processing Systems , volume =

Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming , title =. Advances in Neural Information Processing Systems , volume =. 2023 , publisher =

2023
[48]

, title =

Lam, Man Ho and Wang, Chaozheng and Huang, Jen-Tse and Lyu, Michael R. , title =. Advances in Neural Information Processing Systems , volume =. 2025 , pages=

2025

[1] [1]

Code2vec: Learning distributed representations of code,

Alon, U., Zilberstein, M., Levy, O., and Yahav, E. code2vec : Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3 0 (POPL): 0 40:1--40:29, January 2019. doi:10.1145/3290353

work page doi:10.1145/3290353 2019

[2] [2]

Program Synthesis with Large Language Models

Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., and Sutton, C. Program synthesis with large language models, 2021. URL https://arxiv.org/abs/2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021

[3] [3]

Grounded copilot: How programmers interact with code-generating models,

Barke, S., James, M. B., and Polikarpova, N. Grounded Copilot : How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages, 7 0 (OOPSLA1): 0 85--111, April 2023. doi:10.1145/3586030

work page doi:10.1145/3586030 2023

[4] [4]

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[5] [5]

N., Li, T., Li, D., Zhu, B., Zhang, H., Jordan, M., Gonzalez, J

Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhu, B., Zhang, H., Jordan, M., Gonzalez, J. E., and Stoica, I. Chatbot arena: An open platform for evaluating LLMs by human preference. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp.\ 8359--8388. ...

2024

[6] [6]

S., Reid, M., Matsuo, Y., and Iwasawa, Y

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, pp.\ 22199--22213. Curran Associates, Inc., 2022

2022

[7] [7]

H., Wang, C., Huang, J.-T., and Lyu, M

Lam, M. H., Wang, C., Huang, J.-T., and Lyu, M. R. CodeCrash : Exposing LLM fragility to misleading natural language in code reasoning. In Advances in Neural Information Processing Systems, volume 38, pp.\ 120782--120828. Curran Associates, Inc., 2025

2025

[8] [8]

How diversely can language models solve problems? exploring the algorithmic diversity of model-generated code

Lee, S., Chon, H., Jang, J., Lee, D., and Yu, H. How diversely can language models solve problems? exploring the algorithmic diversity of model-generated code. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp.\ 152--167, Suzhou, China, November 2025. Association for Computational Linguistics. doi:10.18653/v1/2025.findings-emnlp.10

work page doi:10.18653/v1/2025.findings-emnlp.10 2025

[9] [9]

S., Wang, Y., and Zhang, L

Liu, J., Xia, C. S., Wang, Y., and Zhang, L. Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation. In Advances in Neural Information Processing Systems, volume 36, pp.\ 21558--21572. Curran Associates, Inc., 2023

2023

[10] [10]

and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12: 0 157--173, 2024. doi:10.1162/tacl_a_00638

work page doi:10.1162/tacl_a_00638 2024

[11] [11]

In: Proc

Mozannar, H., Bansal, G., Fourney, A., and Horvitz, E. Reading between the lines: Modeling user behavior and costs in AI -assisted programming. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24). Association for Computing Machinery, 2024. doi:10.1145/3613904.3641936

work page doi:10.1145/3613904.3641936 2024

[12] [12]

In-context Learning and Induction Heads

Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. In-context learning and inducti...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[13] [13]

Prompting

OpenAI . Prompting. OpenAI API documentation, 2026. URL https://platform.openai.com/docs/guides/prompting

2026

[14] [14]

Gutfleisch, J

Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., and Karri, R. Asleep at the keyboard? assessing the security of GitHub Copilot 's code contributions. In 2022 IEEE Symposium on Security and Privacy (SP), pp.\ 754--768. IEEE, 2022. doi:10.1109/SP46214.2022.9833571

work page doi:10.1109/sp46214.2022.9833571 2022

[15] [15]

Improving reproducibility in machine learning research ( A report from the NeurIPS 2019 reproducibility program)

Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivi \`e re, V., Beygelzimer, A., d'Alch \'e Buc, F., Fox, E., and Larochelle, H. Improving reproducibility in machine learning research ( A report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research, 22 0 (164): 0 1--20, 2021

2019

[16] [16]

Rice, J. R. The algorithm selection problem. In Advances in Computers, volume 15, pp.\ 65--118. Academic Press, 1976. doi:10.1016/S0065-2458(08)60520-3

work page doi:10.1016/s0065-2458(08)60520-3 1976

[17] [17]

Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting

Sclar, M., Choi, Y., Tsvetkov, Y., and Suhr, A. Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. In International Conference on Learning Representations, volume 2024, pp.\ 25055--25083, 2024

2024

[18] [18]

and Zhang, T

Tian, Y. and Zhang, T. Selective prompt anchoring for code generation. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pp.\ 59528--59551. PMLR, 2025

2025

[19] [19]

ACM Trans Softw Eng Methodol 34(8):225:1--225:53, doi:10.1145/3722108, ://doi.org/10.1145/3722108

Tony, C., D \'i az Ferreyra, N. E., Mutas, M., Dhif, S., and Scandariato, R. Prompting techniques for secure code generation: A systematic investigation. ACM Transactions on Software Engineering and Methodology, 34 0 (8), October 2025. doi:10.1145/3722108

work page doi:10.1145/3722108 2025

[20] [20]

Vaithilingam, P., Zhang, T., and Glassman, E. L. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (CHI EA '22). Association for Computing Machinery, 2022. doi:10.1145/3491101.3519665

work page doi:10.1145/3491101.3519665 2022

[21] [21]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Wallace, E., Xiao, K., Leike, R., Weng, L., Heidecke, J., and Beutel, A. The instruction hierarchy: Training LLMs to prioritize privileged instructions, 2024. URL https://arxiv.org/abs/2404.13208

work page internal anchor Pith review Pith/arXiv arXiv 2024

[22] [22]

Prompt-to-SQL Injections in LLM-Integrated Web Applications: Risks and Defenses ,

Wang, C., Huang, K., Zhang, J., Feng, Y., Zhang, L., Liu, Y., and Peng, X. LLMs meet library evolution: Evaluating deprecated API usage in LLM -based code completion. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering, pp.\ 885--897. IEEE, 2025. doi:10.1109/ICSE55347.2025.00245

work page doi:10.1109/icse55347.2025.00245 2025

[23] [23]

Tree of thoughts: Deliberate problem solving with large language models

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36, pp.\ 11809--11822. Curran Associates, Inc., 2023

2023

[24] [24]

When ``A Helpful Assistant'' Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models

Zheng, M., Pei, J., Logeswaran, L., Lee, M., and Jurgens, D. When ``a helpful assistant'' is not really helpful: Personas in system prompts do not improve performances of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp.\ 15126--15154, Miami, Florida, USA, November 2024. Association for Computational Ling...

work page doi:10.18653/v1/2024.findings-emnlp.888 2024

[25] [25]

and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy , title =

Liu, Nelson F. and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy , title =. Transactions of the Association for Computational Linguistics , volume =. 2024 , publisher =

2024

[26] [26]

Proceedings of the IEEE/ACM 47th International Conference on Software Engineering , pages =

Wang, Chong and Huang, Kaifeng and Zhang, Jian and Feng, Yebo and Zhang, Lyuye and Liu, Yang and Peng, Xin , title =. Proceedings of the IEEE/ACM 47th International Conference on Software Engineering , pages =. 2025 , publisher =

2025

[27] [27]

International Conference on Learning Representations , volume=

Sclar, Melanie and Choi, Yejin and Tsvetkov, Yulia and Suhr, Alane , title =. International Conference on Learning Representations , volume=

[28] [28]

Advances in Neural Information Processing Systems , volume =

Kojima, Takeshi and Gu, Shixiang (Shane) and Reid, Machel and Matsuo, Yutaka and Iwasawa, Yusuke , title =. Advances in Neural Information Processing Systems , volume =. 2022 , publisher =

2022

[29] [29]

Advances in Neural Information Processing Systems , volume =

Yao, Shunyu and Yu, Dian and Zhao, Jeffrey and Shafran, Izhak and Griffiths, Tom and Cao, Yuan and Narasimhan, Karthik , title =. Advances in Neural Information Processing Systems , volume =. 2023 , publisher =

2023

[30] [30]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace and Kai Xiao and Reimar Leike and Lilian Weng and Johannes Heidecke and Alex Beutel , year =. The Instruction Hierarchy: Training. doi:10.48550/arXiv.2404.13208 , url =. 2404.13208 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.13208

[31] [31]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages =

Zheng, Mingqian and Pei, Jiaxin and Logeswaran, Lajanugen and Lee, Moontae and Jurgens, David , title =. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages =. 2024 , month = nov, address =

2024

[32] [32]

Chen, Mark and Tworek, Jerry and Jun, Heewoo and Yuan, Qiming and Pinto, Henrique Ponde de Oliveira and Kaplan, Jared and Edwards, Harri and Burda, Yuri and Joseph, Nicholas and Brockman, Greg and Ray, Alex and Puri, Raul and Krueger, Gretchen and Petrov, Michael and Khlaaf, Heidy and Sastry, Girish and Mishkin, Pamela and Chan, Brooke and Gray, Scott and...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2107.03374 2021

[33] [33]

Program Synthesis with Large Language Models

Austin, Jacob and Odena, Augustus and Nye, Maxwell and Bosma, Maarten and Michalewski, Henryk and Dohan, David and Jiang, Ellen and Cai, Carrie and Terry, Michael and Le, Quoc and Sutton, Charles , title =. 2021 , eprint =. doi:10.48550/arXiv.2108.07732 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2108.07732 2021

[34] [34]

Proceedings of the 42nd International Conference on Machine Learning , series =

Tian, Yuan and Zhang, Tianyi , title =. Proceedings of the 42nd International Conference on Machine Learning , series =. 2025 , publisher =

2025

[35] [35]

Prompting Techniques for Secure Code Generation: A Systematic Investigation , journal =

Tony, Catherine and D. Prompting Techniques for Secure Code Generation: A Systematic Investigation , journal =. 2025 , month = oct, publisher =

2025

[36] [36]

Findings of the Association for Computational Linguistics: EMNLP 2025 , pages =

Lee, Seonghyeon and Chon, HeeJae and Jang, Joonwon and Lee, Dongha and Yu, Hwanjo , title =. Findings of the Association for Computational Linguistics: EMNLP 2025 , pages =. 2025 , month = nov, address =

2025

[37] [37]

, title =

Rice, John R. , title =. Advances in Computers , volume =. 1976 , publisher =

1976

[38] [38]

, title =

Vaithilingam, Priyan and Zhang, Tianyi and Glassman, Elena L. , title =. Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (CHI EA '22) , articleno =. 2022 , publisher =

2022

[39] [39]

and Polikarpova, Nadia , title =

Barke, Shraddha and James, Michael B. and Polikarpova, Nadia , title =. Proceedings of the ACM on Programming Languages , volume =. 2023 , month = apr, publisher =

2023

[40] [40]

Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24) , articleno =

Mozannar, Hussein and Bansal, Gagan and Fourney, Adam and Horvitz, Eric , title =. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24) , articleno =. 2024 , publisher =

2024

[41] [41]

Proceedings of the ACM on Programming Languages , volume =

Alon, Uri and Zilberstein, Meital and Levy, Omer and Yahav, Eran , title =. Proceedings of the ACM on Programming Languages , volume =. 2019 , month = jan, publisher =

2019

[42] [42]

In-context Learning and Induction Heads

Olsson, Catherine and Elhage, Nelson and Nanda, Neel and Joseph, Nicholas and DasSarma, Nova and Henighan, Tom and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and Drain, Dawn and Ganguli, Deep and Hatfield-Dodds, Zac and Hernandez, Danny and Johnston, Scott and Jones, Andy and Kernion, Jackson and Lovitt, Liane and Ndousse...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2209.11895 2022

[43] [43]

2022 IEEE Symposium on Security and Privacy (SP) , pages =

Pearce, Hammond and Ahmad, Baleegh and Tan, Benjamin and Dolan-Gavitt, Brendan and Karri, Ramesh , title =. 2022 IEEE Symposium on Security and Privacy (SP) , pages =. 2022 , publisher =

2022

[44] [44]

2026 , url =

Prompting , howpublished =. 2026 , url =

2026

[45] [45]

Improving Reproducibility in Machine Learning Research (

Pineau, Joelle and Vincent-Lamarre, Philippe and Sinha, Koustuv and Larivi. Improving Reproducibility in Machine Learning Research (. Journal of Machine Learning Research , volume =

[46] [46]

and Stoica, Ion , title =

Chiang, Wei-Lin and Zheng, Lianmin and Sheng, Ying and Angelopoulos, Anastasios Nikolas and Li, Tianle and Li, Dacheng and Zhu, Banghua and Zhang, Hao and Jordan, Michael and Gonzalez, Joseph E. and Stoica, Ion , title =. Proceedings of the 41st International Conference on Machine Learning , series =. 2024 , publisher =

2024

[47] [47]

Advances in Neural Information Processing Systems , volume =

Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming , title =. Advances in Neural Information Processing Systems , volume =. 2023 , publisher =

2023

[48] [48]

, title =

Lam, Man Ho and Wang, Chaozheng and Huang, Jen-Tse and Lyu, Michael R. , title =. Advances in Neural Information Processing Systems , volume =. 2025 , pages=

2025