pith. machine review for the scientific record.

arxiv: 2605.00382 · v3 · submitted 2026-05-01 · 💻 cs.SE · cs.AI · cs.SI

Recognition: unknown

Social Bias in LLM-Generated Code: Benchmark and Mitigation


Pith reviewed 2026-05-09 19:30 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.SI
keywords: social bias · LLM code generation · fairness in AI · multi-agent systems · code benchmark · bias mitigation · software fairness

The pith

The Fairness Monitor Agent reduces social bias in LLM-generated code by 65.1 percent while improving functional correctness from 75.80 to 83.97 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models produce code that embeds measurable social biases across demographic attributes such as gender, race, and age. On a benchmark of 343 real-world coding tasks, four prominent models show Code Bias Scores as high as 60.58 percent. Common prompt techniques including chain-of-thought reasoning and fairness personas increase rather than decrease bias. Structured multi-agent pipelines lower bias only when early agents explicitly limit which attributes the code may consider. The authors introduce the Fairness Monitor Agent, a modular addition that determines restricted attributes from the task description and corrects violations through iterative review.

Core claim

Social bias reaches up to 60.58 percent in code generated by leading LLMs across seven demographic dimensions. Standard prompt interventions and diffuse fairness instructions often worsen outcomes. The Fairness Monitor Agent, inserted into any existing pipeline, first scopes which attributes the code should ignore or restrict, then detects and repairs violations without needing an executable test suite. Across all 343 tasks this reduces bias by 65.1 percent relative to a baseline developer agent and raises the rate of functionally correct code from 75.80 percent to 83.97 percent, outperforming the other approaches tested.
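As a quick sanity check on the correctness figures, the sketch below converts the reported rates into task counts. The counts are inferred here from rounding and are not stated in the source. The last computed value assumes the 65.1 percent figure is a relative reduction; the absolute bias scores stay undetermined because this summary does not give the baseline score.

```python
TOTAL_TASKS = 343       # benchmark size reported in the paper
BASELINE_RATE = 75.80   # percent functionally correct, developer agent alone
FMA_RATE = 83.97        # percent functionally correct with FMA
BIAS_REDUCTION = 65.1   # percent, ASSUMED here to be a relative reduction


def implied_count(rate_percent: float, total: int) -> int:
    """Nearest whole number of tasks consistent with a reported percentage."""
    return round(rate_percent / 100 * total)


baseline_tasks = implied_count(BASELINE_RATE, TOTAL_TASKS)
fma_tasks = implied_count(FMA_RATE, TOTAL_TASKS)

# Check that the inferred counts round back to the reported percentages.
assert round(100 * baseline_tasks / TOTAL_TASKS, 2) == BASELINE_RATE
assert round(100 * fma_tasks / TOTAL_TASKS, 2) == FMA_RATE

absolute_gain = FMA_RATE - BASELINE_RATE  # percentage points
relative_gain = 100 * (fma_tasks - baseline_tasks) / baseline_tasks
remaining_bias_fraction = 1 - BIAS_REDUCTION / 100  # share of baseline bias left

print(f"{baseline_tasks} -> {fma_tasks} correct tasks")
print(f"gain: {absolute_gain:.2f} points absolute, {relative_gain:.1f}% relative")
print(f"bias remaining: {remaining_bias_fraction:.3f} x baseline score")
```

Under these assumptions the two rates correspond to 260 and 288 of the 343 tasks, a gain of 28 tasks.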

What carries the argument

The Fairness Monitor Agent, a plug-in module that extracts demographic attributes to consider or restrict from the task description, then performs iterative detection and correction of violations in generated code.
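The scope, detect, and repair cycle described here can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not say how FMA performs each step, so the keyword-based scoping, the `ast`-based check, and every name below (`scope_restricted`, `find_violations`, `review_loop`, `toy_repair`, the sample task) are assumptions. The `repair` callable stands in for whatever model call would rewrite the code.

```python
import ast
import re
from typing import Callable, List, Set


def scope_restricted(task_description: str, attributes: Set[str]) -> Set[str]:
    """Closed-world toy scoping: an attribute is allowed only if the task
    description names it; every other attribute is treated as restricted."""
    text = task_description.lower()
    allowed = {a for a in attributes
               if re.search(rf"\b{re.escape(a.lower())}\b", text)}
    return attributes - allowed


def find_violations(code: str, restricted: Set[str]) -> List[str]:
    """Static check (no execution): restricted names accessed as self.<name>."""
    hits = set()
    for node in ast.walk(ast.parse(code)):
        if (isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id == "self"
                and node.attr in restricted):
            hits.add(node.attr)
    return sorted(hits)


def review_loop(code: str, restricted: Set[str],
                repair: Callable[[str, List[str]], str],
                max_rounds: int = 3) -> str:
    """Detect violations and call `repair` until clean or out of rounds."""
    for _ in range(max_rounds):
        violations = find_violations(code, restricted)
        if not violations:
            break
        code = repair(code, violations)
    return code


# Hypothetical task and generated code, for illustration only.
TASK = "Decide scholarship eligibility from the applicant's gpa and income."
ATTRIBUTES = {"gpa", "income", "gender", "race"}
GENERATED = '''
class Applicant:
    def eligible(self):
        return self.gpa >= 3.5 and self.gender == "male"
'''


def toy_repair(code: str, violations: List[str]) -> str:
    # Stand-in for an LLM rewrite; it only knows this one pattern.
    return code.replace(' and self.gender == "male"', "")


restricted = scope_restricted(TASK, ATTRIBUTES)
fixed = review_loop(GENERATED, restricted, toy_repair)
print(sorted(restricted), find_violations(GENERATED, restricted),
      find_violations(fixed, restricted))
```

Because the check is purely static, the loop needs no executable test suite, which is the property the summary attributes to FMA. A real detector would need to catch far more than direct `self.<name>` reads.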

If this is right

  • Bias reduction is possible by adding a targeted review step without changing the base model or generation process.
  • Explicit fairness instructions given to every agent role increase bias compared with giving none.
  • Early definition of which demographic attributes the code must ignore produces better fairness than later or distributed instructions.
  • The method works without access to executable test cases or runtime verification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A similar scoped-review component could be added to pipelines that generate other artifacts such as documentation or test cases.
  • Separating fairness enforcement into one dedicated role may generalize better than embedding fairness prompts throughout an entire workflow.
  • The same scoping logic could be applied to other constraints such as security or performance rules that are hard to test automatically.

Load-bearing premise

The 343 tasks and the Code Bias Score together capture the social biases that matter in real deployed software.

What would settle it

Applying the Fairness Monitor Agent to an independent set of production coding tasks and finding that bias scores do not drop by a comparable margin or that functional correctness does not increase.

Figures

Figures reproduced from arXiv: 2605.00382 by Fazle Rabbi, Jinqiu Yang, Lin Ling, Song Wang.

Figure 1: Overview of the Solar fairness evaluation framework.
  … interventions such as Chain-of-Thought and positive role-playing amplify rather than reduce it. (3) We investigate social bias in multi-agent code generation workflows using FlowGen, examining how workflow structure, fairness-aware role instructions, and role composition affect bias in the generated code. (4) We propose FMA, a modular, oracle-free fairn…

Figure 2: Task definition example from SocialBias-Bench.
  – Related attributes: Task-specific features relevant to completing the coding logic (e.g., GPA for admissions, dietary habits for health exams). Crucially, we strive to avoid misleading code prompts. Unlike prior work that passes protected attributes directly as method parameters, our method headers use (self) as the sole parameter. This design discourages t…

Figure 3: Automatically generated code prompt from …

Figure 4: Example of biased code generated by the LLM, excluding transgender …

Figure 5: An executable metamorphic test case generated by …

Figure 6: Overview of the FMA pipeline architecture.
  – Restricted attributes: all remaining attributes, particularly demographic characteristics such as gender, race, and religion, which must not influence the generated code. This classification is derived entirely from the task's Docstring and type information, without any dataset-specific configuration. The agent adopts a closed-world assumption: any attributes n…

Figure 7: Radar charts: Bias Leaning Ratio of seven demographic dimensions on …
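
The captions above mention an executable metamorphic test case and method headers that take only `self`. A hedged sketch of what such a test could look like follows: it changes one protected attribute at a time and checks that the method's output stays the same. The `Applicant` class, both `eligible_*` methods, and `metamorphic_failures` are hypothetical names invented for this illustration; the paper's own test generator is not described in this summary.

```python
from dataclasses import dataclass, replace
from typing import Any, Iterable, List


@dataclass(frozen=True)
class Applicant:
    gpa: float
    gender: str

    # Method headers take only `self`, as the Figure 2 excerpt describes.
    def eligible_biased(self) -> bool:
        # Hypothetical biased logic that excludes one gender value.
        return self.gpa >= 3.5 and self.gender != "transgender"

    def eligible_fair(self) -> bool:
        return self.gpa >= 3.5


def metamorphic_failures(base: Any, method_name: str, attribute: str,
                         values: Iterable[Any]) -> List[Any]:
    """Return the attribute values whose substitution changes the output.

    Metamorphic relation: changing only a protected attribute must leave
    the method's result unchanged.
    """
    expected = getattr(base, method_name)()
    failures = []
    for value in values:
        variant = replace(base, **{attribute: value})
        if getattr(variant, method_name)() != expected:
            failures.append(value)
    return failures


base = Applicant(gpa=3.8, gender="female")
genders = ["male", "female", "transgender"]
biased_failures = metamorphic_failures(base, "eligible_biased", "gender", genders)
fair_failures = metamorphic_failures(base, "eligible_fair", "gender", genders)
print(biased_failures, fair_failures)
```

The check needs no expected output for any single input, only the relation between paired runs, which is what makes the metamorphic form convenient for fairness testing.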

original abstract

Large Language Models (LLMs) are increasingly deployed to generate code for human-centered applications where demographic fairness is critical. However, existing evaluations focus almost exclusively on functional correctness, leaving social bias in LLM-generated code largely unexamined. Extending our prior work on Solar, we conduct a comprehensive empirical study using SocialBias-Bench, a benchmark of 343 real-world coding tasks spanning seven demographic dimensions. We evaluate four prominent LLMs and find severe bias across all models, with Code Bias Scores reaching up to 60.58%. We further show that standard prompt-level interventions, such as Chain-of-Thought reasoning and fairness persona assignment, inadvertently amplify bias rather than reduce it. We then investigate whether structured multi-agent software process frameworks can improve fairness, finding that structured pipelines reduce bias when early roles correctly scope what the code should and should not consider. However, adding explicit fairness instructions to all agent roles produces worse outcomes than providing none, suggesting that diffused responsibility goes unaddressed. To address these limitations, we propose the Fairness Monitor Agent (FMA), a modular component that plugs into any existing code generation pipeline without modifying it. FMA analyzes the task description to determine which attributes should be considered or restricted, then detects and corrects violations through an iterative review process, without requiring an executable test suite. Evaluated on all 343 tasks, FMA reduces bias by 65.1% compared to a developer agent alone and improves functional correctness from 75.80% to 83.97%, outperforming all other studied approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SocialBias-Bench, a benchmark of 343 real-world coding tasks spanning seven demographic dimensions, to evaluate social bias in code generated by four prominent LLMs. It reports high bias levels (Code Bias Scores up to 60.58%), shows that standard prompt interventions like Chain-of-Thought and fairness personas can amplify bias, and finds that structured multi-agent pipelines reduce bias only when early roles properly scope demographic considerations. The authors propose the Fairness Monitor Agent (FMA), a modular plug-in component that analyzes tasks, detects violations, and corrects them iteratively without needing test suites; on the full benchmark, FMA reduces bias by 65.1% relative to a baseline developer agent while raising functional correctness from 75.80% to 83.97%.

Significance. If the benchmark and metric prove valid, the work is significant for software engineering and responsible AI: it fills a gap in fairness evaluation for code generation, demonstrates counter-intuitive effects of common interventions, and supplies a practical, non-intrusive mitigation that integrates with existing pipelines. The dual improvement in fairness and correctness is a notable strength.

major comments (2)
  1. [§3 and §4] §3 (SocialBias-Bench construction) and §4 (Code Bias Score definition): All headline quantitative results (65.1% bias reduction, correctness lift from 75.80% to 83.97%) rest exclusively on the Code Bias Score applied to the 343 tasks. The manuscript reports no inter-rater agreement with human fairness judgments, no coverage analysis across the seven demographic dimensions, and no comparison against existing bias benchmarks; without such validation the deltas could be artifacts of the chosen metric rather than evidence of improved real-world fairness.
  2. [§5] §5 (FMA evaluation): The claim that FMA outperforms all other studied approaches is load-bearing for the contribution, yet the text does not clarify whether the comparison agents received equivalent prompting resources or iteration budgets, nor does it report statistical significance tests or variance across the 343 tasks; this weakens the superiority conclusion.
minor comments (2)
  1. [Abstract] The abstract references 'extending our prior work on Solar' but the manuscript does not supply a citation or brief summary of that work, which would help readers situate the current contribution.
  2. [§3] Notation for the seven demographic dimensions is introduced without an explicit table or figure summarizing their distribution in the 343 tasks; adding this would improve clarity.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

point-by-point responses
  1. Referee: [§3 and §4] §3 (SocialBias-Bench construction) and §4 (Code Bias Score definition): All headline quantitative results (65.1% bias reduction, correctness lift from 75.80% to 83.97%) rest exclusively on the Code Bias Score applied to the 343 tasks. The manuscript reports no inter-rater agreement with human fairness judgments, no coverage analysis across the seven demographic dimensions, and no comparison against existing bias benchmarks; without such validation the deltas could be artifacts of the chosen metric rather than evidence of improved real-world fairness.

    Authors: We agree that further validation would strengthen the work. SocialBias-Bench consists of 343 manually curated tasks drawn from real-world coding scenarios across seven demographic dimensions, and the Code Bias Score quantifies inappropriate demographic references in generated code. We will add a coverage analysis table showing task distribution across dimensions and a new subsection relating the metric to prior fairness benchmarks in NLP and SE. However, a full inter-rater agreement study with human judges is not feasible within the current scope. The observed deltas align with patterns in the curated tasks rather than metric artifacts, but we accept the need for these additions. revision: partial

  2. Referee: [§5] §5 (FMA evaluation): The claim that FMA outperforms all other studied approaches is load-bearing for the contribution, yet the text does not clarify whether the comparison agents received equivalent prompting resources or iteration budgets, nor does it report statistical significance tests or variance across the 343 tasks; this weakens the superiority conclusion.

    Authors: We thank the referee for highlighting this. All agents in the comparisons (baseline developer agent, Chain-of-Thought, persona-based, and other multi-agent pipelines) received equivalent prompting structures and iteration budgets, as specified in the experimental protocol. We will revise §5 to explicitly document this equivalence and add statistical significance tests (e.g., paired Wilcoxon tests) together with variance measures (standard deviation of bias and correctness scores across the 343 tasks). These changes will be presented in updated tables and text. revision: yes
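
To make the proposed significance testing concrete, here is a minimal standard-library sketch of a two-sided paired Wilcoxon signed-rank test using the normal approximation with a tie correction and no continuity correction. It assumes per-task scores are numeric and paired by task. The function name `wilcoxon_signed_rank` and the sample numbers are illustrative assumptions; the data are synthetic, not results from the paper, and for small samples an exact test from a statistics library would be preferable.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float],
                         y: Sequence[float]) -> Tuple[float, float]:
    """Return (W_plus, two-sided p) for paired samples x and y.

    Zero differences are dropped; tied absolute differences share their
    average rank. The p-value uses the normal approximation.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return w_plus, p_value


# Synthetic per-task bias scores (percent), NOT taken from the paper.
baseline = [62, 55, 40, 71, 48, 66, 59, 52]
with_fma = [20, 25, 10, 30, 15, 22, 28, 12]
w_plus, p_value = wilcoxon_signed_rank(baseline, with_fma)
print(f"W+ = {w_plus}, p = {p_value:.4f}")
```

With all eight synthetic differences positive, W+ reaches its maximum of 36 and the approximate p-value falls below 0.05. On the real benchmark the same call would take the two per-task score lists of length 343.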

standing simulated objections not resolved
  • A complete inter-rater agreement study involving human fairness judgments on the 343 tasks would require new large-scale annotation that exceeds the resources and scope of the present manuscript.

Circularity Check

0 steps flagged

Minor self-citation to prior Solar work; empirical evaluation otherwise self-contained

full rationale

This is an empirical study that introduces SocialBias-Bench (343 tasks) and the FMA agent, then reports measured bias reduction and correctness gains on that benchmark. No equations, fitted parameters, or derivations exist that could reduce to inputs by construction. The single self-reference to prior Solar work appears only as scene-setting and is not invoked to justify uniqueness, forbid alternatives, or carry any quantitative claim. All headline numbers (65.1% bias reduction, 75.80% to 83.97% correctness) are direct empirical outputs on the newly defined benchmark and metric, which is the normal, non-circular pattern for benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on the validity of the new benchmark tasks and bias metric, plus the assumption that the proposed agent can be evaluated without executable tests.

axioms (2)
  • domain assumption: The Code Bias Score is an appropriate and sufficient metric for quantifying social bias in generated code.
    Used throughout to report bias levels up to 60.58% and reductions of 65.1%.
  • domain assumption: The 343 tasks in SocialBias-Bench represent a comprehensive and representative sample of real-world coding scenarios involving demographic considerations.
    Basis for all evaluations and claims about severe bias across models.
invented entities (1)
  • Fairness Monitor Agent (FMA) · no independent evidence
    purpose: Analyzes task descriptions to determine relevant demographic attributes, detects violations, and corrects generated code through iterative review.
    New modular component proposed to address limitations of prompt and multi-agent interventions.

pith-pipeline@v0.9.0 · 5582 in / 1556 out tokens · 33808 ms · 2026-05-09T19:30:58.048433+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair

    cs.SE 2026-05 unverdicted novelty 7.0

    HEJ-Robust benchmark shows LLM-based program repair models drop over 50% in accuracy when buggy code is rewritten with equivalent syntax.

  2. HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair

    cs.SE 2026-05 accept novelty 7.0

    LLM-based Java program repair models lose over 50% of their bug-fixing success rate when presented with equivalent but syntactically varied buggy code.

  3. Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation

    cs.SE 2026-05 unverdicted novelty 7.0

    Many reported failures in LLM-based code translation are false negatives due to evaluation pipeline issues such as improper compilation flags, missing library links, and unconfigured runtime environments rather than i...

  4. Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation

    cs.SE 2026-05 unverdicted novelty 5.0

    A large-scale study finds that many LLM code translation failures are false negatives due to improper evaluation configurations rather than incorrect translations.
