CoRR abs/2303.17568 (2023)

CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X · 2023 · arXiv 2303.17568

14 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background: 2 · dataset: 2

citation-polarity summary

background: 3 · use dataset: 1

representative citing papers

Social Bias in LLM-Generated Code: Benchmark and Mitigation

cs.SE · 2026-05-01 · unverdicted · novelty 7.0

LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

cs.CL · 2023-12-20 · accept · novelty 7.0

A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

cs.SE · 2023-05-02 · accept · novelty 7.0

EvalPlus augments HumanEval with 80x more tests via LLM and mutation strategies, exposing up to 28.9% more incorrect LLM-generated code and reversing some model performance rankings.

Mitigating Position Bias in Transformers via Layer-Specific Positional Embedding Scaling

cs.CL · 2026-06-26 · unverdicted · novelty 6.0

LPES uses per-layer scaling factors optimized by a genetic algorithm with Bézier curves to balance attention and improve long-context LLM performance by up to 11.2% on key-value retrieval.

A Systematic Study of Behavioral Cloning for Scientific Data Annotation

cs.HC · 2026-05-26 · unverdicted · novelty 6.0

Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

cs.SE · 2024-03-12 · unverdicted · novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback

cs.SE · 2026-05-18 · unverdicted · novelty 5.0

A-ProS uses a hybrid multi-model feedback framework with stateful refinement to improve success rates on competitive programming problems, achieving over 2x gains compared to baseline agent loops.

InternLM2 Technical Report

cs.CL · 2024-03-26 · unverdicted · novelty 5.0

InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.

StarCoder: may the source be with you!

cs.CL · 2023-05-09 · accept · novelty 5.0 · 2 refs

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

Specification-Driven Code Translation Powered by Large Language Models: How Far Are We?

cs.SE · 2024-12-05 · unverdicted · novelty 4.0

NL specifications alone do not improve LLM code translation performance, but combining them with source code yields gains in select language pairs with no overall consistent benefit.

CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology

cs.SE · 2024-02-02 · unverdicted · novelty 4.0

CodePori is a multi-agent LLM system for code generation whose participant evaluation identifies practical challenges like memory limits and hallucinations missed by binary benchmarks.

Surveying GenAI-based Automation in Printed Circuit Board Design and Test

cs.AR · 2026-06-10 · unverdicted · novelty 3.0

A survey of GenAI across the PCB design lifecycle, presenting a taxonomy, technical challenges, and research directions.

A Survey of Large Language Models

cs.CL · 2023-03-31 · accept · novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Evaluating LLM-Generated Code: A Benchmark and Developer Study

cs.SE · 2026-05-09

citing papers explorer

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

cs.SE · 2023-05-02 · accept · none · ref 76

EvalPlus augments HumanEval with 80x more tests via LLM and mutation strategies, exposing up to 28.9% more incorrect LLM-generated code and reversing some model performance rankings.

Evaluating LLM-Generated Code: A Benchmark and Developer Study

cs.SE · 2026-05-09 · unreviewed · ref 23
