Pith Number

pith:HVF3LOTT

pith:2023:HVF3LOTTC34HDWC2LTG2VQFFI3
not attested · not anchored · not stored · refs resolved

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Chunqiu Steven Xia, Jiawei Liu, Lingming Zhang, Yuyao Wang

Augmenting HumanEval with 80 times more test cases reveals that LLM-generated code contains substantially more functional errors than prior benchmarks detected.

arxiv:2305.01210 v3 · 2023-05-02 · cs.SE · cs.CL · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{HVF3LOTTC34HDWC2LTG2VQFFI3}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
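
As a rough illustration of what "deterministic merge" can mean, the sketch below folds a set of events into a state that is independent of arrival order and of duplicate deliveries. The event fields (`id`, `ts`, `field`, `value`) and the last-writer-wins rule are assumptions made for this example; they are not the bundle's actual schema or algorithm, and signature checking is left out.

```python
from typing import Dict, Iterable

def merge_events(events: Iterable[dict]) -> Dict[str, str]:
    """Fold events into a state that ignores arrival order and duplicates."""
    # Hypothetical event shape: {"id", "ts", "field", "value"}.
    unique = {e["id"]: e for e in events}  # drop duplicate deliveries
    # A total order on (timestamp, id) makes the fold reproducible anywhere.
    ordered = sorted(unique.values(), key=lambda e: (e["ts"], e["id"]))
    state: Dict[str, str] = {}
    for e in ordered:
        state[e["field"]] = e["value"]  # last writer wins (assumed rule)
    return state

events = [
    {"id": "e1", "ts": 1, "field": "archive", "value": "pending"},
    {"id": "e2", "ts": 2, "field": "archive", "value": "stored"},
    {"id": "e3", "ts": 2, "field": "claim", "value": "open"},
]
state_a = merge_events(events)
# Same events, reversed and with one duplicate: the result is unchanged.
state_b = merge_events(list(reversed(events)) + events[:1])
```

Two mirrors that hold the same set of events reach the same state because the sort key is total and does not depend on how the events were received.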

Claims

C1 strongest claim

Our extensive evaluation across 26 popular LLMs demonstrates that HumanEval+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by up-to 19.3-28.9%.
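
For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: with n samples per problem, of which c pass every test, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator. The sample counts are invented for illustration and are not results from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: stricter tests can only lower c,
# so pass@k under the augmented tests is never higher.
base = pass_at_k(n=10, c=6, k=1)  # samples passing the original tests
plus = pass_at_k(n=10, c=4, k=1)  # samples still passing the augmented tests
```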

C2 weakest assumption

The automatically generated test cases are functionally correct and do not introduce false failures or miss important edge cases in the code under test.

C3 one line summary

EvalPlus augments HumanEval with 80x more tests via LLM and mutation strategies, exposing up to 28.9% more incorrect LLM-generated code and reversing some model performance rankings.
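
The idea of exposing wrong code with mutated inputs can be sketched as follows: start from a seed input that the candidate already passes, apply small edits, and compare the candidate with a trusted reference on each mutant. This is a minimal sketch under assumptions, not EvalPlus's implementation. The mutation rules, `find_counterexample`, and both example functions are hypothetical, and the search is exhaustive rather than randomized so that the result is reproducible.

```python
from collections import deque
from typing import Callable, Iterator, List, Optional

def mutants(xs: List[int]) -> Iterator[List[int]]:
    """Yield every list one small edit away from xs (hypothetical rules)."""
    for i in range(len(xs)):
        yield xs[:i] + xs[i + 1:]                # delete element i
        yield xs[:i] + [xs[i] + 1] + xs[i + 1:]  # increment element i
        yield xs[:i] + [xs[i] - 1] + xs[i + 1:]  # decrement element i

def find_counterexample(candidate: Callable, reference: Callable,
                        seed: List[int], max_depth: int = 3) -> Optional[List[int]]:
    """Breadth-first search over mutants for an input where the two disagree."""
    queue = deque([(seed, 0)])
    seen = {tuple(seed)}
    while queue:
        xs, depth = queue.popleft()
        if candidate(xs) != reference(xs):
            return xs
        if depth < max_depth:
            for m in mutants(xs):
                if tuple(m) not in seen:
                    seen.add(tuple(m))
                    queue.append((m, depth + 1))
    return None

def reference_max(xs: List[int]) -> Optional[int]:
    return max(xs, default=None)

def buggy_max(xs: List[int]) -> Optional[int]:
    best = 0  # bug: wrong starting value
    for x in xs:
        best = max(best, x)
    return best

seed = [1, 2, 3]
passes_seed = buggy_max(seed) == reference_max(seed)  # the original test passes
counterexample = find_counterexample(buggy_max, reference_max, seed)
```

The buggy function passes the single seed test, but the search reaches an input (here the empty list) on which it disagrees with the reference. This also shows why the weakest assumption matters: the comparison is only as trustworthy as the reference.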

References

76 extracted · 76 resolved · 6 Pith anchors


Formal links

2 machine-checked theorem links

Cited by

28 papers in Pith

Receipt and verification
First computed 2026-05-18T02:44:08.792377Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

3d4bb5ba7316f871d85a5ccdaac0a546e902c3c27765137d80ddee3bd3d8c681
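
On this record, the 26-character Pith Number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. That relationship is inferred from the values shown on this page, not taken from a published specification, so the sketch below should be read as an observation rather than as the official derivation.

```python
import base64

canonical_hash = "3d4bb5ba7316f871d85a5ccdaac0a546e902c3c27765137d80ddee3bd3d8c681"

digest = bytes.fromhex(canonical_hash)              # 32 raw bytes
encoded = base64.b32encode(digest).decode("ascii")  # RFC 4648 alphabet: A-Z, 2-7
pith_number = encoded[:26]                          # assumed truncation length

print(pith_number)
```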

Aliases

arxiv: 2305.01210 · arxiv_version: 2305.01210v3 · doi: 10.48550/arxiv.2305.01210 · pith_short_12: HVF3LOTTC34H · pith_short_16: HVF3LOTTC34HDWC2 · pith_short_8: HVF3LOTT
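
The three `pith_short_*` aliases listed here are prefixes of the full Pith Number. A minimal sketch of that relationship follows; the helper name `short_aliases` is hypothetical, and the prefix rule is read off this record rather than from a specification.

```python
from typing import Dict

def short_aliases(pith_number: str, lengths=(8, 12, 16)) -> Dict[str, str]:
    """Build the short aliases as fixed-length prefixes of the full number."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}

aliases = short_aliases("HVF3LOTTC34HDWC2LTG2VQFFI3")
```
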
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/HVF3LOTTC34HDWC2LTG2VQFFI3 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 3d4bb5ba7316f871d85a5ccdaac0a546e902c3c27765137d80ddee3bd3d8c681
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "04b221cba3aa676fae32679075e1337b1723cea41e1848f41ca3a87499c2be97",
    "cross_cats_sorted": [
      "cs.CL",
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.SE",
    "submitted_at": "2023-05-02T05:46:48Z",
    "title_canon_sha256": "e6912ff5b6a9a8d99edc6c7fc3fed66c47a34217d52c0df0c24b74db00f741a2"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2305.01210",
    "kind": "arxiv",
    "version": 3
  }
}
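
The same check as the shell pipeline above can be run offline against the record as printed here. The sketch mirrors the serialization used in the one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8). The function name `canonical_sha256` is an assumption of this example. Compare the printed digest with the canonical hash listed above.

```python
import hashlib
import json

def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash as UTF-8."""
    canonical = json.dumps(record, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

record = {
    "metadata": {
        "abstract_canon_sha256": "04b221cba3aa676fae32679075e1337b1723cea41e1848f41ca3a87499c2be97",
        "cross_cats_sorted": ["cs.CL", "cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.SE",
        "submitted_at": "2023-05-02T05:46:48Z",
        "title_canon_sha256": "e6912ff5b6a9a8d99edc6c7fc3fed66c47a34217d52c0df0c24b74db00f741a2",
    },
    "schema_version": "1.0",
    "source": {"id": "2305.01210", "kind": "arxiv", "version": 3},
}

digest = canonical_sha256(record)
print(digest)
```

Because keys are sorted before hashing, the order in which the fields appear in the JSON does not affect the digest. List order, such as the entries of `cross_cats_sorted`, still does.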