{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:HVF3LOTTC34HDWC2LTG2VQFFI3","short_pith_number":"pith:HVF3LOTT","schema_version":"1.0","canonical_sha256":"3d4bb5ba7316f871d85a5ccdaac0a546e902c3c27765137d80ddee3bd3d8c681","source":{"kind":"arxiv","id":"2305.01210","version":3},"attestation_state":"computed","paper":{"title":"Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Augmenting HumanEval with 80 times more test cases reveals that LLM-generated code contains substantially more functional errors than prior benchmarks detected.","cross_cats":["cs.CL","cs.LG"],"primary_cat":"cs.SE","authors_text":"Chunqiu Steven Xia, Jiawei Liu, Lingming Zhang, Yuyao Wang","submitted_at":"2023-05-02T05:46:48Z","abstract_excerpt":"Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. However, these test-cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus --"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2305.01210","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.SE","submitted_at":"2023-05-02T05:46:48Z","cross_cats_sorted":["cs.CL","cs.LG"],"title_canon_sha256":"e6912ff5b6a9a8d99edc6c7fc3fed66c47a34217d52c0df0c24b74db00f741a2","abstract_canon_sha256":"04b221cba3aa676fae32679075e1337b1723cea41e1848f41ca3a87499c2be97"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-18T02:44:08.792857Z","signature_b64":"sA80ZV9EHavQzk/VVsV1KEVA6AkhGdoYoy5fFFI8OpRQ3tuDGmpJAxU79AM2DdBeKY/GWo6HA44LU6yawYhYBg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"3d4bb5ba7316f871d85a5ccdaac0a546e902c3c27765137d80ddee3bd3d8c681","last_reissued_at":"2026-05-18T02:44:08.792377Z","signature_status":"signed_v1","first_computed_at":"2026-05-18T02:44:08.792377Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Augmenting HumanEval with 80 times more test cases reveals that LLM-generated code contains substantially more functional errors than prior benchmarks detected.","cross_cats":["cs.CL","cs.LG"],"primary_cat":"cs.SE","authors_text":"Chunqiu Steven Xia, Jiawei Liu, Lingming Zhang, Yuyao Wang","submitted_at":"2023-05-02T05:46:48Z","abstract_excerpt":"Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. However, these test-cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus --"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our extensive evaluation across 26 popular LLMs demonstrates that HumanEval+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by up-to 19.3-28.9%.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The automatically generated test cases are functionally correct and do not introduce false failures or miss important edge cases in the code under test.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"EvalPlus augments HumanEval with 80x more tests via LLM and mutation strategies, exposing up to 28.9% more incorrect LLM-generated code and reversing some model performance rankings.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Augmenting HumanEval with 80 times more test cases reveals that LLM-generated code contains substantially more functional errors than prior benchmarks detected.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"f4fb22ccd58e27830b15f68b44b2220c87919a62ae689ce9c9e453737b4bc643"},"source":{"id":"2305.01210","kind":"arxiv","version":3},"verdict":{"id":"8a32ed04-ba68-4560-bbd4-28249b6fa9a9","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T01:58:11.769884Z","strongest_claim":"Our extensive evaluation across 26 popular LLMs demonstrates that HumanEval+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by up-to 19.3-28.9%.","one_line_summary":"EvalPlus augments HumanEval with 80x more tests via LLM and mutation strategies, exposing up to 28.9% more incorrect LLM-generated code and reversing some model performance rankings.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The automatically generated test cases are functionally correct and do not introduce false failures or miss important edge cases in the code under test.","pith_extraction_headline":"Augmenting HumanEval with 80 times more test cases reveals that LLM-generated code contains substantially more functional errors than prior benchmarks detected."},"references":{"count":76,"sample":[{"doi":"","year":2022,"title":"T. Ahmed and P. Devanbu. Few-shot training llms for project-specific code-summarization. In 37th IEEE/ACM International Conference on Automated Software Engineering, pages 1–5, 2022","work_id":"1bb08f1e-203e-4f1f-a10c-f01969ecedb9","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Santacoder: don’t reach for the stars! arXiv preprint arXiv:2301.03988","work_id":"bf393c50-a11b-4a0f-8513-52428ede71f7","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program synthesis with large language models, 2021","work_id":"214a9f12-b122-4033-8e0c-c6414d4ff463","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"S. Bang, S. Nam, I. Chun, H. Y . Jhoo, and J. Lee. Smt-based translation validation for machine learning compiler. In Computer Aided Verification: 34th International Conference, CAV 2022, Haifa, Israe","work_id":"eaf2685d-5d0e-4925-8b9a-089a989d9bae","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"S. Black, L. Gao, P. Wang, C. Leahy, and S. Biderman. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, Mar. 2021. If you use this software, please cite it using these metada","work_id":"7300281c-8472-4002-9d18-dd583a8986d3","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":76,"snapshot_sha256":"dc74bba5f761852b703d8e587e8f41fa5e683a46d8237897d95a747c48fd1e7d","internal_anchors":6},"formal_canon":{"evidence_count":2,"snapshot_sha256":"3d720ce297ebb3c900608d8650b78878727ef351a1186990a4cf63b72069ccf0"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2305.01210","created_at":"2026-05-18T02:44:08.792456+00:00"},{"alias_kind":"arxiv_version","alias_value":"2305.01210v3","created_at":"2026-05-18T02:44:08.792456+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2305.01210","created_at":"2026-05-18T02:44:08.792456+00:00"},{"alias_kind":"pith_short_12","alias_value":"HVF3LOTTC34H","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"HVF3LOTTC34HDWC2","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"HVF3LOTT","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":29,"internal_anchor_count":29,"sample":[{"citing_arxiv_id":"2605.08738","citing_title":"SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17958","citing_title":"Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15301","citing_title":"Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2512.20856","citing_title":"NVIDIA Nemotron 3: Efficient and Open Intelligence","ref_index":104,"is_internal_anchor":true},{"citing_arxiv_id":"2506.17298","citing_title":"Mercury: Ultra-Fast Language Models Based on Diffusion","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2309.05922","citing_title":"A Survey of Hallucination in Large Foundation Models","ref_index":81,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16321","citing_title":"LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2603.27098","citing_title":"Ensemble-Based Uncertainty Estimation for Code Correctness Estimation","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2303.17760","citing_title":"CAMEL: Communicative Agents for \"Mind\" Exploration of Large Language Model Society","ref_index":69,"is_internal_anchor":true},{"citing_arxiv_id":"2306.11644","citing_title":"Textbooks Are All You Need","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26923","citing_title":"ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09059","citing_title":"Evaluating LLM-Generated Code: A Benchmark and Developer Study","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08738","citing_title":"SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09188","citing_title":"DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09023","citing_title":"Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03546","citing_title":"ProgramBench: Can Language Models Rebuild Programs From Scratch?","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2503.01743","citing_title":"Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24712","citing_title":"When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24703","citing_title":"Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05267","citing_title":"Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code","ref_index":71,"is_internal_anchor":true},{"citing_arxiv_id":"2405.15793","citing_title":"SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19826","citing_title":"Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12088","citing_title":"Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10387","citing_title":"Leveraging Mathematical Reasoning of LLMs for Efficient GPU Thread Mapping","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2305.06161","citing_title":"StarCoder: may the source be with you!","ref_index":289,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/HVF3LOTTC34HDWC2LTG2VQFFI3","json":"https://pith.science/pith/HVF3LOTTC34HDWC2LTG2VQFFI3.json","graph_json":"https://pith.science/api/pith-number/HVF3LOTTC34HDWC2LTG2VQFFI3/graph.json","events_json":"https://pith.science/api/pith-number/HVF3LOTTC34HDWC2LTG2VQFFI3/events.json","paper":"https://pith.science/paper/HVF3LOTT"},"agent_actions":{"view_html":"https://pith.science/pith/HVF3LOTTC34HDWC2LTG2VQFFI3","download_json":"https://pith.science/pith/HVF3LOTTC34HDWC2LTG2VQFFI3.json","view_paper":"https://pith.science/paper/HVF3LOTT","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2305.01210&json=true","fetch_graph":"https://pith.science/api/pith-number/HVF3LOTTC34HDWC2LTG2VQFFI3/graph.json","fetch_events":"https://pith.science/api/pith-number/HVF3LOTTC34HDWC2LTG2VQFFI3/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/HVF3LOTTC34HDWC2LTG2VQFFI3/action/timestamp_anchor","attest_storage":"https://pith.science/pith/HVF3LOTTC34HDWC2LTG2VQFFI3/action/storage_attestation","attest_author":"https://pith.science/pith/HVF3LOTTC34HDWC2LTG2VQFFI3/action/author_attestation","sign_citation":"https://pith.science/pith/HVF3LOTTC34HDWC2LTG2VQFFI3/action/citation_signature","submit_replication":"https://pith.science/pith/HVF3LOTTC34HDWC2LTG2VQFFI3/action/replication_record"}},"created_at":"2026-05-18T02:44:08.792456+00:00","updated_at":"2026-05-18T02:44:08.792456+00:00"}