{"paper":{"title":"CodeBLEU: a Method for Automatic Evaluation of Code Synthesis","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"CodeBLEU evaluates generated code by adding syntax tree and data-flow matches to n-gram overlap so that scores align better with human programmer judgments than BLEU or exact accuracy.","cross_cats":["cs.CL"],"primary_cat":"cs.SE","authors_text":"Ambrosio Blanco, Daya Guo, Duyu Tang, Long Zhou, Ming Zhou, Neel Sundaresan, Shuai Lu, Shuai Ma, Shujie Liu, Shuo Ren","submitted_at":"2020-09-22T03:10:49Z","abstract_excerpt":"Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they are not suitable enough to evaluate codes, because BLEU is originally designed to evaluate the natural language, neglecting important syntactic and semantic features of codes, and perfect accuracy is too strict thus it underestimates different outputs with the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed CodeBL"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Experimental results show that our proposed CodeBLEU can achieve a better correlation with programmer assigned scores compared with BLEU and accuracy.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the weighted combination of n-gram, AST, and data-flow matches will reliably reflect human judgment of code quality across tasks without the weights being overfitted to the specific evaluation sets.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"CodeBLEU evaluates generated code by adding syntax tree and data-flow matches to n-gram overlap so that scores align better with human programmer judgments than BLEU or exact accuracy.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5094019a1c9de41f5662359025dd9b7e2f008e66f02e6ae48e56eb32402d2f19"},"source":{"id":"2009.10297","kind":"arxiv","version":2},"verdict":{"id":"6381d62b-2e4b-47ad-8032-f34ce46b7216","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T19:56:20.500957Z","strongest_claim":"Experimental results show that our proposed CodeBLEU can achieve a better correlation with programmer assigned scores compared with BLEU and accuracy.","one_line_summary":"CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the weighted combination of n-gram, AST, and data-flow matches will reliably reflect human judgment of code quality across tasks without the weights being overfitted to the specific evaluation sets.","pith_extraction_headline":"CodeBLEU evaluates generated code by adding syntax tree and data-flow matches to n-gram overlap so that scores align better with human programmer judgments than BLEU or exact accuracy."},"references":{"count":93,"sample":[{"doi":"","year":null,"title":"Advances in Neural Information Processing Systems , pages=","work_id":"8010d89b-4f9a-4726-9812-02d8b3ed550b","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Achieving human parity on automatic chinese to english news translation","work_id":"c80c68b1-3b0a-482e-b79d-b02db29279d6","ref_index":2,"cited_arxiv_id":"1803.05567","is_internal_anchor":true},{"doi":"","year":null,"title":"Unsupervised Neural Machine Translation","work_id":"d0396497-df5e-4a47-9d4b-ada8de4e0a1c","ref_index":3,"cited_arxiv_id":"1710.11041","is_internal_anchor":true},{"doi":"","year":null,"title":"Unsupervised machine translation using monolingual corpora only","work_id":"df3261c0-84d8-483e-9f29-09aa2fe5227a","ref_index":4,"cited_arxiv_id":"1711.00043","is_internal_anchor":true},{"doi":"","year":null,"title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , volume=","work_id":"fe0be63d-14f0-4421-b391-ce04609b4a3b","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":93,"snapshot_sha256":"9dcab658f88c6bf5a7527bb73c1b78f6988caefb7f8cf6410f5bd05db93b6ecc","internal_anchors":16},"formal_canon":{"evidence_count":2,"snapshot_sha256":"2d59f7422650717871670afe3de1e6977bde186a0bec214d34ce1a73686be1b9"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}