{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2020:EZH63GUIZJN25JRIYI2IJLLVPQ","short_pith_number":"pith:EZH63GUI","schema_version":"1.0","canonical_sha256":"264fed9a88ca5baea628c23484ad757c25db12adec91b448e72ca34c273020b2","source":{"kind":"arxiv","id":"2009.10297","version":2},"attestation_state":"computed","paper":{"title":"CodeBLEU: a Method for Automatic Evaluation of Code Synthesis","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"CodeBLEU evaluates generated code by adding syntax tree and data-flow matches to n-gram overlap so that scores align better with human programmer judgments than BLEU or exact accuracy.","cross_cats":["cs.CL"],"primary_cat":"cs.SE","authors_text":"Ambrosio Blanco, Daya Guo, Duyu Tang, Long Zhou, Ming Zhou, Neel Sundaresan, Shuai Lu, Shuai Ma, Shujie Liu, Shuo Ren","submitted_at":"2020-09-22T03:10:49Z","abstract_excerpt":"Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they are not suitable enough to evaluate codes, because BLEU is originally designed to evaluate the natural language, neglecting important syntactic and semantic features of codes, and perfect accuracy is too strict thus it underestimates different outputs with the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed CodeBL"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2009.10297","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.SE","submitted_at":"2020-09-22T03:10:49Z","cross_cats_sorted":["cs.CL"],"title_canon_sha256":"034c637e9bb91e2a5a9b62931b450fe548591fef10be3c35891837159640f111","abstract_canon_sha256":"ffc5ae0ba7af0397ab2e62a8724894df8b10ba7e4d47095d5750799d25a2833a"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:21.881558Z","signature_b64":"lvcYKa5uMxMwALZoLP+pdSbDoF1lOafHzV8R1T73v91XIMp00NZOkOPYq0iGjqowYNlTQn1I2OKAcf1QPw+ZCw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"264fed9a88ca5baea628c23484ad757c25db12adec91b448e72ca34c273020b2","last_reissued_at":"2026-05-17T23:39:21.880910Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:21.880910Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"CodeBLEU: a Method for Automatic Evaluation of Code Synthesis","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"CodeBLEU evaluates generated code by adding syntax tree and data-flow matches to n-gram overlap so that scores align better with human programmer judgments than BLEU or exact accuracy.","cross_cats":["cs.CL"],"primary_cat":"cs.SE","authors_text":"Ambrosio Blanco, Daya Guo, Duyu Tang, Long Zhou, Ming Zhou, Neel Sundaresan, Shuai Lu, Shuai Ma, Shujie Liu, Shuo Ren","submitted_at":"2020-09-22T03:10:49Z","abstract_excerpt":"Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they are not suitable enough to evaluate codes, because BLEU is originally designed to evaluate the natural language, neglecting important syntactic and semantic features of codes, and perfect accuracy is too strict thus it underestimates different outputs with the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed CodeBL"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Experimental results show that our proposed CodeBLEU can achieve a better correlation with programmer assigned scores compared with BLEU and accuracy.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the weighted combination of n-gram, AST, and data-flow matches will reliably reflect human judgment of code quality across tasks without the weights being overfitted to the specific evaluation sets.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"CodeBLEU evaluates generated code by adding syntax tree and data-flow matches to n-gram overlap so that scores align better with human programmer judgments than BLEU or exact accuracy.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5094019a1c9de41f5662359025dd9b7e2f008e66f02e6ae48e56eb32402d2f19"},"source":{"id":"2009.10297","kind":"arxiv","version":2},"verdict":{"id":"6381d62b-2e4b-47ad-8032-f34ce46b7216","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T19:56:20.500957Z","strongest_claim":"Experimental results show that our proposed CodeBLEU can achieve a better correlation with programmer assigned scores compared with BLEU and accuracy.","one_line_summary":"CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the weighted combination of n-gram, AST, and data-flow matches will reliably reflect human judgment of code quality across tasks without the weights being overfitted to the specific evaluation sets.","pith_extraction_headline":"CodeBLEU evaluates generated code by adding syntax tree and data-flow matches to n-gram overlap so that scores align better with human programmer judgments than BLEU or exact accuracy."},"references":{"count":93,"sample":[{"doi":"","year":null,"title":"Advances in Neural Information Processing Systems , pages=","work_id":"8010d89b-4f9a-4726-9812-02d8b3ed550b","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Achieving human parity on automatic chinese to english news translation","work_id":"c80c68b1-3b0a-482e-b79d-b02db29279d6","ref_index":2,"cited_arxiv_id":"1803.05567","is_internal_anchor":true},{"doi":"","year":null,"title":"Unsupervised Neural Machine Translation","work_id":"d0396497-df5e-4a47-9d4b-ada8de4e0a1c","ref_index":3,"cited_arxiv_id":"1710.11041","is_internal_anchor":true},{"doi":"","year":null,"title":"Unsupervised machine translation using monolingual corpora only","work_id":"df3261c0-84d8-483e-9f29-09aa2fe5227a","ref_index":4,"cited_arxiv_id":"1711.00043","is_internal_anchor":true},{"doi":"","year":null,"title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , volume=","work_id":"fe0be63d-14f0-4421-b391-ce04609b4a3b","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":93,"snapshot_sha256":"9dcab658f88c6bf5a7527bb73c1b78f6988caefb7f8cf6410f5bd05db93b6ecc","internal_anchors":16},"formal_canon":{"evidence_count":2,"snapshot_sha256":"2d59f7422650717871670afe3de1e6977bde186a0bec214d34ce1a73686be1b9"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2009.10297","created_at":"2026-05-17T23:39:21.881033+00:00"},{"alias_kind":"arxiv_version","alias_value":"2009.10297v2","created_at":"2026-05-17T23:39:21.881033+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2009.10297","created_at":"2026-05-17T23:39:21.881033+00:00"},{"alias_kind":"pith_short_12","alias_value":"EZH63GUIZJN2","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"EZH63GUIZJN25JRI","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"EZH63GUI","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":48,"internal_anchor_count":48,"sample":[{"citing_arxiv_id":"2508.15503","citing_title":"Guidelines for Empirical Studies in Software Engineering involving Large Language Models","ref_index":109,"is_internal_anchor":true},{"citing_arxiv_id":"2409.19894","citing_title":"TransAgent: Enhancing LLM-Based Code Translation via Fine-Grained Execution Alignment","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2502.14925","citing_title":"CODEPROMPTZIP: Code-specific Prompt Compression for Retrieval-Augmented Generation in Coding Tasks with LMs","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2502.18632","citing_title":"Automated Knowledge Component Generation for Interpretable Knowledge Tracing in Coding Problems","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2503.14281","citing_title":"XOXO: Stealthy Cross-Origin Context Poisoning Attacks against AI Coding Assistants","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2503.16771","citing_title":"Enabling Global, Human-Centered Explanations for LLMs:From Tokens to Interpretable Code and Test Generation","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2601.22655","citing_title":"Do Fine-Tuned LLMs Understand Vulnerabilities? An Investigation into the Semantic Trap","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21615","citing_title":"ASSEMBLAGE-DEEPHISTORY: A Cross-Build Binary Dataset with Temporal Coverage","ref_index":65,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21984","citing_title":"Echo: Learning from Experience Data via User-Driven Refinement","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2310.02170","citing_title":"A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17957","citing_title":"Contextualized Code Pretraining for Code Generation","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19717","citing_title":"Physics-in-the-Loop: A Hybrid Agentic Architecture for Validated CAD Engineering Design","ref_index":63,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17444","citing_title":"MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17450","citing_title":"ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2507.20619","citing_title":"Generating Project-Specific Test Cases with Requirement Validation Intention","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2508.15503","citing_title":"Guidelines for Empirical Studies in Software Engineering involving Large Language Models","ref_index":109,"is_internal_anchor":true},{"citing_arxiv_id":"2508.16771","citing_title":"EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2510.04265","citing_title":"Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation","ref_index":77,"is_internal_anchor":true},{"citing_arxiv_id":"2510.10956","citing_title":"Project-Level C-to-Rust Translation via Pointer Knowledge Graphs","ref_index":80,"is_internal_anchor":true},{"citing_arxiv_id":"2511.21285","citing_title":"PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2602.01785","citing_title":"CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding","ref_index":76,"is_internal_anchor":true},{"citing_arxiv_id":"2602.05353","citing_title":"AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2102.04664","citing_title":"CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation","ref_index":68,"is_internal_anchor":true},{"citing_arxiv_id":"2109.00859","citing_title":"CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2603.16791","citing_title":"Improving Code Comprehension through Cognitive-Load Aware Automated Refactoring for Novice Programmers","ref_index":56,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/EZH63GUIZJN25JRIYI2IJLLVPQ","json":"https://pith.science/pith/EZH63GUIZJN25JRIYI2IJLLVPQ.json","graph_json":"https://pith.science/api/pith-number/EZH63GUIZJN25JRIYI2IJLLVPQ/graph.json","events_json":"https://pith.science/api/pith-number/EZH63GUIZJN25JRIYI2IJLLVPQ/events.json","paper":"https://pith.science/paper/EZH63GUI"},"agent_actions":{"view_html":"https://pith.science/pith/EZH63GUIZJN25JRIYI2IJLLVPQ","download_json":"https://pith.science/pith/EZH63GUIZJN25JRIYI2IJLLVPQ.json","view_paper":"https://pith.science/paper/EZH63GUI","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2009.10297&json=true","fetch_graph":"https://pith.science/api/pith-number/EZH63GUIZJN25JRIYI2IJLLVPQ/graph.json","fetch_events":"https://pith.science/api/pith-number/EZH63GUIZJN25JRIYI2IJLLVPQ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/EZH63GUIZJN25JRIYI2IJLLVPQ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/EZH63GUIZJN25JRIYI2IJLLVPQ/action/storage_attestation","attest_author":"https://pith.science/pith/EZH63GUIZJN25JRIYI2IJLLVPQ/action/author_attestation","sign_citation":"https://pith.science/pith/EZH63GUIZJN25JRIYI2IJLLVPQ/action/citation_signature","submit_replication":"https://pith.science/pith/EZH63GUIZJN25JRIYI2IJLLVPQ/action/replication_record"}},"created_at":"2026-05-17T23:39:21.881033+00:00","updated_at":"2026-05-17T23:39:21.881033+00:00"}