pith. machine review for the scientific record.

arxiv: 2604.15390 · v3 · submitted 2026-04-16 · 💻 cs.SE · cs.AI

Recognition: unknown

Analyzing Chain of Thought (CoT) Approaches in Control Flow Code Deobfuscation Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:21 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords code deobfuscation · chain of thought prompting · control flow obfuscation · large language models · control flow graph reconstruction · program semantics preservation · reverse engineering · opaque predicates

The pith

Chain-of-thought prompting improves large language models' recovery of control flow and semantics in obfuscated C code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether step-by-step reasoning prompts help large language models undo control-flow obfuscation such as flattening and opaque predicates. It evaluates five models on standard C benchmarks, measuring both how well the original control-flow graph is reconstructed and how faithfully the program's behavior is preserved through output matching. CoT prompting raises performance over direct prompting, with the largest model showing average gains of 16 percent on graph recovery and 20.5 percent on semantic preservation. A reader would care because manual deobfuscation is slow and costly, and better automated assistance could shorten reverse-engineering time for security and maintenance tasks.

Core claim

The paper establishes that Chain-of-Thought prompting, which directs the model through explicit reasoning steps for code analysis, produces higher-quality deobfuscated output than zero-shot prompting. Across the tested models and obfuscation types, this yields clearer control-flow graphs and outputs that match the original program on test inputs more closely. GPT5 records the strongest results, with roughly 16 percent better control-flow graph reconstruction and 20.5 percent better semantic preservation than simple prompting. Results also vary with the original program's control-flow complexity and the specific obfuscator used.

What carries the argument

Chain-of-Thought (CoT) prompting, which supplies the model with explicit, task-specific reasoning steps before producing the deobfuscated code.

If this is right

  • CoT prompting reduces reliance on manual inspection for recovering readable control flow from flattened or predicate-protected code.
  • Model performance scales with both the degree of obfuscation and the intrinsic complexity of the original control-flow graph.
  • The approach supplies more faithful graph reconstructions and behavior-preserving code than direct prompting alone.
  • CoT can serve as an assistive layer that shortens the time needed for reverse engineering while improving code explainability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prompting pattern could be tried on other obfuscation families such as data obfuscation or virtualization-based protection.
  • Hybrid pipelines that feed CoT output into existing static-analysis tools might further raise reconstruction accuracy.
  • Testing on additional languages and larger, real malware samples would show how far the observed gains extend.

Load-bearing premise

That matching program outputs on a fixed set of test inputs is enough to confirm the deobfuscated code still carries the original semantics.

What would settle it

An obfuscated program where the CoT output matches the original on every test input yet produces different behavior on an untested input, or a real-world control-flow-obfuscated binary outside the chosen C benchmark set on which CoT fails to recover the original graph structure.

Figures

Figures reproduced from arXiv: 2604.15390 by Edward Raff, Manas Gaur, Sarvesh Baskar, Seyedreza Mohseni.

Figure 1. The predicate in the highlighted region (red) is invariant and evaluates to …
Figure 2. Control Flow Graph (CFG) of (a) original code, (b) obfuscated code by Tigress with Control Flow Flattening (CFF), and (c) deobfuscated code by DeepSeek-V2, which shows some hallucination.
Figure 3. BLEU and SRS scores for Group 2, experiment 2.
Figure 5. Average BLEU and SRS scores across models.
Figure 6. (Left) Original code snippet. (Right) Deobfuscated code snippet by DeepSeek-V2. Headers and the Node structure declaration are missing. Original code is obfuscated by Tigress.
Figure 7. Illustrative semantic hallucination during code deobfuscation. The original function computes the height of a …
Original abstract

Code deobfuscation is the task of recovering a readable version of a program while preserving its original behavior. In practice, this often requires days or even months of manual work with complex and expensive analysis tools. In this paper, we explore an alternative approach based on Chain-of-Thought (CoT) prompting, where a large language model is guided through explicit, step-by-step reasoning tailored for code analysis. We focus on control flow obfuscation, including Control Flow Flattening (CFF), Opaque Predicates, and their combination, and we measure both structural recovery of the control flow graph and preservation of program semantics. We evaluate five state-of-the-art large language models and show that CoT prompting significantly improves deobfuscation quality compared with simple prompting. We validate our approach on a diverse set of standard C benchmarks and report results using both structural metrics for control flow graphs and semantic metrics based on output similarity. Among the tested models and by applying CoT, GPT5 achieves the strongest overall performance, with an average gain of about 16% in control-flow graph reconstruction and about 20.5% in semantic preservation across our benchmarks compared to zero-shot prompting. Our results also show that model performance depends not only on the obfuscation level and the chosen obfuscator but also on the intrinsic complexity of the original control flow graph. Collectively, these findings suggest that CoT-guided large language models can serve as effective assistants for code deobfuscation, providing improved code explainability, more faithful control flow graph reconstruction, and better preservation of program behavior while potentially reducing the manual effort needed for reverse engineering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that Chain-of-Thought (CoT) prompting substantially improves large language model performance on control-flow deobfuscation tasks (Control Flow Flattening, Opaque Predicates, and combinations) relative to zero-shot prompting. It evaluates five state-of-the-art LLMs on standard C benchmarks, reports that GPT5 with CoT yields average gains of ~16% in control-flow graph reconstruction and ~20.5% in semantic preservation (measured by output similarity), and notes that results vary with obfuscation level and original CFG complexity.

Significance. If the evaluation protocol and semantic metric are shown to be reliable, the work would provide useful empirical evidence that structured prompting can assist automated deobfuscation and reduce manual reverse-engineering effort. The observation that LLM success depends on both obfuscation technique and intrinsic program complexity offers a practical insight for tool builders in software security and program analysis.

major comments (2)
  1. Abstract: the central quantitative claims (~16% CFG reconstruction gain and ~20.5% semantic preservation gain for GPT5) are stated without any description of the experimental protocol, number or identity of benchmarks, number of runs, statistical tests, or error bars, preventing assessment of whether the reported improvements are robust or reproducible.
  2. Abstract: semantic preservation is defined solely via 'output similarity' on test inputs, yet no information is supplied on test-suite coverage, equivalence tolerance, handling of side effects, memory layout, or non-determinism. This proxy is load-bearing for the claim of improved deobfuscation quality, especially under CFF and opaque predicates where unexercised paths may diverge.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract's clarity and the need for more experimental context. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: Abstract: the central quantitative claims (~16% CFG reconstruction gain and ~20.5% semantic preservation gain for GPT5) are stated without any description of the experimental protocol, number or identity of benchmarks, number of runs, statistical tests, or error bars, preventing assessment of whether the reported improvements are robust or reproducible.

    Authors: We agree that the abstract would benefit from additional context on the evaluation setup to support assessment of the claims. In the revised version, we will expand the abstract to note that the results are based on a diverse set of standard C benchmarks, five state-of-the-art LLMs, and averages across multiple runs per configuration, with full details on the protocol, benchmark identities, and any statistical measures provided in the experimental section of the paper. This revision will be made while respecting abstract length limits. revision: yes

  2. Referee: Abstract: semantic preservation is defined solely via 'output similarity' on test inputs, yet no information is supplied on test-suite coverage, equivalence tolerance, handling of side effects, memory layout, or non-determinism. This proxy is load-bearing for the claim of improved deobfuscation quality, especially under CFF and opaque predicates where unexercised paths may diverge.

    Authors: We acknowledge that the abstract's description of the semantic metric is high-level. The full manuscript provides details on the test inputs, coverage considerations, and handling of program behavior in the evaluation methodology section. We will revise the abstract to clarify that semantic preservation is assessed via output equivalence on test suites targeting the original program's observable behavior, including notes on equivalence tolerance and non-determinism where relevant. This addresses the concern without altering the core metric, which we maintain is suitable for verifying deobfuscation quality on exercised paths. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of prompting strategies

full rationale

The paper reports experimental results from evaluating five LLMs on control-flow deobfuscation benchmarks using CoT versus zero-shot prompting. It measures CFG reconstruction accuracy and semantic preservation via output similarity on standard C programs. No equations, derivations, fitted parameters, uniqueness theorems, or self-citations are used to derive the central claims. All quantitative gains (e.g., 16% CFG, 20.5% semantic) are direct empirical observations from the described test suite. The evaluation rests on external benchmarks and contains no load-bearing steps that reduce to prior fitted quantities or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical prompting study with no mathematical model, free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5609 in / 1076 out tokens · 44773 ms · 2026-05-10T11:21:06.477379+00:00 · methodology

discussion (0)

