Exploring Code Analysis: Zero-Shot Insights on Syntax and Semantics with LLMs
Pith reviewed 2026-05-24 08:08 UTC · model grok-4.3
The pith
LLMs reach 90%+ on syntax parsing tasks but stay below 70% on dynamic reasoning with strong project-to-project variability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that LLMs display a consistent capability hierarchy on fundamental code analysis: they generate abstract syntax trees at 90%+ accuracy and match expressions at 84-100%, perform adequately on static-semantics tasks such as control-flow graphs and taint analysis, yet remain below 70% on dynamic-reasoning tasks and exhibit per-project F1 scores that swing from 0 to 1.0. The same ordering appears across model families and sizes, which the authors interpret as evidence of structural rather than transient limits. They further note that LLMs supply cross-language generalization at the cost of nondeterministic outputs, while conventional tools supply deterministic results at the cost of language-specific configuration.
What carries the argument
A three-layer protocol that scores syntax parsing, static-semantics inference, and dynamic reasoning on the same code samples using automated metrics, expert adjudication, and consistency checks.
If this is right
- LLMs can supply cross-language analysis without per-language configuration that traditional tools require.
- LLM outputs on dynamic tasks will need external validation because of high data-shift sensitivity.
- Traditional analyzers remain necessary for tasks where deterministic guarantees matter.
- A hybrid workflow can route syntax and simple static checks to LLMs and route dynamic checks to conventional tools.
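The hybrid workflow in the last point can be sketched as a small dispatcher. This is a minimal illustration, not the authors' tooling: the `Layer` enum, the `TASK_LAYER` table, and the `route` function are hypothetical names, and the choice to send every static task to the LLM (rather than only "simple" ones) is an assumption made to keep the example short.

```python
from enum import Enum


class Layer(Enum):
    SYNTAX = "syntax"
    STATIC = "static"
    DYNAMIC = "dynamic"


# Hypothetical task-to-layer table. Task names follow the abstract;
# the grouping itself is an assumption made for illustration.
TASK_LAYER = {
    "ast_generation": Layer.SYNTAX,
    "expression_matching": Layer.SYNTAX,
    "cfg_construction": Layer.STATIC,
    "data_dependency": Layer.STATIC,
    "taint_analysis": Layer.STATIC,
    "flaky_test_reasoning": Layer.DYNAMIC,
}


def route(task: str) -> str:
    """Return 'llm' or 'conventional' for a given task name."""
    layer = TASK_LAYER.get(task)
    # Unknown and dynamic tasks take the deterministic path.
    if layer is None or layer is Layer.DYNAMIC:
        return "conventional"
    return "llm"
```

Defaulting unknown tasks to the conventional path is a conservative choice: it keeps anything unclassified away from zero-shot answers that would need external validation.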
Where Pith is reading between the lines
- If the hierarchy is fundamental, fine-tuning focused only on dynamic tasks may still leave a residual gap unless the training distribution covers project diversity.
- The observed sensitivity to project shift suggests that LLMs may be capturing surface patterns rather than portable reasoning rules.
- Tool builders could expose an explicit 'dynamic-reasoning' flag so users know when to distrust zero-shot LLM answers.
Load-bearing premise
The nine chosen tasks and 3,124 code samples, scored by the three-layer protocol, give a representative picture of what counts as fundamental code analysis.
What would settle it
A new model that sustains above 80% F1 on the dynamic-reasoning tasks across multiple held-out projects without retraining would contradict the claim of a stable performance hierarchy.
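One way to operationalize this test is to compute F1 separately for each held-out project and check that every project clears the threshold. The sketch below assumes binary labels and a record layout of `(project, gold, predicted)`; the function names and the convention of scoring a project with no positives as 0.0 are assumptions, not details taken from the paper.

```python
from collections import defaultdict


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion counts; 0.0 when undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_project_f1(records):
    """records: iterable of (project, gold: bool, predicted: bool)."""
    counts = defaultdict(lambda: [0, 0, 0])  # tp, fp, fn per project
    for project, gold, pred in records:
        if gold and pred:
            counts[project][0] += 1
        elif pred and not gold:
            counts[project][1] += 1
        elif gold and not pred:
            counts[project][2] += 1
    return {p: f1_score(*c) for p, c in counts.items()}


def sustains_threshold(scores, threshold=0.80):
    """True only if every project scores strictly above the threshold."""
    return bool(scores) and all(s > threshold for s in scores.values())


# Toy data: project "a" is classified perfectly, project "b" not at all.
records = [
    ("a", True, True), ("a", True, True), ("a", False, False),
    ("b", True, False), ("b", False, True),
]
scores = per_project_f1(records)
```

On this toy data the scores are 1.0 for project "a" and 0.0 for project "b", which mirrors the 0 to 1.0 swing described above: a pooled average would hide the failure on "b", while the per-project check exposes it.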
Original abstract
Code analysis is fundamental in Software Engineering, supporting debugging, optimization, and security assessment. Human developers approach it through syntax parsing, static semantics inference, and dynamic reasoning. Traditional tools are effective but limited by language specificity and weak cross-language generalization. Large language models (LLMs) are promising for code tasks, yet their capabilities for fundamental code analysis remain underexplored. We structure our study around three aspects aligned with human practices: syntax parsing, static semantics inference, and dynamic reasoning. We evaluate 21 state-of-the-art LLMs across nine tasks in four languages (C, Java, Python, Solidity), including AST generation, CFG construction, data dependency, taint analysis, and flaky test reasoning. We apply a three-layer evaluation protocol (automated metrics, expert adjudication, consistency validation) to 3,124 code samples, achieving high inter-rater reliability (Cohen's kappa = 0.844-0.936) and strong human-machine agreement (Gwet's AC1 = 0.500-0.727, F1 = 0.791-0.882). While the best LLMs excel in syntax parsing (AST 90%+, expression matching 84-100%) and show promise in static analysis, their dynamic reasoning remains limited (<70%) with high data-shift sensitivity (per-project F1 varying 0-1.0). This hierarchy holds across model families and scales, suggesting fundamental rather than transient limitations. These findings show how LLMs complement traditional analyzers: they offer cross-language generalization but non-deterministic outputs needing validation, while traditional tools give deterministic guarantees but need language-specific configuration. We contribute a validated evaluation framework with comparison against traditional analyzers (Tree-sitter, Soot, Joern) and task-specific applicability tiers. Benchmark: https://github.com/mathieu0905/llm_code_analysis.git
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates 21 state-of-the-art LLMs on nine tasks spanning syntax parsing (e.g., AST generation), static semantics inference (e.g., CFG construction, data dependency, taint analysis), and dynamic reasoning (e.g., flaky test reasoning) across C, Java, Python, and Solidity. Using 3,124 code samples and a three-layer protocol (automated metrics, expert adjudication, consistency validation) with reported high inter-rater reliability (Cohen's kappa 0.844-0.936) and human-machine agreement, it finds LLMs excel at syntax (AST 90%+, expression matching 84-100%), show promise in static analysis, but remain limited in dynamic reasoning (<70%) with high data-shift sensitivity (per-project F1 varying 0-1.0). This hierarchy holds across model families and scales, suggesting fundamental limitations; the work contributes a validated framework, applicability tiers, and benchmark comparing LLMs to traditional tools like Tree-sitter, Soot, and Joern.
Significance. If the evaluation protocol and sample representativeness hold, the results offer concrete empirical grounding for where LLMs can complement (via cross-language generalization) versus where they fall short of traditional deterministic analyzers in software engineering. The open benchmark and three-layer protocol with quantified agreement metrics are strengths that could support reproducible follow-up work on improving dynamic code reasoning.
major comments (1)
- [Abstract] Abstract: The central claim that the observed performance hierarchy (syntax strong, static promising, dynamic limited with data-shift sensitivity) reflects 'fundamental rather than transient limitations' is load-bearing on the representativeness of the nine tasks and 3,124 samples. The abstract provides no detail on sample selection criteria (e.g., distribution across projects, languages, or complexity levels) or how dynamic tasks isolate state reasoning versus prompt-following, leaving open the possibility that the <70% ceiling and 0-1.0 per-project F1 swings are artifacts of task design rather than intrinsic model properties.
minor comments (1)
- [Abstract] Abstract: The reported human-machine agreement metrics (Gwet's AC1 = 0.500-0.727, F1 = 0.791-0.882) are given as ranges without mapping to specific tasks or evaluation layers, reducing clarity on which aspects of the protocol drive the reliability claims.
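For readers comparing the two agreement statistics named in this comment, the sketch below computes Cohen's kappa and Gwet's AC1 in their standard two-rater, two-category forms. It is an illustration under assumptions: the abstract does not say how the paper computed these values, and the function names and toy ratings are hypothetical.

```python
def observed_agreement(a, b):
    """Fraction of items on which the two raters give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa for two raters with 0/1 labels."""
    po = observed_agreement(a, b)
    pa, pb = sum(a) / len(a), sum(b) / len(b)
    # Chance agreement from each rater's own marginal rates.
    pe = pa * pb + (1 - pa) * (1 - pb)
    if pe == 1.0:
        return float("nan")  # undefined when both raters use a single label
    return (po - pe) / (1 - pe)


def gwet_ac1(a, b):
    """Gwet's AC1 for two raters with 0/1 labels."""
    po = observed_agreement(a, b)
    # Chance agreement from the average positive rate across raters.
    pi = (sum(a) / len(a) + sum(b) / len(b)) / 2
    pe = 2 * pi * (1 - pi)
    return (po - pe) / (1 - pe)


# Toy ratings: 8 of 10 items agree, each rater marks 6 positives.
rater_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
rater_b = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
```

With these toy ratings, kappa is about 0.583 and AC1 about 0.615 for the same 80% raw agreement. The two statistics differ only in how they estimate chance agreement, which is one reason ranges for kappa and AC1 are easier to interpret when mapped to specific tasks.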
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The concern about the abstract's brevity and its implications for the central claim is well-taken. We address it point-by-point below and will revise the abstract accordingly.
Referee: [Abstract] Abstract: The central claim that the observed performance hierarchy (syntax strong, static promising, dynamic limited with data-shift sensitivity) reflects 'fundamental rather than transient limitations' is load-bearing on the representativeness of the nine tasks and 3,124 samples. The abstract provides no detail on sample selection criteria (e.g., distribution across projects, languages, or complexity levels) or how dynamic tasks isolate state reasoning versus prompt-following, leaving open the possibility that the <70% ceiling and 0-1.0 per-project F1 swings are artifacts of task design rather than intrinsic model properties.
Authors: We agree the abstract is concise and omits explicit details on sample selection and task isolation. The full manuscript (Sections 3.1–3.2 and 4) specifies that the 3,124 samples were drawn from multiple open-source projects per language, stratified by size and complexity, with dynamic tasks (e.g., flaky-test reasoning) explicitly constructed to require inference over execution history and state changes rather than surface-level prompt compliance. The per-project F1 variance is reported as direct evidence of data-shift sensitivity. To address the referee’s point, we will revise the abstract to include a brief clause on sample distribution and task design. The claim of fundamental limitations is presented as an inference from the consistent hierarchy across 21 models and four scales; we do not claim it is proven, only that the empirical pattern is not explained by transient factors such as model size alone. revision: yes
Circularity Check
No circularity: purely empirical measurements with no derivations or self-referential fits
The paper reports direct experimental results from evaluating 21 LLMs on nine fixed tasks using 3,124 samples and a three-layer protocol (automated metrics, expert adjudication, consistency checks). No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the abstract or described methodology. The hierarchy claim (syntax > static > dynamic) is presented as an observed pattern across models, not as a mathematical consequence of prior definitions or fits. This matches the default case of an empirical study whose central claims rest on external data rather than internal construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Zero-shot prompting is sufficient to elicit and measure core code analysis capabilities in LLMs
- domain assumption: The selected tasks and samples are representative of fundamental code analysis as practiced by humans
Forward citations
Cited by 6 Pith papers
- (How) Do Large Language Models Understand High-Level Message Sequence Charts?
  LLMs achieve only modest understanding of HMSC formal semantics at 52 percent accuracy, performing strongly on basic constructs but weakly on abstractions and traces.
- NeuroFlake: A Neuro-Symbolic LLM Framework for Flaky Test Classification
  NeuroFlake integrates discriminative token mining into LLMs to classify flaky tests, raising F1-score to 69.34% on FlakeBench while showing greater robustness to semantic-preserving perturbations than prior methods.
- LLM-Powered Detection of Price Manipulation in DeFi
  PMDetector is a hybrid static-plus-LLM framework that detects price manipulation in DeFi protocols via taint analysis, defense filtering, attack simulation, and validation, achieving 88% precision and 90% recall on 73...
- A Large Language Model Approach to Generating Bypass Rules for Malware Evasion in Analysis Sandbox
  ABLE uses LLMs with sanitization and iterative refinement to generate bypass YARA rules from malware traces, achieving 79% success on 334 samples and 47% more family detections.
- (How) Do Large Language Models Understand High-Level Message Sequence Charts?
  LLMs achieve only 52% overall accuracy on HMSC semantic tasks, performing well on basic concepts but poorly on abstractions, compositions, and trace calculations.
- CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology
  CodePori is a multi-agent LLM system for code generation whose participant evaluation identifies practical challenges like memory limits and hallucinations missed by binary benchmarks.