Exploring Code Analysis: Zero-Shot Insights on Syntax and Semantics with LLMs

Lms: Understanding code syntax, semantics for code analysis · 2023 · cs.SE · arXiv 2305.12138

7 Pith papers cite this work. Polarity classification is still indexing.

abstract

Code analysis is fundamental in Software Engineering, supporting debugging, optimization, and security assessment. Human developers approach it through syntax parsing, static semantics inference, and dynamic reasoning. Traditional tools are effective but limited by language specificity and weak cross-language generalization. Large language models (LLMs) are promising for code tasks, yet their capabilities for fundamental code analysis remain underexplored. We structure our study around three aspects aligned with human practices: syntax parsing, static semantics inference, and dynamic reasoning. We evaluate 21 state-of-the-art LLMs across nine tasks in four languages (C, Java, Python, Solidity), including AST generation, CFG construction, data dependency, taint analysis, and flaky test reasoning. We apply a three-layer evaluation protocol (automated metrics, expert adjudication, consistency validation) to 3,124 code samples, achieving high inter-rater reliability (Cohen's kappa = 0.844-0.936) and strong human-machine agreement (Gwet's AC1 = 0.500-0.727, F1 = 0.791-0.882). While the best LLMs excel in syntax parsing (AST 90%+, expression matching 84-100%) and show promise in static analysis, their dynamic reasoning remains limited (<70%) with high data-shift sensitivity (per-project F1 varying 0-1.0). This hierarchy holds across model families and scales, suggesting fundamental rather than transient limitations. These findings show how LLMs complement traditional analyzers: they offer cross-language generalization but non-deterministic outputs needing validation, while traditional tools give deterministic guarantees but need language-specific configuration. We contribute a validated evaluation framework with comparison against traditional analyzers (Tree-sitter, Soot, Joern) and task-specific applicability tiers. Benchmark: https://github.com/mathieu0905/llm_code_analysis.git
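The abstract reports two chance-corrected agreement statistics, Cohen's kappa and Gwet's AC1. As a rough illustration of how such figures are commonly computed for two raters over nominal labels, here is a minimal sketch using the standard textbook definitions. The function names and the toy labels are hypothetical and are not taken from the paper or its benchmark repository.

```python
from collections import Counter


def _observed_agreement(a, b):
    """Fraction of items on which the two raters give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: chance agreement from each rater's own marginals."""
    po = _observed_agreement(a, b)
    n = len(a)
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))
    # If both raters always use one identical label, kappa is undefined;
    # this sketch returns 1.0 in that degenerate case (an assumption).
    return 1.0 if pe == 1.0 else (po - pe) / (1.0 - pe)


def gwet_ac1(a, b):
    """Gwet's AC1: chance agreement from the averaged category prevalence."""
    po = _observed_agreement(a, b)
    n = len(a)
    ca, cb = Counter(a), Counter(b)
    cats = set(a) | set(b)
    if len(cats) < 2:
        return 1.0  # only one category observed; same degenerate case as above
    pe = sum(
        ((ca[k] + cb[k]) / (2 * n)) * (1 - (ca[k] + cb[k]) / (2 * n))
        for k in cats
    ) / (len(cats) - 1)
    return (po - pe) / (1.0 - pe)


if __name__ == "__main__":
    # Hypothetical, heavily skewed labels: the raters agree on 8 of 10 items.
    human = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
    model = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
    print("kappa:", round(cohen_kappa(human, model), 3))
    print("AC1:  ", round(gwet_ac1(human, model), 3))
```

On these skewed toy labels the raters agree on 80% of items, yet kappa comes out slightly negative (about -0.11) while AC1 is about 0.76. That sensitivity of kappa to label prevalence is one common reason for reporting AC1 alongside it.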

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Large Language Models for Multi-Lingual Equivalent Mutant Detection: An Extended Empirical Study

cs.SE · 2026-07-01 · unverdicted · novelty 6.0

LLM-based methods achieve higher F1-scores than traditional approaches for equivalent mutant detection in Java and C, with fine-tuned code embeddings performing best and showing cross-lingual generalization.

(How) Do Large Language Models Understand High-Level Message Sequence Charts?

cs.SE · 2026-05-13 · conditional · novelty 6.0 · 2 refs

LLMs achieve only modest understanding of HMSC formal semantics at 52 percent accuracy, performing strongly on basic constructs but weakly on abstractions and traces.

NeuroFlake: A Neuro-Symbolic LLM Framework for Flaky Test Classification

cs.SE · 2026-05-12 · unverdicted · novelty 6.0

NeuroFlake integrates discriminative token mining into LLMs to classify flaky tests, raising F1-score to 69.34% on FlakeBench while showing greater robustness to semantic-preserving perturbations than prior methods.

LLM-Powered Detection of Price Manipulation in DeFi

cs.CR · 2025-10-24 · unverdicted · novelty 6.0

PMDetector is a hybrid static-plus-LLM framework that detects price manipulation in DeFi protocols via taint analysis, defense filtering, attack simulation, and validation, achieving 88% precision and 90% recall on 73 vulnerable plus 288 benign contracts.

A Large Language Model Approach to Generating Bypass Rules for Malware Evasion in Analysis Sandbox

cs.CR · 2026-05-20 · unverdicted · novelty 5.0

ABLE uses LLMs with sanitization and iterative refinement to generate bypass YARA rules from malware traces, achieving 79% success on 334 samples and 47% more family detections.

MAVEN: Improving Generalization in Agentic Tool Calling

cs.AI · 2026-05-29 · unverdicted · novelty 4.0

MAVEN is a modular verification scaffold that lifts an open 120b model's tool-calling accuracy from 48% to 71% on MAVEN-Bench without retraining.

CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology

cs.SE · 2024-02-02 · unverdicted · novelty 4.0

CodePori is a multi-agent LLM system for code generation whose participant evaluation identifies practical challenges like memory limits and hallucinations missed by binary benchmarks.

citing papers explorer

Showing 7 of 7 citing papers.

Large Language Models for Multi-Lingual Equivalent Mutant Detection: An Extended Empirical Study

cs.SE · 2026-07-01 · unverdicted · none · ref 62 · internal anchor

(How) Do Large Language Models Understand High-Level Message Sequence Charts?

cs.SE · 2026-05-13 · conditional · none · ref 9 · 2 links · internal anchor

NeuroFlake: A Neuro-Symbolic LLM Framework for Flaky Test Classification

cs.SE · 2026-05-12 · unverdicted · none · ref 30 · internal anchor

LLM-Powered Detection of Price Manipulation in DeFi

cs.CR · 2025-10-24 · unverdicted · none · ref 53 · internal anchor

A Large Language Model Approach to Generating Bypass Rules for Malware Evasion in Analysis Sandbox

cs.CR · 2026-05-20 · unverdicted · none · ref 61 · internal anchor

MAVEN: Improving Generalization in Agentic Tool Calling

cs.AI · 2026-05-29 · unverdicted · none · ref 7 · internal anchor

CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology

cs.SE · 2024-02-02 · unverdicted · none · ref 52 · internal anchor
