BLAST: Benchmarking LLMs with ASP-based Structured Testing
Pith reviewed 2026-05-08 09:45 UTC · model grok-4.3
The pith
BLAST introduces the first dedicated benchmark and dataset for testing how accurately large language models generate Answer Set Programming code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BLAST provides the first dedicated benchmarking methodology and associated dataset for evaluating the accuracy of LLMs in generating ASP code, featuring two novel semantic metrics tailored to ASP code generation and presenting results from an empirical evaluation on ten well-established graph-related problems and eight state-of-the-art LLMs.
What carries the argument
BLAST, the benchmarking methodology that supplies a dataset of graph problems together with two semantic metrics for judging the correctness of generated ASP programs.
If this is right
- Enables direct comparisons of different LLMs on the same set of ASP generation tasks.
- Allows detection of programs that are syntactically valid but semantically incorrect.
- Supplies a reusable dataset that can serve as a baseline for future model improvements.
- Highlights the distinction between syntactic and semantic evaluation for declarative code.
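To make the syntactic/semantic distinction concrete, here is a minimal Python sketch that is not taken from the paper. It assumes a solver has already returned one answer set for a graph-coloring task, represented as a set of `(node, color)` pairs mirroring a `color/2` atom. The function `is_proper_coloring` and the sample graph are hypothetical. The point is that a generated program can parse and ground without error and still produce an answer set that fails a problem-specific check like this one.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]
Assignment = Tuple[str, str]  # (node, color), mirroring a hypothetical color/2 atom


def is_proper_coloring(nodes: Iterable[str], edges: Iterable[Edge],
                       answer_set: Set[Assignment]) -> bool:
    """Return True if the answer set encodes a proper coloring of the graph."""
    color_of = {}
    for node, color in answer_set:
        if node in color_of and color_of[node] != color:
            return False  # a node received two different colors
        color_of[node] = color
    if any(n not in color_of for n in nodes):
        return False  # some node was left uncolored
    # Adjacent nodes must not share a color (edge endpoints are assumed to be in nodes).
    return all(color_of[u] != color_of[v] for u, v in edges)


nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]

good = {("a", "red"), ("b", "green"), ("c", "red")}
bad = {("a", "red"), ("b", "red"), ("c", "green")}  # well-formed atoms, wrong meaning

print(is_proper_coloring(nodes, edges, good))
print(is_proper_coloring(nodes, edges, bad))
```

Both candidate sets are well formed; only the first respects the constraint that adjacent nodes differ, which is the kind of failure a purely syntactic check cannot see.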
Where Pith is reading between the lines
- The same testing structure could be applied to other declarative languages such as Prolog or Datalog.
- Prompt engineering techniques might be refined using the error patterns revealed by the semantic metrics.
- Automated verification tools could be integrated with the benchmark to scale evaluation beyond the initial ten problems.
Load-bearing premise
That the ten selected graph-related problems and the two semantic metrics are representative and sufficient to measure general LLM capability in ASP code generation across broader domains.
What would settle it
An experiment showing that LLMs achieve high accuracy on ASP tasks outside graph problems when evaluated with alternative semantic criteria would directly challenge the benchmark's generality.
Original abstract
Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite evident progress, less attention has been paid to their effectiveness in handling declarative paradigms such as Answer Set Programming (ASP), to date. In this paper we introduce BLAST: The first dedicated benchmarking methodology and associated dataset for evaluating the accuracy of LLMs in generating ASP code. BLAST provides a structured evaluation framework featuring two novel semantic metrics tailored to ASP code generation. The paper presents the results of an empirical evaluation involving ten well-established graph-related problems from the ASP literature and a diverse set of eight state-of-the-art LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BLAST as the first dedicated benchmarking methodology and associated dataset for evaluating the accuracy of LLMs in generating ASP code. It features a structured evaluation framework with two novel semantic metrics tailored to ASP, and reports empirical results from applying the benchmark to ten well-established graph-related problems using a diverse set of eight state-of-the-art LLMs.
Significance. If the metrics and evaluation hold up, this supplies a much-needed structured framework and public dataset for assessing LLM performance on declarative logic programming tasks, an area that has received less attention than imperative or functional code generation. The emphasis on semantic metrics (beyond syntax) and the release of a benchmark/dataset are concrete strengths that can support reproducible follow-on work and comparisons across models.
minor comments (3)
- [Abstract and §1] The abstract and introduction should explicitly list the ten graph-related problems (e.g., by name or reference to the ASP literature) and the eight LLMs evaluated, rather than describing them only generically.
- [§3 or §4] Provide a clear, self-contained definition or pseudocode for the two novel semantic metrics (including how they handle ASP-specific features such as stable models, answer sets, or grounding) in the main text or an appendix, so readers can reproduce the scoring without ambiguity.
- [§5 or §6] Include a brief discussion of potential limitations of restricting the benchmark to graph problems (e.g., whether results generalize to other ASP domains such as planning or constraint satisfaction) to avoid over-interpretation of the empirical findings.
Simulated Author's Rebuttal
We thank the referee for the positive summary, recognition of significance, and recommendation for minor revision. The value of BLAST as a structured framework and public dataset for evaluating LLMs on declarative ASP code generation is appreciated.
Circularity Check
No significant circularity; benchmark and metrics introduced as independent contributions
full rationale
The paper presents BLAST as a new benchmarking methodology, dataset, and two novel semantic metrics for evaluating LLMs on ASP code generation, with results shown on ten graph problems. No derivation chain, equations, predictions, or fitted parameters are described that reduce to self-referential inputs or prior self-citations. The central claims concern the provision of a structured framework and empirical evaluation, which stand as self-contained contributions without requiring external uniqueness theorems or ansatzes from the authors' prior work. This matches the expected honest non-finding for a benchmarking paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: ASP semantics are well defined and allow reliable equivalence checking between generated programs and expected answers
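One way to read this assumption is as a comparison between two collections of answer sets. The sketch below is illustrative only and does not reproduce the paper's two metrics, which this summary does not define. It assumes answer sets are already available as lists of atom strings such as `sel(a)`, and that only a chosen set of output predicates matters; the helpers `predicate_name`, `project`, and `same_answer_sets` are hypothetical names, not part of any solver API.

```python
from typing import FrozenSet, Iterable, Set


def predicate_name(atom: str) -> str:
    """Return the predicate symbol of an atom string, e.g. 'edge' for 'edge(a,b)'."""
    return atom.split("(", 1)[0].strip()


def project(answer_sets: Iterable[Iterable[str]],
            output_predicates: Set[str]) -> Set[FrozenSet[str]]:
    """Keep only atoms over the output predicates; auxiliary atoms are dropped."""
    return {
        frozenset(a.replace(" ", "") for a in model
                  if predicate_name(a) in output_predicates)
        for model in answer_sets
    }


def same_answer_sets(generated, expected, output_predicates) -> bool:
    """True when both collections agree on every projected answer set."""
    return project(generated, output_predicates) == project(expected, output_predicates)


expected = [["sel(a)", "sel(b)"], ["sel(c)"]]
generated = [["sel(c)", "aux(1)"], ["sel(b)", "sel(a)", "aux(2)"]]

print(same_answer_sets(generated, expected, {"sel"}))
print(same_answer_sets(generated, expected, {"sel", "aux"}))
```

Projecting onto output predicates is a design choice: it lets a generated program use different auxiliary atoms than the reference encoding without being penalized, at the cost of having to declare which predicates count as output.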
invented entities (1)
- BLAST benchmark and dataset: no independent evidence