
arxiv: 2604.22306 · v1 · submitted 2026-04-24 · 💻 cs.LO · cs.AI · cs.PL

BLAST: Benchmarking LLMs with ASP-based Structured Testing

Pith reviewed 2026-05-08 09:45 UTC · model grok-4.3

classification 💻 cs.LO · cs.AI · cs.PL
keywords Answer Set Programming · Large Language Models · Code Generation · Benchmarking · Semantic Metrics · Graph Problems

The pith

BLAST introduces the first dedicated benchmark and dataset for testing how accurately large language models generate Answer Set Programming code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BLAST as a structured evaluation framework and dataset aimed at measuring LLMs' performance on generating code in Answer Set Programming, a declarative paradigm. It features two novel semantic metrics that assess functional correctness rather than surface syntax alone. The methodology applies this to ten established graph problems drawn from ASP literature and runs experiments across eight state-of-the-art models. A sympathetic reader would care because LLMs are widely used for code tasks yet lack systematic ways to judge their output in non-imperative languages like ASP.

Core claim

BLAST provides the first dedicated benchmarking methodology and associated dataset for evaluating the accuracy of LLMs in generating ASP code, featuring two novel semantic metrics tailored to ASP code generation and presenting results from an empirical evaluation on ten well-established graph-related problems and eight state-of-the-art LLMs.

What carries the argument

BLAST, the benchmarking methodology that supplies a dataset of graph problems together with two semantic metrics for judging the correctness of generated ASP programs.

If this is right

  • Enables direct comparisons of different LLMs on the same set of ASP generation tasks.
  • Allows detection of programs that are syntactically valid but semantically incorrect.
  • Supplies a reusable dataset that can serve as a baseline for future model improvements.
  • Highlights the distinction between syntactic and semantic evaluation for declarative code.
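
The gap between syntactic and semantic evaluation can be illustrated with a small sketch. It is not the paper's metric: the summary above does not define the two BLAST metrics, so the comparison functions here (`exact_match`, `jaccard`) are hypothetical stand-ins, and the answer sets are enumerated by brute force in Python rather than computed by an ASP solver. The example uses 2-coloring of a three-node path and compares a reference encoding with a "generated" one that is well-formed but omits the edge constraint.

```python
from itertools import product

# Toy instance (an assumption for illustration): path graph 1 - 2 - 3, two colors.
NODES = (1, 2, 3)
EDGES = ((1, 2), (2, 3))
COLORS = ("r", "g")


def colorings(enforce_edges):
    """Enumerate answer sets as frozensets of color(node,c) atoms.

    With enforce_edges=False this mimics a program that parses fine
    but forgot the constraint forbidding equal colors on adjacent nodes.
    """
    models = set()
    for assignment in product(COLORS, repeat=len(NODES)):
        color_of = dict(zip(NODES, assignment))
        if enforce_edges and any(color_of[u] == color_of[v] for u, v in EDGES):
            continue
        models.add(frozenset(f"color({n},{c})" for n, c in color_of.items()))
    return models


def exact_match(expected, produced):
    """Hypothetical strict check: both programs yield the same answer sets."""
    return expected == produced


def jaccard(expected, produced):
    """Hypothetical graded check: overlap between the two collections of answer sets."""
    union = expected | produced
    return len(expected & produced) / len(union) if union else 1.0


reference = colorings(enforce_edges=True)   # 2 proper colorings
buggy = colorings(enforce_edges=False)      # all 8 assignments

print(exact_match(reference, buggy))  # False
print(jaccard(reference, buggy))      # 0.25
```

A purely syntactic check would accept both encodings, while either semantic comparison separates them. In a real setup the two collections of answer sets would come from running an ASP solver on the reference and generated programs over the same instances.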

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same testing structure could be applied to other declarative languages such as Prolog or Datalog.
  • Prompt engineering techniques might be refined using the error patterns revealed by the semantic metrics.
  • Automated verification tools could be integrated with the benchmark to scale evaluation beyond the initial ten problems.

Load-bearing premise

That the ten selected graph-related problems and the two semantic metrics are representative and sufficient to measure general LLM capability in ASP code generation across broader domains.

What would settle it

An experiment showing that LLMs achieve high accuracy on ASP tasks outside graph problems when evaluated with alternative semantic criteria would directly challenge the benchmark's generality.

Figures

Figures reproduced from arXiv: 2604.22306 by Erica Coppolillo, Francesco Calimeri, Francesco Ricca, Giuseppe Manco, Manuel Alejandro Borroto Santana, Simona Perri.

Figure 1: Scheme of the overall proposed framework.
Figure 2: Semantic comparison on the considered problems com...
Figure 3: Performance of GPT-4o as predicate matcher. Error bars...
Figure 5: Syntactic and semantic performance obtained by the eval...
Figure 6: Comparison of the semantic accuracy computed via the...
Figure 8: Comparison between the two semantic metrics in terms...
Original abstract

Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite evident progress, less attention has been paid to their effectiveness in handling declarative paradigms such as Answer Set Programming (ASP), to date. In this paper we introduce BLAST: The first dedicated benchmarking methodology and associated dataset for evaluating the accuracy of LLMs in generating ASP code. BLAST provides a structured evaluation framework featuring two novel semantic metrics tailored to ASP code generation. The paper presents the results of an empirical evaluation involving ten well-established graph-related problems from the ASP literature and a diverse set of eight state-of-the-art LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces BLAST as the first dedicated benchmarking methodology and associated dataset for evaluating the accuracy of LLMs in generating ASP code. It features a structured evaluation framework with two novel semantic metrics tailored to ASP, and reports empirical results from applying the benchmark to ten well-established graph-related problems using a diverse set of eight state-of-the-art LLMs.

Significance. If the metrics and evaluation hold up, this supplies a much-needed structured framework and public dataset for assessing LLM performance on declarative logic programming tasks, an area that has received less attention than imperative or functional code generation. The emphasis on semantic metrics (beyond syntax) and the release of a benchmark/dataset are concrete strengths that can support reproducible follow-on work and comparisons across models.

minor comments (3)
  1. [Abstract and §1] The abstract and introduction should explicitly list the ten graph-related problems (e.g., by name or reference to the ASP literature) and the eight LLMs evaluated, rather than describing them only generically.
  2. [§3 or §4] Provide a clear, self-contained definition or pseudocode for the two novel semantic metrics (including how they handle ASP-specific features such as stable models, answer sets, or grounding) in the main text or an appendix, so readers can reproduce the scoring without ambiguity.
  3. [§5 or §6] Include a brief discussion of potential limitations of restricting the benchmark to graph problems (e.g., whether results generalize to other ASP domains such as planning or constraint satisfaction) to avoid over-interpretation of the empirical findings.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of significance, and recommendation for minor revision. The value of BLAST as a structured framework and public dataset for evaluating LLMs on declarative ASP code generation is appreciated.

Circularity Check

0 steps flagged

No significant circularity; benchmark and metrics introduced as independent contributions


The paper presents BLAST as a new benchmarking methodology, dataset, and two novel semantic metrics for evaluating LLMs on ASP code generation, with results shown on ten graph problems. No derivation chain, equations, predictions, or fitted parameters are described that reduce to self-referential inputs or prior self-citations. The central claims concern the provision of a structured framework and empirical evaluation, which stand as self-contained contributions without requiring external uniqueness theorems or ansatzes from the authors' prior work. This matches the expected honest non-finding for a benchmarking paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The contribution rests on the assumption that graph problems adequately sample ASP challenges and that semantic metrics can be reliably computed; no free parameters or invented physical entities are involved.

axioms (1)
  • domain assumption: ASP semantics are well-defined and allow reliable equivalence checking between generated programs and expected answers
    The two semantic metrics depend on this property of ASP.
invented entities (1)
  • BLAST benchmark and dataset (no independent evidence)
    purpose: Structured evaluation of LLMs on ASP code generation
    New methodology and data introduced by the paper

pith-pipeline@v0.9.0 · 5425 in / 1072 out tokens · 36220 ms · 2026-05-08T09:45:33.336752+00:00 · methodology

