
arxiv: 2605.09997 · v2 · pith:FFP2KWYVnew · submitted 2026-05-11 · 💻 cs.SI · cs.SE

GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation

Pith reviewed 2026-05-20 23:08 UTC · model grok-4.3

classification 💻 cs.SI cs.SE
keywords: graph generation · LLM benchmark · prompting strategies · capability gaps · instruction following · complexity levels · graph synthesis · verification-guided iteration

The pith

A progressive benchmark localizes LLM graph-generation failures to multi-constraint composition; verification-guided prompting exceeds the prompt-engineering ceiling on the tested models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents GraphInstruct, a benchmark that tests LLMs on generating graphs from natural-language instructions, organized into six levels of increasing complexity and scored along five dimensions. It pairs 800 hand-authored instructions with 1,582 algorithmically synthesized reference graphs to pinpoint where models stop following the stated constraints. The results indicate that composing several constraints separates models more sharply than deeper reasoning does, and that no standard prompting strategy helps consistently across levels and model families. Building on these signals, the authors develop an iterative approach that verifies each output and adapts the prompt to the violated constraints, and they report that it exceeds the best conventional prompting result on the tested models. Accurate graph generation matters for network analysis, molecular design, and knowledge-graph construction.

Core claim

By organizing graph synthesis tasks into six progressive complexity levels paired with five evaluation dimensions, GraphInstruct localizes where LLMs fail during instruction-following generation. Evaluations across twelve models and forty-five (model, strategy) configurations demonstrate that the most distinguishing failures occur during multi-constraint composition rather than reasoning depth, with domain-semantic constraints proving resistant to iterative improvement. A verification-guided iterative framework employing constraint-aware adaptive prompting consistently exceeds the performance limits of conventional prompt engineering on the tested models.

What carries the argument

The verification-guided iterative framework with constraint-aware adaptive prompting, which uses output verification to dynamically adjust prompts and better meet the specified graph constraints at each complexity level.
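A verify-then-re-prompt loop of this kind might be organized roughly as follows. This is a minimal sketch under stated assumptions, not the paper's framework: `Graph`, `verify`, `refine`, and `toy_generate` are hypothetical names, the two constraints (node count and maximum degree) are toy examples, and `toy_generate` is a stand-in for an LLM call. The default of five rounds is an arbitrary choice for illustration.

```python
from typing import Callable, List, NamedTuple, Set, Tuple


class Graph(NamedTuple):
    n: int                        # number of nodes, labelled 0..n-1
    edges: Set[Tuple[int, int]]   # undirected edges


def verify(graph: Graph, n_nodes: int, max_degree: int) -> List[str]:
    """Return human-readable violations; an empty list means every check passed."""
    violations = []
    if graph.n != n_nodes:
        violations.append(f"expected {n_nodes} nodes, got {graph.n}")
    degree = {}
    for u, v in graph.edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    for node in sorted(degree):
        if degree[node] > max_degree:
            violations.append(
                f"node {node} has degree {degree[node]}, limit is {max_degree}"
            )
    return violations


def refine(generate: Callable[[str], Graph], instruction: str,
           n_nodes: int, max_degree: int, max_rounds: int = 5):
    """Generate, verify, and re-prompt with the violations until clean or out of rounds."""
    prompt = instruction
    graph, violations = None, []
    for round_idx in range(1, max_rounds + 1):
        graph = generate(prompt)
        violations = verify(graph, n_nodes, max_degree)
        if not violations:
            break
        # Constraint-aware feedback: the next prompt names each failed check.
        prompt = (instruction + "\nThe previous attempt violated:\n- "
                  + "\n- ".join(violations))
    return graph, round_idx, violations


def toy_generate(prompt: str) -> Graph:
    """Stand-in for an LLM: returns a star first, a path once feedback is present."""
    if "violated" in prompt:
        return Graph(4, {(0, 1), (1, 2), (2, 3)})
    return Graph(4, {(0, 1), (0, 2), (0, 3)})


graph, rounds, remaining = refine(
    toy_generate, "Build a 4-node graph with maximum degree 2.", 4, 2
)
```

The point of the design is that the verifier is ordinary code, so the feedback it produces is exact rather than a model's self-assessment. Constraints that cannot be checked programmatically would not benefit from this loop.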

If this is right

  • Discriminative power of the benchmark reaches its peak when evaluating multi-constraint composition tasks.
  • No prompting strategy proves superior across every complexity level and every model family tested.
  • Constraints involving domain semantics stay difficult to satisfy even after multiple iterations of prompting.
  • The detailed breakdown from the benchmark supports the creation of improved graph generation techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Additional progress in this area may require integrating external retrieval systems to supply missing domain knowledge rather than relying solely on model iteration.
  • The layered complexity design could help diagnose similar instruction-following problems in other structured output domains such as molecular formulas or social network models.
  • Public release of the instructions and references invites independent checks on whether the identified gaps persist in newer or larger language models.

Load-bearing premise

The manually defined six complexity levels and five evaluation dimensions successfully isolate separate LLM capability gaps without creating unintended measurement biases or overlaps.

What would settle it

If evaluations using newly authored instructions at the same six levels produce different patterns of model failures or if the iterative framework shows no consistent gains on additional LLMs.

Figures

Figures reproduced from arXiv: 2605.09997 by Changjun Jiang, Sheng Xiang, Ying Zhang, Zihe Wei.

Figure 1: The GraphInstruct benchmark framework. The Progressive Instruction Layer (L0–L5) …
Figure 2: GraphInstruct dataset overview. Left: per-level instruction count. Center: graph-size …
Figure 3: Per-level Quality by capability tier, averaged over the 45 (model, strategy) configurations in …
Figure 3: Per-level Quality by capability tier (Tab. 12 values). The T1–T3 gap at L2 (0.219) is 1.8–3 …
Figure 4: Per-instruction D1 standard deviation by level, averaged over 10 zero-shot models. L2 …
Figure 5: Capability-gap case study at L2 (instruction L2-143). Reference (left) and Sonnet-4.6 …
Figure 6: Prompt sensitivity (σstrat, y-axis) vs. base capability (mean Q, x-axis) across the 11 fully-evaluated models (Sonnet-4 excluded, zero-shot-only). The 4× gap between weakest T3 models (σstrat = 0.074) and most prompt-stable T2 models (σstrat = 0.019) establishes an inverse-scaling relation; the solid line is an OLS fit (R² = 0.62). Implications. Prompt-engineering budgets should scale inversely with model …
Figure 6: Prompt sensitivity (σstrat, y-axis; population std of Q across the four strategies) vs. base capability (mean Q, x-axis) across the 11 fully-evaluated models (Sonnet-4 excluded, zero-shot-only). The 4× gap between weakest T3 models (σstrat ≈ 0.027, equivalent Q-range ≈ 0.073) and most prompt-stable T2 models (σstrat ≈ 0.008, range ≈ 0.019) establishes an inverse-scaling trend; the solid OLS line has slope −0…
Figure 7: Signed strategy × level effect heatmap (average over the 11 fully-evaluated models). Few-shot is net-negative at L2 (−0.034) and net-positive at L4 (+0.069); few-CoT swings from net-negative at L3 (−0.048) to net-positive at L5 (+0.045). Aggregate benchmarks mask these opposite-signed effects. … savior at L4, where domain examples convey structural priors the instruction alone cannot. Few-CoT is savior at L5 …
Figure 7: Signed strategy × level effect heatmap (average over the 11 fully-evaluated models). Few-shot is net-negative at L2 (−0.034) and net-positive at L4 (+0.069); few-CoT swings from net-negative at L3 (−0.048) to net-positive at L5 (+0.045). Aggregate benchmarks mask these opposite-signed effects. Mechanism. No strategy dominates: every non-trivial strategy is net-harmful at at least one level and net-helpful …
Figure 8: Signed CoT effect by model family. Qwen3.5 gains uniformly across scales ( …
Figure 9: Qwen3.5 scale family (35B / 122B / 397B) per-level Quality. Scaling monotonically …
Figure 10: Pareto frontier over 45 baseline (model, strategy) configurations in …
Figure 11: Frontier Distance across 45 baseline configurations, sorted ascending. Top: 6 Pareto …
Figure 12: Method × model Quality with per-model Oracle reference line. Combined surpasses Oracle by +0.035–+0.050 on every target model; VGIG-only contributes the majority of the gain. Mechanism. Prompt engineering has a measurable empirical ceiling; external programmatic verification, not prompt phrasing, is the binding mechanism for reliable structured graph generation. The margin is robust: +0.035 is 7× the ±0.005 …
Figure 13: E6 feedback-granularity ablation on GPT-4o-mini at …
Figure 14: E5 rounds-saturation curve. Quality improves substantially from …
Figure 15: L4 quality across T ∈ {1, 2, 3, 5, 7, 10, 15, 20} for fine/coarse/none feedback (24 configurations). Flat at 0.750–0.754, indicating semantic-constraint failure is a structurally distinct mode iterative refinement cannot address. Mechanism. Two conclusions follow. First, the effective refinement horizon on verifiable graph constraints is ∼5 rounds, markedly shorter than text-domain self-refine budgets of …
Figure 16: L4 per-dimension decomposition across 10 zero-shot models. D1 (structural), D3 …
Figure 17: Per-level capability profiles for six representative models (zero-shot). Each axis shows …
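The Figure 6 caption defines prompt sensitivity σstrat as the population standard deviation of Q across the four strategies and quotes an equivalent Q-range alongside it. The sketch below computes both for one model. The strategy names and scores are made up for illustration and are not taken from the paper; `prompt_sensitivity` is a hypothetical helper.

```python
import statistics
from typing import Dict, Tuple


def prompt_sensitivity(quality_by_strategy: Dict[str, float]) -> Tuple[float, float]:
    """Return (population std, max-min range) of Quality across strategies."""
    scores = list(quality_by_strategy.values())
    # pstdev divides by N (population), not N-1 (sample).
    return statistics.pstdev(scores), max(scores) - min(scores)


# Hypothetical Quality scores for a single model; illustrative values only.
example = {"zero-shot": 0.70, "few-shot": 0.74, "cot": 0.68, "few-cot": 0.72}
sigma_strat, q_range = prompt_sensitivity(example)
```

The range is never smaller than twice the population standard deviation (Popoviciu's inequality), so the two quantities are not interchangeable and should be labelled separately when reported.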
Original abstract

Graph-structured data underpins applications from citation analysis and social-network modeling to molecular design and knowledge-graph construction, and Large Language Models (LLMs) are increasingly used as prompt-driven graph synthesizers. Classical graph-generation reviews catalog deep generative models and their evaluation primitives, but predate the LLM era and provide no foundation for evaluating instruction-following graph synthesis. Recent LLM-era benchmarks evaluate models along graph-type or task-domain axes; such organizations, however, average over structural complexity and cannot localize where in the complexity spectrum an LLM breaks down. To close this diagnostic gap, we introduce GraphInstruct, a progressive-complexity benchmark that stratifies LLM graph generation into six complexity levels and five evaluation dimensions, paired with 800 hand-authored instructions, 1,582 algorithmically synthesized reference solutions, and a 12-LLM capability evaluation across 45 (model, strategy) configurations. We find that discriminative power peaks at multi-constraint composition rather than reasoning depth, that no single prompting strategy dominates across levels or model families, and that domain-semantic constraints remain iteration-invariant under all tested methods -- pointing to retrieval rather than additional compute as the next research frontier. Atop the benchmark, a verification-guided iterative framework with constraint-aware adaptive prompting consistently surpasses the prompt-engineering ceiling on tested target models, demonstrating that the benchmark's fine-grained signals drive method development. Data, code, and reproducibility artifacts are released alongside the paper at https://github.com/AI4DataSynth/GraphInstruct_formal

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces GraphInstruct, a progressive benchmark for diagnosing capability gaps in LLM-based graph generation. It stratifies tasks into six complexity levels and five evaluation dimensions using 800 hand-authored instructions and 1,582 algorithmically synthesized reference solutions. The paper evaluates 12 LLMs across 45 (model, strategy) configurations, reports findings on discriminative power peaking at multi-constraint composition, iteration invariance of domain-semantic constraints, and introduces a verification-guided iterative framework with constraint-aware adaptive prompting that outperforms standard prompt engineering.

Significance. If the benchmark's complexity levels and dimensions accurately isolate distinct capability gaps without biases or overlaps, this work provides a valuable diagnostic tool for advancing LLM graph synthesis capabilities beyond existing task-domain or graph-type benchmarks. The empirical findings on where models break down and the public release of data, code, and reproducibility artifacts at the GitHub repository are strengths that could guide targeted method development in structured data generation.

major comments (3)
  1. [Benchmark Construction] Benchmark Construction section: The six progressive complexity levels and five evaluation dimensions are constructed from hand-authored instructions and algorithmically synthesized references, but no explicit validation (e.g., correlation analysis, orthogonality tests, or overlap checks between multi-constraint composition and domain-semantic requirements) is described to confirm they isolate distinct capability gaps without unintended biases. This is load-bearing for the central claim that the benchmark supplies fine-grained, non-confounded signals.
  2. [Reference Synthesis] Reference Synthesis subsection: The 1,582 algorithmically synthesized reference solutions are used as ground truth for evaluation, yet no manual verification, correctness sampling, or fidelity checks against the instructions are reported. This directly affects the reliability of the reported discriminative power and iteration-invariance findings.
  3. [Results and Analysis] Results and Analysis section: The claim that 'discriminative power peaks at multi-constraint composition rather than reasoning depth' and that 'no single prompting strategy dominates' requires the specific quantitative metric (e.g., accuracy delta, statistical test) and per-level breakdown used to establish the peak and dominance patterns.
minor comments (2)
  1. [Abstract] The abstract states that domain-semantic constraints 'remain iteration-invariant under all tested methods' but does not clarify whether this holds uniformly across the six complexity levels or only in aggregate.
  2. [Related Work] The paper would benefit from an explicit comparison table contrasting GraphInstruct against prior LLM graph benchmarks along the axes of complexity stratification and diagnostic granularity.
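The correlation analysis requested in major comment 1 could take a form like the following: for each pair of levels, rank-correlate the per-model scores and look for pairs that move together so closely that they may be measuring the same capability. This is a sketch only. Spearman correlation is chosen here for illustration (the report does not name a statistic), the scores are invented, and `ranks`, `pearson`, and `spearman` are hypothetical helpers written with the standard library.

```python
from itertools import combinations
from statistics import mean
from typing import List


def ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def pearson(x: List[float], y: List[float]) -> float:
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores: one list per level, one entry per model (same model order).
scores = {
    "L0": [0.9, 0.8, 0.7, 0.6],
    "L2": [0.6, 0.5, 0.3, 0.2],
    "L4": [0.5, 0.7, 0.4, 0.6],
}
level_corr = {
    (a, b): spearman(scores[a], scores[b]) for a, b in combinations(scores, 2)
}
```

A pair with correlation near 1 orders the models identically and adds little diagnostic information. A pair near 0 suggests the two levels probe different capabilities, at least for the models sampled.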

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important aspects of benchmark validation and empirical reporting that we address point by point below. We have prepared revisions to strengthen the manuscript accordingly.

Point-by-point responses
  1. Referee: [Benchmark Construction] Benchmark Construction section: The six progressive complexity levels and five evaluation dimensions are constructed from hand-authored instructions and algorithmically synthesized references, but no explicit validation (e.g., correlation analysis, orthogonality tests, or overlap checks between multi-constraint composition and domain-semantic requirements) is described to confirm they isolate distinct capability gaps without unintended biases. This is load-bearing for the central claim that the benchmark supplies fine-grained, non-confounded signals.

    Authors: We agree that explicit empirical validation would strengthen the central claim. The levels were designed following a logical progression grounded in graph-theoretic notions of structural complexity (e.g., from single-edge to multi-constraint compositions), and the five dimensions were chosen to separate structural, semantic, and constraint-based aspects. However, we did not report correlation or orthogonality statistics in the original submission. In the revised manuscript we will add a new subsection with pairwise correlation analysis across levels, overlap statistics between multi-constraint and domain-semantic instructions, and a brief orthogonality check using instruction embedding similarity. These additions will be placed in the Benchmark Construction section. revision: yes

  2. Referee: [Reference Synthesis] Reference Synthesis subsection: The 1,582 algorithmically synthesized reference solutions are used as ground truth for evaluation, yet no manual verification, correctness sampling, or fidelity checks against the instructions are reported. This directly affects the reliability of the reported discriminative power and iteration-invariance findings.

    Authors: The referee is correct that no manual verification sampling was described. The references were generated via a deterministic algorithmic pipeline that directly implements the instructions using standard graph libraries, with built-in consistency checks for basic validity (e.g., node/edge counts and constraint satisfaction). To address the concern, we will add a paragraph reporting a random sample of 100 references that were manually inspected for fidelity to the corresponding instructions, along with the observed error rate. This verification procedure and its results will be included in the revised Reference Synthesis subsection. revision: yes

  3. Referee: [Results and Analysis] Results and Analysis section: The claim that 'discriminative power peaks at multi-constraint composition rather than reasoning depth' and that 'no single prompting strategy dominates' requires the specific quantitative metric (e.g., accuracy delta, statistical test) and per-level breakdown used to establish the peak and dominance patterns.

    Authors: We will clarify the supporting evidence. Discriminative power was quantified using the range of accuracy scores across models at each level (max–min accuracy delta), and the peak at multi-constraint composition was identified by comparing these deltas across the six levels. The statement that no single prompting strategy dominates is based on the observation that the best-performing strategy varies by model family and level, with no strategy achieving top rank in more than two levels. In the revision we will insert the exact delta values, a table with per-level accuracy ranges, and the per-strategy ranking breakdown to make these claims fully traceable. revision: yes
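The two quantities named in the third response are simple to state in code. The sketch below computes the max–min accuracy range per level and counts how often each strategy ranks first. All numbers, model names, and strategy names are invented for illustration; none come from the paper, and the helper name is hypothetical.

```python
from collections import Counter
from typing import Dict


def discriminative_range(per_model: Dict[str, float]) -> float:
    """Spread between the best and worst model at one level."""
    return max(per_model.values()) - min(per_model.values())


# Hypothetical accuracy[level][model].
accuracy = {
    "L0": {"A": 0.95, "B": 0.93, "C": 0.90},
    "L2": {"A": 0.80, "B": 0.62, "C": 0.55},
    "L5": {"A": 0.50, "B": 0.45, "C": 0.40},
}
ranges = {level: discriminative_range(m) for level, m in accuracy.items()}
peak_level = max(ranges, key=ranges.get)  # level where models differ most

# Hypothetical quality[level][strategy].
quality = {
    "L0": {"zero-shot": 0.90, "few-shot": 0.91, "cot": 0.89},
    "L2": {"zero-shot": 0.60, "few-shot": 0.57, "cot": 0.61},
    "L5": {"zero-shot": 0.40, "few-shot": 0.42, "cot": 0.44},
}
# Count the levels at which each strategy has the highest score.
wins = Counter(max(s, key=s.get) for s in quality.values())
```

A range-based measure depends only on the two extreme models at each level, so a single outlier can move the peak. Reporting it next to a spread statistic such as the standard deviation would make the claim easier to check.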

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark and evaluations are self-contained

full rationale

The paper constructs GraphInstruct via 800 hand-authored instructions and 1,582 synthesized references to define six complexity levels and five evaluation dimensions, then reports empirical observations from evaluating 12 LLMs across 45 configurations. Central findings (discriminative power peaking at multi-constraint composition, domain-semantic constraints being iteration-invariant, and the verification-guided framework surpassing prompt-engineering baselines) are direct results of these tests rather than quantities defined in terms of themselves or forced by fitted parameters. No mathematical derivations, self-referential equations, or load-bearing self-citations reduce any claim to its inputs by construction. The benchmark and method development are presented as independent contributions with released artifacts for external verification.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper rests on the assumption that hand-crafted instructions and algorithmic references faithfully represent increasing graph complexity; no free parameters are described, but the six levels and five dimensions are invented constructs whose validity is not independently evidenced in the abstract.
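One way to probe the assumption that the algorithmic references faithfully implement their instructions is the manual spot check proposed in the simulated rebuttal: inspect a random sample of 100 of the 1,582 references and report the error rate. The sketch below draws such a sample and puts a Wilson score interval around a hypothetical outcome. The integer reference IDs, the seed, and the error count of 2 are all assumptions for illustration, not results.

```python
import math
import random
from typing import Tuple


def wilson_interval(errors: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumption: references are addressable by integer IDs 0..1581.
reference_ids = list(range(1582))
rng = random.Random(0)                    # fixed seed so the sample is reproducible
sample = rng.sample(reference_ids, 100)   # sampling without replacement

# Hypothetical outcome of the manual inspection: 2 faulty references out of 100.
low, high = wilson_interval(errors=2, n=100)
```

The Wilson interval is used here because it behaves sensibly when the observed error count is zero or very small, which is the expected case for a deterministic pipeline.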

axioms (1)
  • domain assumption: LLMs can be meaningfully evaluated as prompt-driven graph synthesizers using natural language instructions.
    Invoked in the opening motivation for the benchmark.
invented entities (2)
  • Six progressive complexity levels (no independent evidence)
    purpose: To stratify graph generation tasks so failures can be localized along the complexity spectrum.
    Newly defined stratification not present in prior graph-generation reviews.
  • Five evaluation dimensions (no independent evidence)
    purpose: To assess generated graphs beyond simple type or domain axes.
    Invented to provide finer diagnostic signals.




Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · 10 internal anchors

  1. [1]

    Graph Generators:

    Bonifati, Angela and Holub. Graph Generators:. 2020 , volume =

  2. [2]

    Xiang, Sheng and Wen, Dong and Cheng, Dawei and Zhang, Ying and Qin, Lu and Qian, Zhengping and Lin, Xuemin , title =. The. 2022 , volume =

  3. [3]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) Student Research Workshop , year =

    Demirci, Ege and Kerur, Rithwik and Singh, Ambuj , title =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) Student Research Workshop , year =

  4. [4]

    arXiv preprint arXiv:2403.14358 , year =

    Yao, Yang and Wang, Xin and Zhang, Zeyang and Qin, Yijian and Wang, Ziwei and Chu, Xu and Yang, Yuekui and Zhu, Wenwu and Mei, Hong , title =. arXiv preprint arXiv:2403.14358 , year =

  5. [5]

    International Conference on Learning Representations (ICLR) , year =

    Tang, Jianheng and Zhang, Qifan and Li, Yuhan and Liu, Nuo and Hua, Hongzhi and Jin, Jiawei and Wang, Yi and Huang, Xiao , title =. International Conference on Learning Representations (ICLR) , year =

  6. [6]

    Findings of the Association for Computational Linguistics (ACL) , year =

    Wang, Jianing and Wu, Junda and Hou, Yupeng and Liu, Yao and Gao, Ming and McAuley, Julian , title =. Findings of the Association for Computational Linguistics (ACL) , year =

  7. [7]

    International Conference on Learning Representations (ICLR) , year =

    Peng, Jie and Ji, Jiarui and Lei, Runlin and Wei, Zhewei and Liu, Yongchao and Hong, Chuntao , title =. International Conference on Learning Representations (ICLR) , year =

  8. [8]

    International Conference on Learning Representations (ICLR) , year =

    Fatemi, Bahare and Halcrow, Jonathan and Perozzi, Bryan , title =. International Conference on Learning Representations (ICLR) , year =

  9. [9]

    Advances in Neural Information Processing Systems (NeurIPS) , year =

    Wang, Heng and Feng, Shangbin and He, Tianxing and Tan, Zhaoxuan and Han, Xiaochuang and Tsvetkov, Yulia , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  10. [10]

    Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) , year =

    Chen, Nuo and Li, Yuhan and Tang, Jianheng and Li, Jia , title =. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) , year =

  11. [11]

    Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) , year =

    Tang, Jiabin and Yang, Yuhao and Wei, Wei and Shi, Lei and Su, Lixin and Cheng, Suqi and Yin, Dawei and Huang, Chao , title =. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) , year =

  12. [12]

    Findings of the Association for Computational Linguistics (ACL) , year =

    Jin, Bowen and Xie, Chulin and Zhang, Jiawei and Roy, Kashob Kumar and Zhang, Yu and Li, Zheng and Li, Ruirui and Tang, Xianfeng and Wang, Suhang and Meng, Yu and Han, Jiawei , title =. Findings of the Association for Computational Linguistics (ACL) , year =

  13. [13]

    Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year =

    Besta, Maciej and Blach, Nils and Kubicek, Ales and Gerstenberger, Robert and Podstawski, Michal and Gianinazzi, Lukas and Gajda, Joanna and Lehmann, Tomasz and Niewiadomski, Hubert and Nyczyk, Piotr and Hoefler, Torsten , title =. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year =

  14. [14]

    International Conference on Learning Representations (ICLR) , year =

    Luo, Linhao and Li, Yuan-Fang and Haffari, Gholamreza and Pan, Shirui , title =. International Conference on Learning Representations (ICLR) , year =

  15. [15]

    Findings of the Association for Computational Linguistics (EACL) , year =

    Ye, Ruosong and Zhang, Caiqi and Wang, Runhui and Xu, Shuyuan and Zhang, Yongfeng , title =. Findings of the Association for Computational Linguistics (EACL) , year =

  16. [16]

    and Kaiser, Lukasz and Polosukhin, Illia , title =

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  17. [17]

    and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D

    Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D. and Dhariwal, Prafulla and others , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  18. [18]

    arXiv preprint arXiv:2303.08774 , year =

  19. [19]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, Hugo and Martin, Louis and Stone, Kevin and others , title =. arXiv preprint arXiv:2307.09288 , year =

  20. [20]

    The Llama 3 Herd of Models

    Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and others , title =. arXiv preprint arXiv:2407.21783 , year =

  21. [21]

    Qwen2 Technical Report

    Yang, An and Yang, Baosong and Hui, Binyuan and Zheng, Bo and others , title =. arXiv preprint arXiv:2407.10671 , year =

  22. [22]

    Qwen2.5 Technical Report

    Yang, An and others , title =. arXiv preprint arXiv:2412.15115 , year =

  23. [23]

    arXiv preprint arXiv:2412.19437 , year =

  24. [24]

    Advances in Neural Information Processing Systems (NeurIPS) , year =

    Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Chi, Ed and Le, Quoc and Zhou, Denny , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  25. [25]

    Advances in Neural Information Processing Systems (NeurIPS) , year =

    Kojima, Takeshi and Gu, Shixiang Shane and Reid, Machel and Matsuo, Yutaka and Iwasawa, Yusuke , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  26. [26]

    International Conference on Learning Representations (ICLR) , year =

    Wang, Xuezhi and Wei, Jason and Schuurmans, Dale and Le, Quoc and Chi, Ed and Narang, Sharan and Chowdhery, Aakanksha and Zhou, Denny , title =. International Conference on Learning Representations (ICLR) , year =

  27. [27]

    Advances in Neural Information Processing Systems (NeurIPS) , year =

    Yao, Shunyu and Yu, Dian and Zhao, Jeffrey and Shafran, Izhak and Griffiths, Thomas and Cao, Yuan and Narasimhan, Karthik , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  28. [28]

    International Conference on Learning Representations (ICLR) , year =

    Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan , title =. International Conference on Learning Representations (ICLR) , year =

  29. [29]

    Advances in Neural Information Processing Systems (NeurIPS) , year =

    Madaan, Aman and Tandon, Niket and Gupta, Prakhar and Hallinan, Skyler and Gao, Luyu and Wiegreffe, Sarah and others , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  30. [30]

    Advances in Neural Information Processing Systems (NeurIPS) , year =

    Shinn, Noah and Cassano, Federico and Gopinath, Ashwin and Narasimhan, Karthik and Yao, Shunyu , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  31. [31]

    International Conference on Learning Representations (ICLR) , year =

    Gou, Zhibin and Shao, Zhihong and Gong, Yeyun and Shen, Yelong and Yang, Yujiu and Duan, Nan and Chen, Weizhu , title =. International Conference on Learning Representations (ICLR) , year =

  32. [32]

    International Conference on Learning Representations (ICLR) , year =

    Huang, Jie and Chen, Xinyun and Mishra, Swaroop and Zheng, Huaixiu Steven and Yu, Adams Wei and Song, Xinying and Zhou, Denny , title =. International Conference on Learning Representations (ICLR) , year =

  33. [33]

    Zhou, Denny and Scharli, Nathanael and Hou, Le and Wei, Jason and Scales, Nathan and Wang, Xuezhi and Schuurmans, Dale and Cui, Claire and Bousquet, Olivier and Le, Quoc and Chi, Ed. International Conference on Learning Representations (ICLR).

  34. [34]

    Kipf, Thomas N. and Welling, Max. International Conference on Learning Representations (ICLR).

  35. [35]

    Veličković, Petar and others. Graph Attention Networks.

  36. [36]

    Hamilton, William L. and Ying, Rex and Leskovec, Jure. Advances in Neural Information Processing Systems (NeurIPS).

  37. [37]

    Xu, Keyulu and Hu, Weihua and Leskovec, Jure and Jegelka, Stefanie. International Conference on Learning Representations (ICLR).

  38. [38]

    Hu, Weihua and Fey, Matthias and Zitnik, Marinka and Dong, Yuxiao and Ren, Hongyu and Liu, Bowen and Catasta, Michele and Leskovec, Jure. Advances in Neural Information Processing Systems (NeurIPS).

  39. [39]

    You, Jiaxuan and Ying, Rex and Ren, Xiang and Hamilton, William L. and Leskovec, Jure. International Conference on Machine Learning (ICML).

  40. [40]

    Shi, Chence and Xu, Minkai and Zhu, Zhaocheng and Zhang, Weinan and Zhang, Ming and Tang, Jian. International Conference on Learning Representations (ICLR).

  41. [41]

    De Cao, Nicola and Kipf, Thomas. ICML 2018 Deep Generative Models Workshop.

  42. [42]

    Simonovsky, Martin and Komodakis, Nikos. International Conference on Artificial Neural Networks (ICANN).

  43. [43]

    Jin, Wengong and Barzilay, Regina and Jaakkola, Tommi. International Conference on Machine Learning (ICML).

  44. [44]

    You, Jiaxuan and Liu, Bowen and Ying, Zhitao and Pande, Vijay and Leskovec, Jure. Advances in Neural Information Processing Systems (NeurIPS).

  45. [45]

    Vignac, Clement and Krawczuk, Igor and Siraudin, Antoine and Wang, Bohan and Cevher, Volkan and Frossard, Pascal. International Conference on Learning Representations (ICLR).

  46. [46]

    Jo, Jaehyeong and Lee, Seul and Hwang, Sung Ju. International Conference on Machine Learning (ICML).

  47. [47]

    Liang, Percy and Bommasani, Rishi and Lee, Tony and Tsipras, Dimitris and Soylu, Dilara and Yasunaga, Michihiro and others. Transactions on Machine Learning Research (TMLR).

  48. [48]

    Srivastava, Aarohi and Rastogi, Abhinav and Rao, Abhishek and Shoeb, Abu Awal Md and Abid, Abubakar and others. Transactions on Machine Learning Research (TMLR).

  49. [49]

    Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob. International Conference on Learning Representations (ICLR).

  50. [50]

    Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and others. Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track.

  51. [51]

    Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and others. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.

  52. [52]

    Chen, Mark and Tworek, Jerry and Jun, Heewoo and Yuan, Qiming and Pinto, Henrique Ponde de Oliveira and others. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.

  53. [53]

    Suzgun, Mirac and Scales, Nathan and Schärli, Nathanael and others. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. Findings of the Association for Computational Linguistics (ACL).

  54. [54]

    Kaplan, Jared and McCandlish, Sam and Henighan, Tom and Brown, Tom B. and Chess, Benjamin and Child, Rewon and others. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.

  55. [55]

    Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and others. Advances in Neural Information Processing Systems (NeurIPS).

  56. [56]

    Wei, Jason and Tay, Yi and Bommasani, Rishi and Raffel, Colin and Zoph, Barret and Borgeaud, Sebastian and others. Transactions on Machine Learning Research (TMLR).

  57. [57]

    Schaeffer, Rylan and Miranda, Brando and Koyejo, Sanmi. Advances in Neural Information Processing Systems (NeurIPS).

  58. [58]

    Wang, Yizhong and Kordi, Yeganeh and Mishra, Swaroop and Liu, Alisa and Smith, Noah A. and Khashabi, Daniel and Hajishirzi, Hannaneh. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL).

  59. [59]

    Xu, Can and Sun, Qingfeng and Zheng, Kai and Geng, Xiubo and Zhao, Pu and Feng, Jiazhan and Tao, Chongyang and Jiang, Daxin. International Conference on Learning Representations (ICLR).

  60. [60]

    Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and others. Advances in Neural Information Processing Systems (NeurIPS).

  61. [61]

    Dong, Yixin and Ruan, Charlie F. and Cai, Yaxing and Lai, Ruihang and Xu, Ziyi and Zhao, Yilong and Chen, Tianqi. arXiv preprint arXiv:2411.15100.

  62. [62]

    Willard, Brandon T. and Louf, Rémi. Efficient Guided Generation for Large Language Models.

  63. [63]

    Beurer-Kellner, Luca and Fischer, Marc and Vechev, Martin. Proceedings of the ACM on Programming Languages, 2023.

  64. [64]

    Barabási, Albert-László and Albert, Réka. Emergence of Scaling in Random Networks.

  65. [65]

    Watts, Duncan J. and Strogatz, Steven H. Nature, 1998.

  66. [66]

    Barabási, Albert-László. Network Science.

  67. [67]

    Newman, Mark.

  68. [68]

    Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).

  69. [69]

    Lin, Chin-Yew. Text Summarization Branches Out: Proceedings of the ACL Workshop.

  70. [70]

    Zhang, Tianyi and Kishore, Varsha and Wu, Felix and Weinberger, Kilian Q. and Artzi, Yoav. International Conference on Learning Representations (ICLR).

  71. [71]

    Gretton, Arthur and Borgwardt, Karsten M. and Rasch, Malte J. and Schölkopf, Bernhard and Smola, Alexander. A Kernel Two-Sample Test.

  72. [72]

    Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

  73. [73]

    Sen, Prithviraj and Namata, Galileo and Bilgic, Mustafa and Getoor, Lise and Galligher, Brian and Eliassi-Rad, Tina. AI Magazine, 2008.

  74. [74]

    Irwin, John J. and Sterling, Teague and Mysinger, Michael M. and Bolstad, Erin S. and Coleman, Ryan G. Journal of Chemical Information and Modeling.

  75. [75]

    Ramakrishnan, Raghunathan and Dral, Pavlo O. and Rupp, Matthias and von Lilienfeld, O. Anatole. Scientific Data.

  76. [76]

    Leskovec, Jure and Kleinberg, Jon and Faloutsos, Christos. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007.

  77. [77]

    Huang, Lei and Yu, Weijiang and Ma, Weitao and Zhong, Weihong and Feng, Zhangyin and Wang, Haotian and others. ACM Transactions on Information Systems (TOIS), 2025.

  78. [78]

    Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and others. Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.

  79. [79]

    Gebru, Timnit and Morgenstern, Jamie and Vecchione, Briana and Vaughan, Jennifer Wortman and Wallach, Hanna and Daumé III, Hal and Crawford, Kate. Datasheets for Datasets, 2021.

  80. [80]

    Huang, Haoyu and Chen, Chong and Sheng, Zeang and Li, Yang and Zhang, Wentao. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP).