ADEMA: A Knowledge-State Orchestration Architecture for Long-Horizon Knowledge Synthesis with LLM Agents

Chan Huah Yong; Zhou Hanlin

arxiv: 2604.25849 · v1 · submitted 2026-04-28 · 💻 cs.AI

Pith reviewed 2026-05-07 16:01 UTC · model grok-4.3

classification 💻 cs.AI

keywords: knowledge-state orchestration · long-horizon LLM tasks · LLM agents · epistemic bookkeeping · artifact progression · checkpoint resumption · multi-agent systems · knowledge synthesis

The pith

Long-horizon LLM tasks succeed when architectures enforce explicit knowledge-state transitions, artifact progression, and resumable continuity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ADEMA as an architecture for managing extended knowledge synthesis tasks with LLM agents. It identifies three core problems: knowledge states that drift across rounds, intermediate commitments that stay implicit, and evidence chains that break after interruptions. The design counters these by combining explicit epistemic bookkeeping, dual-evaluator governance, adaptive switching, reputation-based allocation, checkpoint-resumable persistence, memory condensation, artifact-first assembly, and validity checking. Evidence comes from a four-scenario showcase and a fixed 60-run matrix in which removing checkpoint/resume was the only change that produced an invalid run, and it did so in the interruption-sensitive resume condition. A sympathetic reader would care because the approach aims to turn unreliable multi-round agent work into a recoverable process that preserves evidence integrity.

Core claim

ADEMA is a knowledge-state orchestration architecture for long-horizon knowledge synthesis with LLM agents. It integrates explicit epistemic bookkeeping, heterogeneous dual-evaluator governance, adaptive task-mode switching, reputation-shaped resource allocation, checkpoint-resumable persistence, segment-level memory condensation, artifact-first assembly, and final-validity checking with safe fallback. Across the fixed 60-run matrix, removing checkpoint/resume produced the only invalid run, and it did so in the interruption-sensitive resume condition. Dual evaluation, segment synthesis, and dynamic governance function as supporting controls that shape trajectory discipline and cost-quality behavior rather than as universal binary prerequisites for completion.
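To make "heterogeneous dual-evaluator governance" concrete, here is a minimal sketch of one way two independent evaluators could gate a draft. The paper's actual decision rule is not given in the text above, so the function name `govern`, the thresholds `accept` and `max_gap`, and the three outcomes are all assumptions chosen for illustration.

```python
from typing import Callable

# An evaluator maps a draft to a score in [0, 1]. In practice the two
# evaluators would be different models or rubrics (hypothetical setup).
Evaluator = Callable[[str], float]


def govern(draft: str, eval_a: Evaluator, eval_b: Evaluator,
           accept: float = 0.7, max_gap: float = 0.3) -> str:
    """Return 'accept', 'revise', or 'escalate' for a draft.

    Assumed rule: strong disagreement between evaluators is escalated,
    otherwise the weaker of the two scores decides.
    """
    a, b = eval_a(draft), eval_b(draft)
    if abs(a - b) > max_gap:
        return "escalate"   # evaluators disagree too much to trust either
    if min(a, b) >= accept:
        return "accept"     # both evaluators clear the bar
    return "revise"         # agreement, but quality is too low


if __name__ == "__main__":
    # Constant stand-in evaluators, used only to exercise the rule.
    print(govern("draft text", lambda d: 0.9, lambda d: 0.8))  # accept
```

Under this rule a rejected draft costs another round instead of ending the run, which is consistent with the paper's reading of dual evaluation as a shaping control instead of a completion prerequisite.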

What carries the argument

The ADEMA architecture carries the argument by making epistemic state transitions explicit, progressing evidence-bearing artifacts, and enabling recoverable continuity through bookkeeping, dual evaluation, and resumable checkpoints.
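The combination of explicit state transitions and resumable checkpoints can be sketched as a small ledger. This is an illustration under stated assumptions, not ADEMA's implementation: the state names in `ALLOWED`, the `Ledger` class, and the JSON file layout are hypothetical, since the text above does not describe the paper's data model.

```python
import json
import os
from dataclasses import asdict, dataclass, field

# Hypothetical epistemic states and the transitions allowed between them.
ALLOWED = {
    "hypothesis": {"supported", "rejected"},
    "supported": {"committed", "rejected"},
    "committed": set(),
    "rejected": set(),
}


@dataclass
class Ledger:
    round_no: int = 0
    claims: dict = field(default_factory=dict)   # claim id -> current state
    history: list = field(default_factory=list)  # [claim id, old, new, evidence]

    def add(self, claim_id: str) -> None:
        self.claims[claim_id] = "hypothesis"

    def transition(self, claim_id: str, new_state: str, evidence: str) -> None:
        old = self.claims[claim_id]
        if new_state not in ALLOWED[old]:
            raise ValueError(f"illegal transition {old} -> {new_state}")
        self.claims[claim_id] = new_state
        # Every state change records the evidence that justified it.
        self.history.append([claim_id, old, new_state, evidence])

    def checkpoint(self, path: str) -> None:
        # Write to a temporary file, then swap it in, so an interruption
        # mid-write does not leave a half-written checkpoint at `path`.
        tmp = path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump(asdict(self), fh)
        os.replace(tmp, path)

    @classmethod
    def resume(cls, path: str) -> "Ledger":
        with open(path, encoding="utf-8") as fh:
            return cls(**json.load(fh))
```

A run that is interrupted after `checkpoint` can call `Ledger.resume` and continue with the same claims and the same evidence history, which is the property the interruption-sensitive condition appears to test.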

If this is right

Checkpoint/resume is the only mechanism whose removal produced an invalid run, and it did so in the interruption-sensitive condition.
Dual evaluation and dynamic governance shape trajectory discipline and cost-quality behavior rather than serving as binary prerequisites for task completion.
Artifact-first assembly and segment-level condensation maintain progressing evidence chains without relying on implicit LLM memory alone.
The architecture allows safe fallback and final-validity checking to recover from partial failures.
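A minimal sketch of artifact-first assembly with a final-validity check and safe fallback follows. The `Artifact` fields, the validity rule (every required section present and backed by at least one evidence reference), and the function names are assumptions made for illustration; the paper's own criteria are not stated in the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    section: str
    text: str
    evidence: tuple  # identifiers of the sources backing this text


def assemble(artifacts, required_sections):
    """Build the output only from stored artifacts, in the required order."""
    by_section = {a.section: a for a in artifacts}  # a later artifact wins
    return [by_section[s] for s in required_sections if s in by_section]


def is_valid(assembled, required_sections):
    """Assumed rule: all sections present, each with non-empty evidence."""
    present = {a.section for a in assembled}
    return present == set(required_sections) and all(a.evidence for a in assembled)


def finalize(artifacts, required_sections, last_valid=None):
    """Return (output, status); fall back to the last valid output on failure."""
    candidate = assemble(artifacts, required_sections)
    if is_valid(candidate, required_sections):
        return candidate, "ok"
    return last_valid, "fallback"
```

Because the final output is assembled from stored artifacts and never from the model's conversational memory, a failed check can return the previous valid assembly instead of an unverified draft.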

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the mechanisms scale, similar explicit-state tracking could be added to existing multi-agent frameworks to reduce drift without full redesign.
The results imply that implicit memory alone is insufficient for complex synthesis tasks longer than a few rounds, favoring external orchestration.
Testing on tasks exceeding the current matrix length could show whether condensation and resumption continue to prevent fractures at greater scales.

Load-bearing premise

The combination of explicit epistemic bookkeeping, checkpoint-resumable persistence, and artifact-first assembly will reliably prevent knowledge-state drift and fractured evidence chains in long-horizon LLM tasks.

What would settle it

Running the same long-horizon scenarios without checkpoint/resume or artifact-first assembly and observing no increase in invalid outputs or evidence fractures would falsify the necessity of these mechanisms.
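The comparison described above amounts to tallying invalid runs per ablation cell. The sketch below assumes each run is logged as a `(removed_mechanism, condition, valid)` triple; that record shape and the function name `tally_invalid` are hypothetical, and the sample records are placeholders, not the paper's data.

```python
from collections import Counter


def tally_invalid(runs):
    """Map (removed_mechanism, condition) -> (invalid_count, total_runs)."""
    invalid = Counter()
    total = Counter()
    for removed, condition, valid in runs:
        key = (removed, condition)
        total[key] += 1
        if not valid:
            invalid[key] += 1
    # Counter returns 0 for cells that never produced an invalid run.
    return {key: (invalid[key], total[key]) for key in total}


if __name__ == "__main__":
    # Placeholder records for illustration only.
    sample = [
        ("checkpoint_resume", "resume", False),
        ("checkpoint_resume", "resume", True),
        ("dual_evaluation", "resume", True),
    ]
    for cell, (bad, n) in sorted(tally_invalid(sample).items()):
        print(cell, f"{bad}/{n} invalid")
```

The falsification test is then a comparison between the ablated cells and the full-system cells: if the invalid counts do not rise when checkpoint/resume or artifact-first assembly is removed, the necessity claim fails.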

Original abstract

Long-horizon LLM tasks often fail not because a single answer is unattainable, but because knowledge states drift across rounds, intermediate commitments remain implicit, and interruption fractures the evolving evidence chain. This paper presents ADEMA as a knowledge-state orchestration architecture for long-horizon knowledge synthesis rather than as a generic multi-agent runtime. The architecture combines explicit epistemic bookkeeping, heterogeneous dual-evaluator governance, adaptive task-mode switching, reputation-shaped resource allocation, checkpoint-resumable persistence, segment-level memory condensation, artifact-first assembly, and final-validity checking with safe fallback. Evidence is drawn entirely from existing materials: a four-scenario showcase package, a fixed 60-run mechanism matrix, targeted micro-ablation and artifact-chain supplements, and a repaired protocol-level benchmark in which code-oriented evaluation is the clearest quality-sensitive mechanism block. Across the fixed matrix, removing checkpoint/resume produced the only invalid run, and it did so in the interruption-sensitive resume condition. By contrast, dual evaluation, segment synthesis, and dynamic governance are best interpreted as supporting control mechanisms that shape trajectory discipline, explicit artifact progression, and cost-quality behavior rather than as universal binary prerequisites for completion. The contribution is therefore a knowledge-state orchestration architecture in which explicit epistemic state transition, evidence-bearing artifact progression, and recoverable continuity are the primary design commitments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ADEMA makes knowledge-state management explicit for long LLM runs and the 60-run matrix flags checkpointing as critical under interruptions, but the tests do not directly measure drift in complete cases.

The paper's core move is to treat knowledge-state drift and fractured evidence chains as the main failure mode in long-horizon LLM synthesis, then build an architecture around explicit epistemic bookkeeping, dual-evaluator checks, checkpoint-resumable persistence, and artifact-first assembly. That framing is clearer than most agent papers that just add more tools or memory without naming the state problem directly.

The fixed 60-run mechanism matrix and four-scenario showcase are the main evidence, and they show that removing checkpoint and resume is the only change that produces an invalid run in the interruption condition. The other mechanisms appear more as shaping tools than hard requirements for basic completion. This gives a practical negative result worth noting for anyone building resumable agent pipelines. The work is honest about drawing from existing materials and a repaired benchmark rather than claiming new data collection. The architecture description itself is systematic and lists the components without overclaiming universality.

Soft spots sit in the evaluation scope. The matrix tests combinations but does not report direct metrics for knowledge-state consistency or evidence-chain integrity across full uninterrupted long-horizon runs, so the claim that the full combination reliably prevents drift rests more on design intent than on measured bounds. No error bars, open code, or data are referenced in the abstract, which limits how far the 60-run results can be taken. The repaired benchmark also means some of the quality signal comes from protocol fixes rather than the architecture alone.

This paper is for people who already run multi-step LLM agents in research or automation and need a concrete way to make state and recovery first-class. Readers looking for a checklist of mechanisms and a small empirical matrix on interruptions will find it useful. It is not a foundational theoretical result, but the problem it targets is common enough that the explicit orchestration focus adds value.

I would send it to peer review. The central idea is coherent and the matrix gives at least one clear finding, so referees can usefully press on the measurement gaps and generalization without the paper being desk-rejected.

Referee Report

2 major / 2 minor

Summary. The paper presents ADEMA as a knowledge-state orchestration architecture for long-horizon LLM agent tasks, designed to address knowledge-state drift, implicit commitments, and fractured evidence chains. It integrates explicit epistemic bookkeeping, heterogeneous dual-evaluator governance, adaptive task-mode switching, reputation-shaped resource allocation, checkpoint-resumable persistence, segment-level memory condensation, artifact-first assembly, and final-validity checking. Claims are supported by a four-scenario showcase package, a fixed 60-run mechanism matrix (where only checkpoint/resume removal produced an invalid run under interruption), targeted micro-ablations, and a repaired protocol-level benchmark, with the primary design commitments identified as explicit epistemic state transition, evidence-bearing artifact progression, and recoverable continuity.

Significance. If the central claims hold under more direct validation, ADEMA would provide a structured orchestration framework that prioritizes epistemic continuity and artifact integrity over generic multi-agent runtimes, potentially improving reliability in extended LLM synthesis workflows. The transparent sourcing from existing materials and emphasis on recoverable persistence represent strengths, though the limited scope of the 60-run matrix and absence of drift-specific metrics constrain broader impact assessment.

major comments (2)

The 60-run mechanism matrix and four-scenario showcase (as described in the abstract) demonstrate that checkpoint/resume removal is the only condition producing an invalid run in the interruption-sensitive case, but provide no direct state-consistency metrics, evidence-chain integrity scores, or drift bounds for uninterrupted long-horizon executions; this leaves the claim that the full combination of epistemic bookkeeping, artifact progression, and recoverable continuity reliably prevents drift under-supported for the general case.
The abstract interprets dual evaluation, segment synthesis, and dynamic governance as shaping control mechanisms rather than binary prerequisites, yet without reported quantitative comparisons (e.g., trajectory discipline or cost-quality deltas) or details on the repaired benchmark's baseline, the load-bearing distinction between these components and the primary commitments cannot be fully evaluated.

minor comments (2)

The abstract provides no error bars, variance measures, or statistical details for the fixed 60-run matrix, reducing clarity on result robustness.
No reference to open data, code, or reproduction artifacts is mentioned, which limits independent verification of the showcase and matrix results.
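One way to supply the missing uncertainty estimate is a Wilson score interval on the invalid-run rate of each matrix cell. The sketch below is a standard formula, not something the paper reports; the number of runs per cell is not given in the abstract, so `runs` must be filled in from the paper's matrix and the values in the usage line are placeholders.

```python
import math


def wilson_interval(failures: int, runs: int, z: float = 1.96):
    """Wilson score interval for a failure proportion (z=1.96 ~ 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= failures <= runs:
        raise ValueError("failures must lie between 0 and runs")
    p = failures / runs
    denom = 1.0 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Placeholder cell size; substitute the real per-condition run count.
    low, high = wilson_interval(failures=1, runs=12)
    print(f"invalid-run rate: 95% CI [{low:.3f}, {high:.3f}]")
```

With a single invalid run in a small cell the interval will be wide, which is the referee's point: one failure identifies a mechanism worth studying but does not pin down a failure rate.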

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for identifying areas where the evidential support for our claims can be strengthened. We address each major comment below and outline targeted revisions to the manuscript.

Referee: The 60-run mechanism matrix and four-scenario showcase (as described in the abstract) demonstrate that checkpoint/resume removal is the only condition producing an invalid run in the interruption-sensitive case, but provide no direct state-consistency metrics, evidence-chain integrity scores, or drift bounds for uninterrupted long-horizon executions; this leaves the claim that the full combination of epistemic bookkeeping, artifact progression, and recoverable continuity reliably prevents drift under-supported for the general case.

Authors: We agree that the current evaluation does not supply direct quantitative drift bounds or state-consistency scores for uninterrupted long-horizon runs. The 60-run matrix was designed to isolate the effect of mechanism removal under interruption stress, and the four-scenario showcase provides qualitative evidence of successful epistemic continuity in complete executions. The architectural claim rests on the premise that explicit epistemic bookkeeping combined with artifact-first assembly prevents drift by construction. In revision we will add a new subsection that (a) formalizes the drift-prevention invariants implied by the state-transition rules and (b) maps specific showcase traces to those invariants, thereby making the general-case argument more explicit while acknowledging the absence of numerical drift metrics.
revision: yes

Referee: The abstract interprets dual evaluation, segment synthesis, and dynamic governance as shaping control mechanisms rather than binary prerequisites, yet without reported quantitative comparisons (e.g., trajectory discipline or cost-quality deltas) or details on the repaired benchmark's baseline, the load-bearing distinction between these components and the primary commitments cannot be fully evaluated.

Authors: We accept that the manuscript currently lacks side-by-side quantitative comparisons (trajectory discipline, cost-quality deltas) that would directly contrast the shaping versus prerequisite roles. The positioning of these components as control mechanisms is derived from the mechanism-matrix outcomes and the repaired benchmark, where code-oriented evaluation was the dominant quality-sensitive block. In revision we will (i) expand the benchmark description to include the original protocol baseline and any observable differences in run trajectories when the control mechanisms are ablated, and (ii) add a short paragraph clarifying the evidential basis for treating them as shaping rather than binary. Because new head-to-head experiments are not feasible within the current experimental budget, the revision will remain interpretive rather than adding fresh numerical deltas.
revision: partial

Circularity Check

0 steps flagged

No circularity: architecture claims rest on independent experimental matrix rather than self-referential reduction

The paper introduces ADEMA as a knowledge-state orchestration architecture whose primary commitments (explicit epistemic state transition, evidence-bearing artifact progression, recoverable continuity) are validated through a four-scenario showcase, fixed 60-run mechanism matrix, and targeted ablations. Removing checkpoint/resume produced the sole invalid run under interruption conditions, while other mechanisms are interpreted as shaping controls. No equations, fitted parameters presented as predictions, self-citations, or uniqueness theorems appear in the provided text. The evidence consists of direct empirical runs on existing materials rather than any derivation that reduces by construction to the architecture definition itself. The chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The architecture rests on the domain assumption that LLM agents suffer from implicit knowledge drift that can be mitigated by explicit state tracking and persistence mechanisms. No free parameters or invented physical entities are introduced; the design choices function as engineering commitments rather than fitted constants.

axioms (1)

Domain assumption: LLM agents in long-horizon tasks experience knowledge-state drift and fractured evidence chains that explicit bookkeeping and checkpoints can address.
Invoked throughout the abstract as the motivation for the eight listed mechanisms and the primary design commitments.

pith-pipeline@v0.9.0 · 5534 in / 1508 out tokens · 46784 ms · 2026-05-07T16:01:43.003899+00:00 · methodology
