COCORELI: Enforcing Execution Preconditions for Reliable Collaborative Instruction Following

Bastien Navarri; Eleni Metheniti; Loïc Cabannes; Morteza Ezzabady; Nicholas Asher; Omar Naim; Swarnadeep Bhar

arxiv: 2509.04470 · v2 · submitted 2025-08-29 · 💻 cs.CL · cs.AI

Swarnadeep Bhar, Omar Naim, Eleni Metheniti, Bastien Navarri, Loïc Cabannes, Morteza Ezzabady, Nicholas Asher

Pith reviewed 2026-05-18 20:44 UTC · model grok-4.3

keywords: collaborative instruction following · underspecification · agent reliability · precondition enforcement · modular task representation · hallucinated actions · execution blocking

The pith

Cocoreli blocks execution until missing task details are clarified, eliminating hallucinated actions by architectural design.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that detecting incomplete instructions is not enough for reliable agent behavior, because models can still proceed to execute anyway. Cocoreli addresses this by representing tasks modularly so that unresolved specifications are treated as hard preconditions that prevent any action until clarified. This coupling of detection and blocking is built into the architecture rather than left to the model's reasoning. Tests in a controlled construction setting show that the method blocks execution under underspecification, whereas chain-of-thought, prompt-chaining, and ReAct-style approaches may still execute despite detecting the gap. The same structure also supports task abstraction and extends to API workflow tasks.

Core claim

Cocoreli is a modular architecture that represents task structure, tracks missing information, and blocks execution until required details are resolved through targeted clarification. Detection and prevention are structurally coupled so that identifying a missing parameter immediately blocks further action, preventing execution under unresolved specifications.
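
To make the coupling concrete, here is a minimal sketch of a precondition-gated executor. It is not the authors' implementation: the slot names, the `PlaceTask` class, and the `BlockedExecution` exception are all hypothetical, chosen only to mirror the placement actions quoted elsewhere on this page. The point it illustrates is that the same check that detects a missing parameter is the one that refuses to act.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical required slots for one placement action (an assumption,
# not the paper's actual task schema).
REQUIRED_SLOTS = ("part", "color", "row", "column", "height")


class BlockedExecution(Exception):
    """Raised when an action is attempted with unresolved required slots."""


@dataclass
class PlaceTask:
    slots: dict = field(default_factory=dict)

    def missing(self) -> list:
        # A slot is unresolved if it is absent or explicitly None.
        return [s for s in REQUIRED_SLOTS if self.slots.get(s) is None]

    def clarification(self) -> Optional[str]:
        gaps = self.missing()
        return f"Please specify: {', '.join(gaps)}" if gaps else None


def execute(task: PlaceTask) -> str:
    # Detection and blocking share one check: the action string below
    # cannot be produced while the missing list is non-empty.
    if task.missing():
        raise BlockedExecution(task.clarification())
    s = task.slots
    return (f"Place a {s['color']} {s['part']} at row {s['row']} "
            f"column {s['column']} height {s['height']}")
```

In this sketch a caller would catch `BlockedExecution`, relay its message to the user, fill the slot from the answer, and retry. A prompt-only agent has no equivalent hard stop, which is the contrast the core claim draws.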

What carries the argument

Modular task representation that identifies preconditions and enforces blocking of execution until they are resolved.

If this is right

Agents avoid performing incorrect or unsafe actions based on incomplete instructions.
The task representation enables abstraction and reuse across different instructions.
The approach generalizes from construction tasks to API workflow tasks on ToolBench.
Reliable collaborative execution depends on architectural enforcement of preconditions rather than model capability alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This precondition-blocking pattern could apply to robotic or software agents that must handle partial instructions without failing dangerously.
Combining the representation with stronger models might reduce cases of excessive blocking when preconditions are ambiguous.
Evaluating the method on longer, more open-ended instruction sequences would test whether the blocking mechanism scales without frequent interruptions.

Load-bearing premise

The modular task representation can identify all necessary preconditions comprehensively and accurately without missing critical details or blocking too often in realistic use.

What would settle it

A case where an agent using Cocoreli executes an action despite a critical unresolved precondition that the representation should have caught would show the claim does not hold.
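
One way to hunt for such a case, at least for preconditions the representation does declare, is an exhaustive check over every partial instruction. The sketch below is an assumption-laden illustration: the slot names, the `gated_execute` function, and the sample values are hypothetical and stand in for whatever interface the real system exposes. It cannot find preconditions that were never represented in the first place.

```python
from itertools import combinations

# Hypothetical slot schema and one fully specified instruction.
SLOTS = ("part", "color", "row", "column", "height")
FULL = {"part": "screw", "color": "blue", "row": 4, "column": 5, "height": 1}


def gated_execute(slots: dict, log: list) -> str:
    """Stand-in for the system under test: act only when all slots are filled."""
    if any(slots.get(name) is None for name in SLOTS):
        return "blocked"
    log.append(dict(slots))  # record the action that was carried out
    return "executed"


def find_counterexamples() -> list:
    """Return every incomplete slot subset for which an action still ran."""
    counterexamples = []
    for k in range(len(SLOTS) + 1):
        for present in combinations(SLOTS, k):
            partial = {name: FULL[name] for name in present}
            log = []
            outcome = gated_execute(partial, log)
            incomplete = k < len(SLOTS)
            if incomplete and (outcome == "executed" or log):
                counterexamples.append(present)
    return counterexamples
```

An empty result from `find_counterexamples()` only shows that the gate is sound with respect to the declared slots; a non-empty result would be the kind of counterexample described above.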

Figures

Figures reproduced from arXiv: 2509.04470 by Bastien Navarri, Eleni Metheniti, Loïc Cabannes, Morteza Ezzabady, Nicholas Asher, Omar Naim, Swarnadeep Bhar.

**Figure 1.** COCORELI, our hybrid agentic system.

Body text extracted alongside the Figure 1 caption: "… multiple agents using LLM prompts. The workflow in such a system involves intermediary stages, resulting in better performance, adaptability, versatility, interpretability, and reliability in downstream tasks (Sudarshan et al., 2024). Agentic models may incorporate components that use and learn deterministic functions in context in a hybrid or tool learning approach …"

**Figure 2.** Graphical representations of the Minecraft …

**Figure 3.** The B Shape

**Figure 4.** The Moroccan Bridge

Text extracted alongside the Figure 4 caption (C.2 Task (v)): a sample of human input to COCORELI, taken from the input for the C15 shape:

- INST: Can you place a blue screw at row 4 column 5 height 1
- A: Place a blue screw at row 4 column 5 height 1
- INST: Place a red screw next to the blue screw, and put a red screw on top.
- A: Place a red screw at row 4 column 6 height 1
- A: Place a red screw at row 4 column 6 height 2
- …

**Figure 5.** (caption not recovered) The accompanying text lists shapes: D21: 6 parts (various), 5 instructions; X34: 5 parts (various), 5 instructions; Square: 16 parts (blue washers), 17 instructions; Triad: 18 parts (red nuts), 17 instructions; Face: 47 parts (blue nuts); I: 18 parts (blue nuts).

**Figure 7.** The Skull shape

**Figure 8.** The skull that was recalled by GPT-4.1. The …

**Figure 9.** The original I shape for Task (v)

**Figure 10.** The I shape that was recalled by GPT-4.1 for …
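
The sample actions quoted with Figure 4 follow a fixed surface form ("Place a [color] [part] at row … column … height …"). As a rough sketch of how such a string could be turned into explicit slots, the hypothetical parser below uses a regular expression. The function name and the returned dictionary layout are assumptions for illustration, not part of COCORELI.

```python
import re
from typing import Optional

# Pattern for fully specified placement actions, as quoted with Figure 4.
ACTION_RE = re.compile(
    r"^Place an? (?P<color>\w+) (?P<part>\w+) "
    r"at row (?P<row>\d+) column (?P<column>\d+) height (?P<height>\d+)$"
)


def parse_place_action(text: str) -> Optional[dict]:
    """Return the action's slots, or None if the text is not a complete action."""
    match = ACTION_RE.match(text.strip())
    if match is None:
        return None
    slots = match.groupdict()
    for name in ("row", "column", "height"):
        slots[name] = int(slots[name])  # coordinates become integers
    return slots
```

A relative instruction such as "Place a red screw next to the blue screw, and put a red screw on top." does not match the pattern, so the parser returns `None` rather than guessing coordinates. Resolving that kind of instruction is the job of the system itself, not of this sketch.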

Original abstract

Autonomous agents executing human instructions must operate reliably even when instructions are incomplete. While recent approaches improve detection of missing information, detection alone is insufficient: agents often proceed to execution even after recognizing underspecification, leading to incorrect or unsafe actions. We identify this failure as arising from a lack of coupling between detection and execution, and propose that reliable behavior requires enforcing missing information as a precondition for action. We instantiate this principle in Cocoreli, a modular architecture that represents task structure, tracks missing information, and blocks execution until required details are resolved through targeted clarification. In Cocoreli, detection and prevention are structurally coupled: detecting a missing parameter simultaneously blocks execution. We evaluate Cocoreli in a controlled construction environment isolating underspecification and sequential execution. Cocoreli blocks execution under unresolved specifications by construction, eliminating hallucinated actions. In contrast, chain-of-thought, prompt-chaining, and ReAct-style reasoning may still execute under incomplete specifications despite high detection rates. The same representation supports abstraction and reuse, and generalizes to API workflow tasks on ToolBench. These results show that reliable collaborative execution requires architectural enforcement, not just model capability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Cocoreli couples missing-info detection directly to execution blocking through a modular task representation, which is a practical architectural fix but rests on unproven completeness of that representation outside controlled cases.

Cocoreli's main contribution is the structural coupling: once the modular representation flags a missing precondition, execution is blocked by construction rather than left to the model's discretion. This directly addresses the gap where agents detect underspecification yet still act, as seen in the baselines the authors contrast it against: CoT, prompt-chaining, and ReAct-style loops. The controlled construction environment lets them isolate sequential execution and show that the blocking works without hallucinated steps. The reuse and abstraction properties of the representation are a practical bonus, and the ToolBench generalization shows it can handle API workflow tasks without starting from scratch. That is real engineering progress on reliability for collaborative agents.

The soft spot is exactly the one the stress-test flags. The whole enforcement benefit depends on the modular structure capturing every necessary precondition without gaps or excessive blocking. Their tests use structured, controlled cases, so they do not demonstrate coverage for implicit or context-dependent details that arise in open human instructions. If any precondition is omitted from the representation, detection fails and the architecture reduces to standard detection without the safety guarantee. No quantitative error analysis on missed preconditions or over-blocking appears in the reported results, which leaves the robustness claim thinner than the central principle.

This is for researchers building agents that must collaborate safely with incomplete instructions, especially in workflow or task domains. Readers focused on architectural enforcement rather than pure prompting will find the mechanism worth examining.

I would send it for peer review. The core idea is clear, the controlled evidence supports the coupling claim, and the work is grounded enough to benefit from referee input on broader validation.
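
The omitted-precondition concern can be shown in a few lines. In the sketch below, the schemas, the slot names, and the `gate` function are hypothetical; they are not drawn from the paper. The same instruction is checked against a schema that declares `height` and one that does not.

```python
def missing_slots(schema: tuple, slots: dict) -> list:
    """List the declared slots that the instruction leaves unresolved."""
    return [name for name in schema if slots.get(name) is None]


def gate(schema: tuple, slots: dict) -> str:
    """Allow execution only when every slot named in the schema is filled."""
    return "blocked" if missing_slots(schema, slots) else "execute"


# Hypothetical schemas: the second one never declares 'height'.
full_schema = ("part", "color", "row", "column", "height")
partial_schema = ("part", "color", "row", "column")

# An instruction that gives no height.
instruction = {"part": "nut", "color": "red", "row": 2, "column": 3}

print(gate(full_schema, instruction))     # blocked
print(gate(partial_schema, instruction))  # execute
```

The gate is only as complete as the schema behind it: with `partial_schema` the missing height is never detected, so nothing blocks the action. That is the sense in which the guarantee is bounded by the representation.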

Referee Report

2 major / 2 minor

Summary. The paper claims that reliable collaborative instruction following requires architectural coupling between detection of underspecification and execution prevention, rather than relying on model capability alone. It introduces COCORELI, a modular architecture that represents task structure, tracks missing parameters as preconditions, and blocks execution until clarification resolves them. This is contrasted with CoT, prompt-chaining, and ReAct baselines that may still execute despite detecting incompleteness. Evaluation occurs in a controlled construction environment isolating underspecification and sequential execution, with generalization shown on structured API workflow tasks from ToolBench. The central result is that COCORELI eliminates hallucinated actions by construction through this coupling.

Significance. If the architectural enforcement holds under the stated assumptions, the work provides a concrete demonstration that structural coupling can guarantee blocking where detection alone fails, addressing a practical gap in agent reliability for incomplete human instructions. The reuse and abstraction properties of the modular representation, plus ToolBench generalization, suggest potential for broader application in collaborative AI systems. The absence of fitted parameters and reliance on explicit task structure is a strength for interpretability.

major comments (2)

[Abstract and modular architecture description] The central claim that COCORELI blocks execution under unresolved specifications by construction (Abstract) depends on the modular task representation comprehensively encoding every necessary precondition. The controlled construction environment and ToolBench tests cover structured cases but do not demonstrate coverage of implicit or context-dependent preconditions that arise in less constrained instructions; if any precondition is omitted, detection fails and the enforcement guarantee reduces to standard detection without additional protection.
[Evaluation section] While a qualitative contrast with baselines is described, the abstract and available details provide no quantitative metrics, error analysis, or full evaluation details (e.g., rates of blocking success, over-blocking, or clarification efficiency). This makes it difficult to assess whether the architectural coupling delivers measurable reliability gains beyond the controlled setting.

minor comments (2)

[Architecture details] Clarify the exact interface between the modular representation and the underlying LLM or agent executor, including how clarification requests are generated and integrated without introducing new underspecification.
[Generalization paragraph] The generalization claim to ToolBench would benefit from explicit comparison of precondition coverage between the construction environment and API workflows.
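
To illustrate what the first minor comment is asking for, here is one possible shape for a clarification interface that avoids introducing new underspecification: ask about one missing slot at a time, and accept an answer only if it is valid for that slot. Everything in it (the `ALLOWED` vocabularies, the function names, the question wording) is a hypothetical sketch, not a description of how COCORELI generates clarification requests.

```python
from typing import Optional

# Hypothetical closed vocabularies; parts and colors echo the page's examples.
ALLOWED = {
    "part": {"screw", "nut", "washer"},
    "color": {"red", "blue"},
}


def next_question(missing: list) -> Optional[str]:
    """Ask about exactly one unresolved slot, listing valid answers if known."""
    if not missing:
        return None
    slot = missing[0]
    options = ALLOWED.get(slot)
    if options:
        return f"Which {slot}? Options: {', '.join(sorted(options))}"
    return f"Which {slot}? Please give a whole number."


def apply_answer(slots: dict, slot: str, answer: str) -> bool:
    """Fill the slot only if the answer is valid; otherwise leave it unresolved."""
    answer = answer.strip().lower()
    if slot in ALLOWED:
        if answer not in ALLOWED[slot]:
            return False
        slots[slot] = answer
        return True
    if answer.isdigit():
        slots[slot] = int(answer)
        return True
    return False
```

Because an invalid answer leaves the slot empty, the slot stays on the missing list and execution stays blocked. A vague reply therefore cannot slip through as a resolved parameter.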

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, clarifying the scope of our claims and indicating revisions to strengthen the manuscript.

Referee: [Abstract and modular architecture description] The central claim that COCORELI blocks execution under unresolved specifications by construction (Abstract) depends on the modular task representation comprehensively encoding every necessary precondition. The controlled construction environment and ToolBench tests cover structured cases but do not demonstrate coverage of implicit or context-dependent preconditions that arise in less constrained instructions; if any precondition is omitted, detection fails and the enforcement guarantee reduces to standard detection without additional protection.

Authors: We agree that the enforcement guarantee holds only for preconditions explicitly represented in the task structure. Our evaluation targets structured domains where task structure is either given or derivable (controlled construction tasks and ToolBench API workflows). Implicit or context-dependent preconditions outside this representation would indeed reduce the system to detection-only behavior. We will revise the abstract and add a dedicated limitations paragraph to explicitly bound the claims and note this as a direction for future extensions of the representation. revision: yes
Referee: [Evaluation section] While a qualitative contrast with baselines is described, the abstract and available details provide no quantitative metrics, error analysis, or full evaluation details (e.g., rates of blocking success, over-blocking, or clarification efficiency). This makes it difficult to assess whether the architectural coupling delivers measurable reliability gains beyond the controlled setting.

Authors: The current manuscript emphasizes qualitative behavioral contrasts to isolate the effect of architectural coupling. We acknowledge that explicit quantitative reporting would strengthen the presentation. The evaluation section contains success and error counts from the controlled environment; we will expand it in revision to include tabulated metrics for blocking success rate, over-blocking rate, and clarification efficiency, together with a brief error analysis comparing COCORELI against the baselines. revision: yes
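
For readers wondering how the three promised metrics might be tabulated, the sketch below gives one plausible set of definitions. These definitions are assumptions: neither the report nor the rebuttal specifies them, and "clarification efficiency" in particular is rendered here as mean clarification turns per correctly blocked episode. The `Episode` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    underspecified: bool       # did the instruction omit a required detail?
    blocked: bool              # did the system refuse to act before clarification?
    clarification_turns: int   # questions asked before the task was resolved


def ratio(numerator: float, denominator: float) -> float:
    # Avoid division by zero when a category has no episodes.
    return numerator / denominator if denominator else float("nan")


def summarize(episodes: list) -> dict:
    under = [e for e in episodes if e.underspecified]
    complete = [e for e in episodes if not e.underspecified]
    blocked_under = [e for e in under if e.blocked]
    return {
        # Share of underspecified episodes where execution was withheld.
        "blocking_success_rate": ratio(len(blocked_under), len(under)),
        # Share of fully specified episodes that were blocked anyway.
        "over_blocking_rate": ratio(sum(e.blocked for e in complete), len(complete)),
        # Assumed proxy for clarification efficiency (lower is better).
        "mean_clarification_turns": ratio(
            sum(e.clarification_turns for e in blocked_under), len(blocked_under)
        ),
    }
```

Run on logged episodes from COCORELI and from each baseline, a table of these three numbers would address the referee's request directly. No such figures are reported on this page.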

Circularity Check

0 steps flagged

No circularity: architectural coupling is a design choice, not a derived reduction

The paper proposes Cocoreli as a modular architecture that represents task structure, tracks missing information, and blocks execution until details are resolved. The core claim that 'detection and prevention are structurally coupled: detecting a missing parameter simultaneously blocks execution' and 'Cocoreli blocks execution under unresolved specifications by construction' follows directly from the stated design of the system rather than from any equation, fitted parameter, or prior result that would render the outcome tautological. No self-referential equations, predictions of fitted quantities, or load-bearing self-citations appear in the provided text. The evaluation on a controlled construction environment and ToolBench generalization serves as external testing rather than internal re-derivation. This is a standard architectural proposal whose central result is the proposed coupling itself, with no reduction of claims to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only view limits visibility; the approach rests on assumptions about accurate task modeling and clarification feasibility.

axioms (1)

Domain assumption: Task structure can be represented modularly to track and resolve all missing parameters.
Invoked to enable blocking until preconditions are met.

invented entities (1)

COCORELI modular architecture (no independent evidence)
Purpose: Enforce execution preconditions by coupling detection and blocking.
New system component introduced to solve the identified failure mode.

pith-pipeline@v0.9.0 · 5758 in / 1064 out tokens · 42355 ms · 2026-05-18T20:44:36.432281+00:00 · methodology
