DAR: Deontic Reasoning with Agentic Harnesses

Benjamin Van Durme; Guangyao Dou; Nils Holzenberger; William Jurayj

arxiv: 2606.05009 · v1 · pith:DTKOFLC4new · submitted 2026-06-03 · 💻 cs.CL · cs.AI

DAR: Deontic Reasoning with Agentic Harnesses

Guangyao Dou , William Jurayj , Nils Holzenberger , Benjamin Van Durme This is my paper

Pith reviewed 2026-06-28 05:53 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords deontic reasoningagentic reasoningLLM reasoningstatutesDeonticBenchnumerical tasksrule application

0 comments

The pith

Agentic harnesses let LLMs query statutes on demand and improve deontic reasoning on some tasks but not uniformly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Deontic Agentic Reasoning (DAR) as a setup where models actively interact with statutes rather than receiving full rule sets upfront. It tests this approach across multiple harnesses on hard subsets of DeonticBench. Results show that agentic methods can advance performance on deontic tasks such as tax or immigration rule application. Gains vary by model strength, with weaker models often losing ground on numerical subtasks and using substantially more tokens. The work treats the interaction mechanism itself as the variable that determines whether rules are located and applied correctly.

Core claim

Deontic Agentic Reasoning (DAR) is an agentic reasoning setup in which the model interacts with the statutes on demand. Across multiple harnesses on hard subsets of DeonticBench, agentic harnesses can push the frontier on deontic reasoning tasks, but improvements are not uniform: weaker models often degrade on numerical tasks while consuming far more tokens.

What carries the argument

Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model interacts with the statutes on demand to locate and apply relevant rules.

If this is right

Agentic interaction expands the set of deontic problems LLMs can solve when rule sets are long and cross-referenced.
Performance differences between models become more pronounced under agentic harnesses than under standard prompting.
Numerical deontic subtasks remain especially sensitive to model scale when using on-demand rule access.
Token budgets increase substantially with agentic methods, creating a cost-performance trade-off.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same on-demand interaction pattern could be applied to other domains that rely on large, cross-referenced rule collections such as regulatory compliance or contract interpretation.
Token-efficiency optimizations or hybrid retrieval-plus-agent designs might reduce the cost penalty observed for weaker models.
If the interaction harness is made more structured, the gap between strong and weak models on numerical tasks might narrow.

Load-bearing premise

Allowing the model to interact with the statutes on demand enables it to locate and apply the correct rules more effectively than standard prompting without introducing new errors or excessive costs.

What would settle it

If DAR versions of the same models produce lower accuracy than direct-prompt baselines on the same DeonticBench subsets, especially on numerical tasks, the claim that agentic harnesses push the frontier would not hold.

Figures

Figures reproduced from arXiv: 2606.05009 by Benjamin Van Durme, Guangyao Dou, Nils Holzenberger, William Jurayj.

**Figure 2.** Figure 2: Harness comparison across DeonticBench. Qwen3-Coder-480B, Qwen3-235B-A22B (Yang et al., 2025), and Kimi K2 0905 from moonshot (Team et al., 2025). For proprietary models, we evaluate OpenAI GPT-5.1 and GPT-5.2 (Singh et al., 2025) with reasoning effort set to none, and Claude Sonnet 4.5 (Anthropic, 2025). Open-source models are served via vLLM (Kwon et al., 2023) or accessed through the OpenRouter API. 3.3… view at source ↗

**Figure 3.** Figure 3: Average tokens consumed per trial under Di [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

read the original abstract

Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, for example computing tax liability under a statute or determining the outcome of an immigration appeal. A key technical challenge for LLM-based deontic reasoning is that the relevant ruleset can be long and cross-referenced, so models may still fail to locate the rules needed for a particular reasoning step. We introduce Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model interacts with the statutes on demand. We evaluate DAR under multiple harnesses on hard subsets of DeonticBench. Across these settings, we find that agentic harnesses can push the frontier on deontic reasoning tasks, but improvements are not uniform: weaker models often degrade on numerical tasks while consuming far more tokens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DAR tests agentic harnesses on deontic tasks and reports uneven gains plus higher costs for weaker models.

read the letter

The main point is that DAR lets models query statutes interactively during reasoning and this sometimes improves results on hard DeonticBench subsets, though weaker models often get worse on numerical items and use more tokens.

The paper takes the existing agentic harness idea and applies it to deontic reasoning where rulesets are long and cross-referenced. That is a direct response to a practical failure mode in standard prompting. Running the same setup under multiple harnesses on the hard subsets gives a basic check on whether the benefit holds across variations, which is useful.

The work is clear about the motivation and honest about the non-uniform outcomes. Deontic tasks appear in law and policy, so the framing connects to real use cases.

The soft spot is the reporting. The abstract states the high-level finding without baselines, numbers, error bars, or any description of the harness differences or statistical checks. That leaves the size of the claimed frontier push and the degradation effect hard to evaluate from what is shown. The token cost observation is noted but not quantified either.

This is for people working on LLM agents for rule-based domains. A reader focused on legal or regulatory applications could pick up the harness comparison and the limits on weaker models.

It deserves a serious referee because the problem is well-posed and the method is straightforward to test. I would send it for review if the full experiments include proper controls and comparisons.

Referee Report

1 major / 0 minor

Summary. The paper introduces Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model interacts with the statutes on demand to address the challenge of locating relevant rules in long, cross-referenced rulesets for tasks such as tax liability or immigration appeals. It evaluates DAR under multiple harnesses on hard subsets of DeonticBench and reports that agentic harnesses can push the frontier on deontic reasoning tasks, but improvements are not uniform: weaker models often degrade on numerical tasks while consuming far more tokens.

Significance. If the empirical results hold after proper validation, the work could be significant for advancing LLM-based deontic reasoning by demonstrating the value (and limitations) of dynamic, on-demand interaction with rule sets rather than static prompting. The non-uniformity finding across model strengths and the token-cost observation would be useful for practitioners deploying such systems in legal or policy domains. The paper is purely empirical with no equations, derivations, or fitted parameters.

major comments (1)

Abstract: The abstract states high-level findings without experimental details, baselines, error bars, statistical tests, or data descriptions, so it is impossible to determine whether the reported results actually support the claim that agentic harnesses push the frontier on deontic reasoning tasks.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review. We address the single major comment below and will revise the manuscript accordingly where appropriate.

read point-by-point responses

Referee: Abstract: The abstract states high-level findings without experimental details, baselines, error bars, statistical tests, or data descriptions, so it is impossible to determine whether the reported results actually support the claim that agentic harnesses push the frontier on deontic reasoning tasks.

Authors: We agree that the abstract, in its current form, is high-level and does not include the requested specifics. While abstracts are conventionally concise, we will revise it to incorporate brief references to the evaluated models, the hard subsets of DeonticBench, and the nature of the observed improvements (including the non-uniformity across model strengths). Full experimental details, baselines, token costs, and any statistical analyses remain in Sections 4 and 5 of the paper. This change will make the abstract more self-contained without exceeding typical length constraints. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is a purely empirical study introducing an agentic harness (DAR) for deontic reasoning and reporting performance on DeonticBench subsets. It contains no equations, derivations, fitted parameters, or claimed first-principles results. All load-bearing claims are experimental outcomes rather than reductions to self-defined inputs or self-citation chains, making the work self-contained against external benchmarks with no circular steps present.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical content, free parameters, axioms, or invented entities appear in the abstract; the contribution is an empirical method description.

pith-pipeline@v0.9.1-grok · 5672 in / 1188 out tokens · 48400 ms · 2026-06-28T05:53:18.972801+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 16 canonical work pages · 11 internal anchors

[1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972
[2]

Publications Manual , year = "1983", publisher =

1983
[3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
[5]

Dan Gusfield , title =. 1997

1997
[6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015
[7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
[8]

and Li, Zhening and Khattab, Omar , title =

Zhang, Alex L. and Li, Zhening and Khattab, Omar , title =. 2026 , month = apr, day =

2026
[9]

2025 , month = nov, day =

Young, Justin , title =. 2025 , month = nov, day =

2025
[10]

DeonticBench: A Benchmark for Reasoning over Rules

DeonticBench: A Benchmark for Reasoning over Rules , author=. arXiv preprint arXiv:2604.04443 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[11]

2025 , eprint=

Kimi K2: Open Agentic Intelligence , author=. 2025 , eprint=

2025
[12]

Qwen3-Coder-Next Technical Report

Qwen3-Coder-Next Technical Report , author=. arXiv preprint arXiv:2603.00729 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[13]

2026 , url=

Terminus-KIRA: Boosting Frontier Model Performance on Terminal-Bench with Minimal Harness , author=. 2026 , url=

2026
[14]

Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

Is Grep All You Need? How Agent Harnesses Reshape Agentic Search , author=. arXiv preprint arXiv:2605.15184 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction , author=. arXiv preprint arXiv:2605.05242 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[16]

2026 , publisher=

Agent Harness for Large Language Model Agents: A Survey , author=. 2026 , publisher=

2026
[17]

ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning

ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning , author=. arXiv preprint arXiv:2605.05737 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[18]

2026 , month = feb, day =

Lopopolo, Ryan , title =. 2026 , month = feb, day =

2026
[19]

Meta-Harness: End-to-End Optimization of Model Harnesses

Meta-harness: End-to-end optimization of model harnesses , author=. arXiv preprint arXiv:2603.28052 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Language Models and Logic Programs for Trustworthy Tax Reasoning , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[21]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (

Ruiwen Zhou and Wenyue Hua and Liangming Pan and Sitao Cheng and Xiaobao Wu and En Yu and William Yang Wang , title=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (
[22]

2026 , eprint=

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces , author=. 2026 , eprint=

2026
[23]

OpenAI GPT-5 System Card

Openai gpt-5 system card , author=. arXiv preprint arXiv:2601.03267 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Introducing Claude Sonnet 4.5 , howpublished =
[25]

Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=
[26]

Qwen3 Technical Report

Qwen3 Technical Report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Connecting Symbolic Statutory Reasoning with Legal Information Extraction

Holzenberger, Nils and Van Durme, Benjamin. Connecting Symbolic Statutory Reasoning with Legal Information Extraction. Proceedings of the Natural Legal Language Processing Workshop 2023. 2023. doi:10.18653/v1/2023.nllp-1.12

work page doi:10.18653/v1/2023.nllp-1.12 2023
[28]

Recursive Language Models

Recursive language models , author=. arXiv preprint arXiv:2512.24601 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[29]

The Twelfth International Conference on Learning Representations , year=

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines , author=. The Twelfth International Conference on Learning Representations , year=
[30]

arXiv preprint arXiv:2510.05592 , year=

In-the-flow agentic system optimization for effective planning and tool use , author=. arXiv preprint arXiv:2510.05592 , year=

work page arXiv
[31]

arXiv preprint arXiv:2603.20278 , year=

Openresearcher: A fully open pipeline for long-horizon deep research trajectory synthesis , author=. arXiv preprint arXiv:2603.20278 , year=

work page arXiv
[32]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Search-r1: Training llms to reason and leverage search engines with reinforcement learning , author=. arXiv preprint arXiv:2503.09516 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[33]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Folio: Natural language reasoning with first-order logic , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

2024
[34]

arXiv preprint arXiv:2501.14851 , year=

Justlogic: A comprehensive benchmark for evaluating deductive reasoning in large language models , author=. arXiv preprint arXiv:2501.14851 , year=

work page arXiv
[35]

2026 , eprint=

CL-bench: A Benchmark for Context Learning , author=. 2026 , eprint=

2026
[36]

Proceedings of the Natural Legal Language Processing Workshop 2024 , pages=

Gaps or hallucinations? scrutinizing machine-generated legal analysis for fine-grained text evaluations , author=. Proceedings of the Natural Legal Language Processing Workshop 2024 , pages=

2024
[37]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=

Is that your final answer? test-time scaling improves selective question answering , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=
[38]

Conformal Thinking: Risk Control for Reasoning on a Compute Budget

Conformal Thinking: Risk Control for Reasoning on a Compute Budget , author=. arXiv preprint arXiv:2602.03814 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972

[2] [2]

Publications Manual , year = "1983", publisher =

1983

[3] [3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981

[4] [4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

[5] [5]

Dan Gusfield , title =. 1997

1997

[6] [6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015

[7] [7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

[8] [8]

and Li, Zhening and Khattab, Omar , title =

Zhang, Alex L. and Li, Zhening and Khattab, Omar , title =. 2026 , month = apr, day =

2026

[9] [9]

2025 , month = nov, day =

Young, Justin , title =. 2025 , month = nov, day =

2025

[10] [10]

DeonticBench: A Benchmark for Reasoning over Rules

DeonticBench: A Benchmark for Reasoning over Rules , author=. arXiv preprint arXiv:2604.04443 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

2025 , eprint=

Kimi K2: Open Agentic Intelligence , author=. 2025 , eprint=

2025

[12] [12]

Qwen3-Coder-Next Technical Report

Qwen3-Coder-Next Technical Report , author=. arXiv preprint arXiv:2603.00729 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

2026 , url=

Terminus-KIRA: Boosting Frontier Model Performance on Terminal-Bench with Minimal Harness , author=. 2026 , url=

2026

[14] [14]

Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

Is Grep All You Need? How Agent Harnesses Reshape Agentic Search , author=. arXiv preprint arXiv:2605.15184 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction , author=. arXiv preprint arXiv:2605.05242 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

2026 , publisher=

Agent Harness for Large Language Model Agents: A Survey , author=. 2026 , publisher=

2026

[17] [17]

ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning

ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning , author=. arXiv preprint arXiv:2605.05737 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

2026 , month = feb, day =

Lopopolo, Ryan , title =. 2026 , month = feb, day =

2026

[19] [19]

Meta-Harness: End-to-End Optimization of Model Harnesses

Meta-harness: End-to-end optimization of model harnesses , author=. arXiv preprint arXiv:2603.28052 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Language Models and Logic Programs for Trustworthy Tax Reasoning , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[21] [21]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (

Ruiwen Zhou and Wenyue Hua and Liangming Pan and Sitao Cheng and Xiaobao Wu and En Yu and William Yang Wang , title=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (

[22] [22]

2026 , eprint=

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces , author=. 2026 , eprint=

2026

[23] [23]

OpenAI GPT-5 System Card

Openai gpt-5 system card , author=. arXiv preprint arXiv:2601.03267 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

Introducing Claude Sonnet 4.5 , howpublished =

[25] [25]

Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

[26] [26]

Qwen3 Technical Report

Qwen3 Technical Report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

Connecting Symbolic Statutory Reasoning with Legal Information Extraction

Holzenberger, Nils and Van Durme, Benjamin. Connecting Symbolic Statutory Reasoning with Legal Information Extraction. Proceedings of the Natural Legal Language Processing Workshop 2023. 2023. doi:10.18653/v1/2023.nllp-1.12

work page doi:10.18653/v1/2023.nllp-1.12 2023

[28] [28]

Recursive Language Models

Recursive language models , author=. arXiv preprint arXiv:2512.24601 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[29] [29]

The Twelfth International Conference on Learning Representations , year=

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines , author=. The Twelfth International Conference on Learning Representations , year=

[30] [30]

arXiv preprint arXiv:2510.05592 , year=

In-the-flow agentic system optimization for effective planning and tool use , author=. arXiv preprint arXiv:2510.05592 , year=

work page arXiv

[31] [31]

arXiv preprint arXiv:2603.20278 , year=

Openresearcher: A fully open pipeline for long-horizon deep research trajectory synthesis , author=. arXiv preprint arXiv:2603.20278 , year=

work page arXiv

[32] [32]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Search-r1: Training llms to reason and leverage search engines with reinforcement learning , author=. arXiv preprint arXiv:2503.09516 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[33] [33]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Folio: Natural language reasoning with first-order logic , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

2024

[34] [34]

arXiv preprint arXiv:2501.14851 , year=

Justlogic: A comprehensive benchmark for evaluating deductive reasoning in large language models , author=. arXiv preprint arXiv:2501.14851 , year=

work page arXiv

[35] [35]

2026 , eprint=

CL-bench: A Benchmark for Context Learning , author=. 2026 , eprint=

2026

[36] [36]

Proceedings of the Natural Legal Language Processing Workshop 2024 , pages=

Gaps or hallucinations? scrutinizing machine-generated legal analysis for fine-grained text evaluations , author=. Proceedings of the Natural Legal Language Processing Workshop 2024 , pages=

2024

[37] [37]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=

Is that your final answer? test-time scaling improves selective question answering , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=

[38] [38]

Conformal Thinking: Risk Control for Reasoning on a Compute Budget

Conformal Thinking: Risk Control for Reasoning on a Compute Budget , author=. arXiv preprint arXiv:2602.03814 , year=

work page internal anchor Pith review Pith/arXiv arXiv