ACT: Agentic Classification Tree

James Zou; Marcin Detyniecki; Sylvain Lamprier; Thibault Laugel; Tim Arni; Vincent Grari

arxiv: 2509.26433 · v4 · submitted 2025-09-30 · 💻 cs.LG · cs.AI

ACT: Agentic Classification Tree

Vincent Grari , Tim Arni , Thibault Laugel , Sylvain Lamprier , James Zou , Marcin Detyniecki This is my paper

Pith reviewed 2026-05-18 12:02 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords decision treestext classificationinterpretabilitylarge language modelsagentic methodsTextGradtransparent AIunstructured data

0 comments

The pith

ACT adapts decision trees to text by using natural-language questions for each split, refined through LLM feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method to extend classic decision trees beyond tabular data so they can classify unstructured text. Each split becomes a natural-language question that an LLM generates and then improves using impurity measures and iterative feedback. The resulting trees produce classification paths that remain human-readable and auditable. Experiments indicate the approach reaches accuracy levels that match or exceed direct prompting techniques. This setup addresses the need for verifiable rules when AI handles high-stakes text decisions.

Core claim

The Agentic Classification Tree (ACT) formulates every decision-tree split as a natural-language question. These questions are evaluated for class impurity and iteratively refined with LLM feedback via TextGrad until separation improves. On text benchmarks the resulting trees match or surpass prompting baselines while delivering explicit, step-by-step decision paths that can be inspected and verified.

What carries the argument

The Agentic Classification Tree, which replaces numeric thresholds with natural-language questions that are optimized for impurity reduction through repeated LLM feedback.

If this is right

High-stakes text decisions can be traced through explicit question sequences rather than opaque model outputs.
Regulators gain a concrete set of rules to audit instead of free-form reasoning chains.
The tree structure lets practitioners identify exactly which question determined a given classification.
The method supplies a direct way to combine the transparency of decision trees with the flexibility of LLMs on unstructured inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same question-generation loop could be tested on image or audio data by letting the LLM describe visual or acoustic features.
If question quality varies across different base LLMs, performance and interpretability may shift with model choice.
Combining ACT with existing fairness constraints on the questions might limit hidden biases that the current refinement does not address.
The approach suggests a broader pattern: turning other structured learning algorithms into agentic versions that operate on language.

Load-bearing premise

Iterative LLM feedback reliably produces questions that sharpen class separation without introducing hallucinations, inconsistencies, or biases that would make the paths unreliable.

What would settle it

Run the same text benchmark multiple times; if the generated questions produce unstable accuracy, contradictory paths, or fail to reduce impurity compared with non-optimized questions, the central claim does not hold.

read the original abstract

When used in high-stakes settings, AI systems are expected to produce decisions that are transparent, interpretable and auditable, a requirement increasingly expected by regulations. Decision trees such as CART provide clear and verifiable rules, but they are restricted to structured tabular data and cannot operate directly on unstructured inputs such as text. In practice, large language models (LLMs) are widely used for such data, yet prompting strategies such as chain-of-thought or prompt optimization still rely on free-form reasoning, limiting their ability to ensure trustworthy behaviors. We present the Agentic Classification Tree (ACT), which extends decision-tree methodology to unstructured inputs by formulating each split as a natural-language question, refined through impurity-based evaluation and LLM feedback via TextGrad. Experiments on text benchmarks show that ACT matches or surpasses prompting-based baselines while producing transparent and interpretable decision paths.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Agentic Classification Tree (ACT), which extends classical decision-tree methods to unstructured text inputs. Each split is realized as an LLM-generated natural-language question that is iteratively refined by an external TextGrad optimization loop driven by an impurity measure; the resulting tree is claimed to match or exceed prompting-based baselines on text classification benchmarks while yielding transparent, auditable decision paths.

Significance. If the central performance and interpretability claims are substantiated, ACT would constitute a useful methodological bridge between the verifiable structure of decision trees and the flexibility of LLMs for non-tabular data. The explicit use of impurity feedback to guide natural-language splits is a distinctive design choice that could support regulatory demands for auditable AI; however, the absence of quantitative results, ablations, and error analysis in the current draft limits assessment of whether these benefits are realized.

major comments (3)

[§4] §4 (Experiments): the abstract and method description assert that ACT 'matches or surpasses prompting-based baselines,' yet no tables, accuracy figures, baseline definitions, statistical tests, or error bars are supplied. Without these data the central empirical claim cannot be evaluated and is therefore load-bearing for acceptance.
[§3.2] §3.2 (TextGrad refinement loop): the construction relies on the assumption that iterative LLM feedback produces consistent, non-hallucinated natural-language questions that measurably reduce impurity. No ablation isolating the contribution of the TextGrad loop versus a single-pass question generator is reported; this omission directly undermines both the performance and the interpretability claims.
[§3.1] §3.1 (Impurity evaluation): the manuscript does not define how impurity is computed from free-form natural-language questions (e.g., whether it uses an external classifier, embedding similarity, or LLM-as-judge). This missing operational detail is prerequisite to reproducing the claimed optimization and to verifying that the generated splits remain faithful to the data.

minor comments (2)

[§3] The notation for the impurity function and the TextGrad update rule should be formalized with explicit equations rather than prose descriptions.
[§5] A brief discussion of potential failure modes (hallucinated splits, spurious correlations, or inconsistent phrasing across tree levels) would strengthen the interpretability argument.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and completeness of our manuscript. We address each major comment below and indicate the revisions made.

read point-by-point responses

Referee: [§4] §4 (Experiments): the abstract and method description assert that ACT 'matches or surpasses prompting-based baselines,' yet no tables, accuracy figures, baseline definitions, statistical tests, or error bars are supplied. Without these data the central empirical claim cannot be evaluated and is therefore load-bearing for acceptance.

Authors: We agree that the experimental section in the submitted draft was incomplete. In the revised manuscript, we have added detailed experimental results including tables with accuracy figures for ACT and prompting baselines (such as standard prompting, chain-of-thought, and prompt optimization methods), baseline definitions, results from multiple random seeds with error bars, and statistical tests (e.g., paired t-tests) to substantiate the performance claims. revision: yes
Referee: [§3.2] §3.2 (TextGrad refinement loop): the construction relies on the assumption that iterative LLM feedback produces consistent, non-hallucinated natural-language questions that measurably reduce impurity. No ablation isolating the contribution of the TextGrad loop versus a single-pass question generator is reported; this omission directly undermines both the performance and the interpretability claims.

Authors: We agree that an ablation is important to validate the contribution of the TextGrad refinement. We have added an ablation study in the revised manuscript that compares the iterative TextGrad loop against a single-pass question generator, demonstrating measurable improvements in impurity reduction and overall classification performance. revision: yes
Referee: [§3.1] §3.1 (Impurity evaluation): the manuscript does not define how impurity is computed from free-form natural-language questions (e.g., whether it uses an external classifier, embedding similarity, or LLM-as-judge). This missing operational detail is prerequisite to reproducing the claimed optimization and to verifying that the generated splits remain faithful to the data.

Authors: We appreciate this observation. The revised manuscript now includes a detailed description in §3.1 of how impurity is evaluated: we use an LLM-as-judge to assess class separation based on the natural-language question, combined with embedding-based similarity measures for the resulting data partitions. We provide the exact evaluation prompt and algorithm to ensure reproducibility. revision: yes

Circularity Check

0 steps flagged

No significant circularity: method uses external TextGrad loop and impurity measures independent of final metric

full rationale

The paper defines ACT as an algorithmic procedure that generates natural-language splits via LLM, refines them with TextGrad feedback on impurity, and builds a tree evaluated on held-out benchmarks. No equation or step equates a claimed prediction or uniqueness result to a fitted parameter by construction, nor does any load-bearing premise reduce to a self-citation chain. The derivation is self-contained against external benchmarks because the optimization loop and impurity criterion are defined outside the final accuracy numbers, and experiments report comparative performance rather than tautological recovery of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that LLMs can serve as reliable oracles for refining classification questions and on the engineering choice to use TextGrad for that refinement; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)

domain assumption Large language models can generate and iteratively improve natural-language questions that reduce class impurity in a decision-tree setting.
The method depends on this capability for both initial split generation and TextGrad-based refinement.

pith-pipeline@v0.9.0 · 5678 in / 1250 out tokens · 55538 ms · 2026-05-18T12:02:05.149741+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

each split as a natural-language question, refined through impurity-based evaluation and LLM feedback via TextGrad
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

transparent and interpretable decision paths

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.