ACT: Agentic Classification Tree
Pith reviewed 2026-05-18 12:02 UTC · model grok-4.3
The pith
ACT adapts decision trees to text by using natural-language questions for each split, refined through LLM feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Agentic Classification Tree (ACT) formulates every decision-tree split as a natural-language question. These questions are evaluated for class impurity and iteratively refined with LLM feedback via TextGrad until separation improves. On text benchmarks the resulting trees match or surpass prompting baselines while delivering explicit, step-by-step decision paths that can be inspected and verified.
What carries the argument
The Agentic Classification Tree, which replaces numeric thresholds with natural-language questions that are optimized for impurity reduction through repeated LLM feedback.
If this is right
- High-stakes text decisions can be traced through explicit question sequences rather than opaque model outputs.
- Regulators gain a concrete set of rules to audit instead of free-form reasoning chains.
- The tree structure lets practitioners identify exactly which question determined a given classification.
- The method supplies a direct way to combine the transparency of decision trees with the flexibility of LLMs on unstructured inputs.
Where Pith is reading between the lines
- The same question-generation loop could be tested on image or audio data by letting the LLM describe visual or acoustic features.
- If question quality varies across different base LLMs, performance and interpretability may shift with model choice.
- Combining ACT with existing fairness constraints on the questions might limit hidden biases that the current refinement does not address.
- The approach suggests a broader pattern: turning other structured learning algorithms into agentic versions that operate on language.
Load-bearing premise
Iterative LLM feedback reliably produces questions that sharpen class separation without introducing hallucinations, inconsistencies, or biases that would make the paths unreliable.
What would settle it
Run the same text benchmark multiple times; if the generated questions produce unstable accuracy, contradictory paths, or fail to reduce impurity compared with non-optimized questions, the central claim does not hold.
read the original abstract
When used in high-stakes settings, AI systems are expected to produce decisions that are transparent, interpretable and auditable, a requirement increasingly expected by regulations. Decision trees such as CART provide clear and verifiable rules, but they are restricted to structured tabular data and cannot operate directly on unstructured inputs such as text. In practice, large language models (LLMs) are widely used for such data, yet prompting strategies such as chain-of-thought or prompt optimization still rely on free-form reasoning, limiting their ability to ensure trustworthy behaviors. We present the Agentic Classification Tree (ACT), which extends decision-tree methodology to unstructured inputs by formulating each split as a natural-language question, refined through impurity-based evaluation and LLM feedback via TextGrad. Experiments on text benchmarks show that ACT matches or surpasses prompting-based baselines while producing transparent and interpretable decision paths.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Agentic Classification Tree (ACT), which extends classical decision-tree methods to unstructured text inputs. Each split is realized as an LLM-generated natural-language question that is iteratively refined by an external TextGrad optimization loop driven by an impurity measure; the resulting tree is claimed to match or exceed prompting-based baselines on text classification benchmarks while yielding transparent, auditable decision paths.
Significance. If the central performance and interpretability claims are substantiated, ACT would constitute a useful methodological bridge between the verifiable structure of decision trees and the flexibility of LLMs for non-tabular data. The explicit use of impurity feedback to guide natural-language splits is a distinctive design choice that could support regulatory demands for auditable AI; however, the absence of quantitative results, ablations, and error analysis in the current draft limits assessment of whether these benefits are realized.
major comments (3)
- [§4] §4 (Experiments): the abstract and method description assert that ACT 'matches or surpasses prompting-based baselines,' yet no tables, accuracy figures, baseline definitions, statistical tests, or error bars are supplied. Without these data the central empirical claim cannot be evaluated and is therefore load-bearing for acceptance.
- [§3.2] §3.2 (TextGrad refinement loop): the construction relies on the assumption that iterative LLM feedback produces consistent, non-hallucinated natural-language questions that measurably reduce impurity. No ablation isolating the contribution of the TextGrad loop versus a single-pass question generator is reported; this omission directly undermines both the performance and the interpretability claims.
- [§3.1] §3.1 (Impurity evaluation): the manuscript does not define how impurity is computed from free-form natural-language questions (e.g., whether it uses an external classifier, embedding similarity, or LLM-as-judge). This missing operational detail is prerequisite to reproducing the claimed optimization and to verifying that the generated splits remain faithful to the data.
minor comments (2)
- [§3] The notation for the impurity function and the TextGrad update rule should be formalized with explicit equations rather than prose descriptions.
- [§5] A brief discussion of potential failure modes (hallucinated splits, spurious correlations, or inconsistent phrasing across tree levels) would strengthen the interpretability argument.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us improve the clarity and completeness of our manuscript. We address each major comment below and indicate the revisions made.
read point-by-point responses
-
Referee: [§4] §4 (Experiments): the abstract and method description assert that ACT 'matches or surpasses prompting-based baselines,' yet no tables, accuracy figures, baseline definitions, statistical tests, or error bars are supplied. Without these data the central empirical claim cannot be evaluated and is therefore load-bearing for acceptance.
Authors: We agree that the experimental section in the submitted draft was incomplete. In the revised manuscript, we have added detailed experimental results including tables with accuracy figures for ACT and prompting baselines (such as standard prompting, chain-of-thought, and prompt optimization methods), baseline definitions, results from multiple random seeds with error bars, and statistical tests (e.g., paired t-tests) to substantiate the performance claims. revision: yes
-
Referee: [§3.2] §3.2 (TextGrad refinement loop): the construction relies on the assumption that iterative LLM feedback produces consistent, non-hallucinated natural-language questions that measurably reduce impurity. No ablation isolating the contribution of the TextGrad loop versus a single-pass question generator is reported; this omission directly undermines both the performance and the interpretability claims.
Authors: We agree that an ablation is important to validate the contribution of the TextGrad refinement. We have added an ablation study in the revised manuscript that compares the iterative TextGrad loop against a single-pass question generator, demonstrating measurable improvements in impurity reduction and overall classification performance. revision: yes
-
Referee: [§3.1] §3.1 (Impurity evaluation): the manuscript does not define how impurity is computed from free-form natural-language questions (e.g., whether it uses an external classifier, embedding similarity, or LLM-as-judge). This missing operational detail is prerequisite to reproducing the claimed optimization and to verifying that the generated splits remain faithful to the data.
Authors: We appreciate this observation. The revised manuscript now includes a detailed description in §3.1 of how impurity is evaluated: we use an LLM-as-judge to assess class separation based on the natural-language question, combined with embedding-based similarity measures for the resulting data partitions. We provide the exact evaluation prompt and algorithm to ensure reproducibility. revision: yes
Circularity Check
No significant circularity: method uses external TextGrad loop and impurity measures independent of final metric
full rationale
The paper defines ACT as an algorithmic procedure that generates natural-language splits via LLM, refines them with TextGrad feedback on impurity, and builds a tree evaluated on held-out benchmarks. No equation or step equates a claimed prediction or uniqueness result to a fitted parameter by construction, nor does any load-bearing premise reduce to a self-citation chain. The derivation is self-contained against external benchmarks because the optimization loop and impurity criterion are defined outside the final accuracy numbers, and experiments report comparative performance rather than tautological recovery of inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Large language models can generate and iteratively improve natural-language questions that reduce class impurity in a decision-tree setting.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
each split as a natural-language question, refined through impurity-based evaluation and LLM feedback via TextGrad
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
transparent and interpretable decision paths
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.