CoFEE: Reasoning Control for LLM-Based Feature Discovery

Aaron Ontoyin Yin; Ben Griffin; Fuat Alican; Joseph Ternasky; Kelvin Amoaba; Maximilian Westermann; Yagiz Ihlamur; Yigit Ihlamur; Zakari Salifu

arXiv: 2604.21584 · v1 · submitted 2026-04-23 · cs.AI · cs.CE · cs.LG

Pith reviewed 2026-05-09 22:19 UTC · model grok-4.3

classification: cs.AI · cs.CE · cs.LG

keywords: feature discovery · large language models · reasoning control · cognitive behaviors · inductive biases · unstructured data · machine learning

The pith

Enforcing cognitive behaviors in LLMs produces more predictive features at lower cost than free-form prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that adding structured reasoning steps to LLMs during feature discovery from unstructured data yields features with stronger empirical links to the target outcome. A sympathetic reader would care because unconstrained LLM generation often produces weak, leaking, or post-outcome signals that fail to generalize. By treating cognitive behaviors as inductive biases, the approach turns open-ended generation into a controlled process that improves both quality and efficiency. The reported gains show higher success on held-out tests while using fewer resources.

Core claim

CoFEE enforces cognitive behaviors such as backward chaining from outcomes, subgoal decomposition, verification against observability and leakage criteria, and explicit backtracking of rejected paths. In controlled comparisons against vanilla LLM prompts, this produces features with a 15.2 percent higher average Success Rate Score, generates 29 percent fewer features, and reduces costs by 53.3 percent. Held-out feature evaluation indicates that the resulting features generalize beyond the discovery data.
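To make the verification and backtracking behaviors concrete, here is a minimal sketch of a gate that checks each candidate feature against observability and leakage criteria and keeps an explicit record of rejected paths. It is not the paper's implementation: the `Candidate` fields, the `verify` and `discover` functions, and the example feature names are all hypothetical, and in practice the flags would come from LLM judgments rather than hand-set booleans.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    """A proposed feature plus the metadata a verifier would need (hypothetical)."""
    name: str
    observable_before_outcome: bool  # can it be measured before the outcome occurs?
    derived_from_outcome: bool       # does it encode the label itself (leakage)?


@dataclass
class DiscoveryLog:
    accepted: List[str] = field(default_factory=list)
    # Explicit record of backtracked paths: (feature name, rejection reason).
    rejected: List[Tuple[str, str]] = field(default_factory=list)


def verify(candidate: Candidate) -> Optional[str]:
    """Return a rejection reason, or None if the candidate passes both checks."""
    if not candidate.observable_before_outcome:
        return "post-outcome signal"
    if candidate.derived_from_outcome:
        return "leakage"
    return None


def discover(candidates: List[Candidate]) -> DiscoveryLog:
    log = DiscoveryLog()
    for cand in candidates:
        reason = verify(cand)
        if reason is None:
            log.accepted.append(cand.name)
        else:
            # Backtrack: drop the path but keep the reason so it can be audited.
            log.rejected.append((cand.name, reason))
    return log


# Hypothetical candidates, for illustration only.
candidates = [
    Candidate("prior_relevant_experience", True, False),
    Candidate("press_coverage_after_outcome", False, False),
    Candidate("outcome_restated_in_summary", True, True),
]
log = discover(candidates)
print(log.accepted)
print(log.rejected)
```

Keeping the rejected list separate from the accepted one is the design point: the rejection reasons form an audit trail that a later reasoning step could consult before proposing a similar feature again.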

What carries the argument

CoFEE (Cognitive Feature Engineering Engine), a framework that injects cognitive behaviors as structured inductive biases to guide LLM reasoning over candidate features.

If this is right

Features generated under cognitive control show higher predictability on evaluation metrics.
Fewer candidate features need to be proposed and tested, lowering overall complexity.
Resource costs drop because fewer low-value features are generated and evaluated.
Generalization improves when reasoning paths are verified against leakage and observability rules.
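A quick arithmetic check helps when reading the efficiency figures together. The sketch below assumes that the 29 percent feature reduction and the 53.3 percent cost reduction are both measured against the same vanilla baseline totals, which the abstract does not spell out. The variable names are illustrative.

```python
# Reported relative changes versus the vanilla baseline (from the abstract).
feature_reduction = 0.29   # 29% fewer features
cost_reduction = 0.533     # 53.3% lower cost

# Ratios of CoFEE totals to vanilla totals (assumes a shared baseline).
features_ratio = 1.0 - feature_reduction
cost_ratio = 1.0 - cost_reduction

# Cost per generated feature, relative to vanilla.
cost_per_feature_ratio = cost_ratio / features_ratio
print(f"cost per feature: {cost_per_feature_ratio:.3f}x vanilla")
```

Under that assumption the cost per generated feature is roughly two thirds of the vanilla figure, so the saving would not come from a smaller feature count alone. This is a reading of the reported numbers, not a claim made in the paper.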

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same style of reasoning control could be applied to other LLM tasks that require abstraction or causal separation.
Embedding CoFEE-style constraints inside automated machine-learning pipelines might reduce the need for manual feature review.
Testing the approach on data types beyond those evaluated here would show whether the efficiency and quality gains are domain-specific.

Load-bearing premise

The observed gains come primarily from the enforced cognitive behaviors themselves rather than from other differences in prompt wording or evaluation procedure.

What would settle it

An experiment that keeps every other implementation detail identical but removes the cognitive behavior constraints, then measures whether success rates, feature count, and costs remain unchanged.
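One way to lay out such an experiment is to enumerate the prompt conditions up front: the full behavior set, no behaviors, and one leave-one-out condition per behavior. The sketch below only builds that condition table. The behavior identifiers and the `ablation_conditions` helper are hypothetical names, and running each condition against an LLM is deliberately left out.

```python
# The four behaviors named in the abstract, as hypothetical identifiers.
BEHAVIORS = (
    "backward_chaining",
    "subgoal_decomposition",
    "verification",
    "backtracking",
)


def ablation_conditions(behaviors=BEHAVIORS):
    """Map a condition name to the set of behaviors enabled in that condition."""
    conditions = {
        "full": set(behaviors),   # every behavior enforced
        "vanilla": set(),         # unconstrained prompting
    }
    for behavior in behaviors:
        # Leave-one-out: everything except the ablated behavior stays fixed.
        conditions[f"minus_{behavior}"] = set(behaviors) - {behavior}
    return conditions


for name, enabled in ablation_conditions().items():
    print(f"{name:30s} {sorted(enabled)}")
```

Each condition would then be run with the same data, model, and evaluation procedure, so that any difference in success rate, feature count, or cost can be attributed to the behavior that was removed.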

Figures

Figures reproduced from arXiv: 2604.21584 by Aaron Ontoyin Yin, Ben Griffin, Fuat Alican, Joseph Ternasky, Kelvin Amoaba, Maximilian Westermann, Yagiz Ihlamur, Yigit Ihlamur, Zakari Salifu.

**Figure 1.** Overview of the CoFEE pipeline.

**Figure 2.** CoFEE pipeline. Agent 1 performs cognitive feature selection to construct an initial master list, which is refined via semantic similarity by Agent 2.

**Figure 3.** Diagram illustrating the Agent 2 process, in which semantically …
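The Figure 2 caption says Agent 2 refines the master list via semantic similarity, but the similarity measure is not given here. The sketch below shows the general shape of such a refinement step, using token-level Jaccard overlap as a simple stand-in for whatever embedding-based measure the paper actually uses. The function names, the threshold, and the example features are all assumptions.

```python
def tokens(text):
    """Lowercased whitespace tokens of a feature description."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a stand-in for semantic similarity."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def merge_similar(features, threshold=0.5):
    """Greedy pass: keep a feature only if it is not too close to one already kept."""
    kept = []
    for feat in features:
        if all(jaccard(feat, k) < threshold for k in kept):
            kept.append(feat)
    return kept


# Hypothetical master list, for illustration only.
master_list = [
    "founder has prior startup experience",
    "founder has prior startup experience abroad",
    "team includes a technical cofounder",
]
print(merge_similar(master_list))
```

The greedy pass is order-dependent: the first of several near-duplicates is the one that survives, so a real pipeline might instead cluster the features and pick a representative from each cluster.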

Original abstract

Feature discovery from complex unstructured data is fundamentally a reasoning problem: it requires identifying abstractions that are predictive of a target outcome while avoiding leakage, proxies, and post-outcome signals. With the introduction of ever-improving Large Language Models (LLMs), our method provides a structured method for addressing this challenge. LLMs are well suited for this task by being able to process large amounts of information, but unconstrained feature generation can lead to weak features. In this work, we study reasoning control in LLMs by inducing cognitive behaviors for improving feature discovery. We introduce CoFEE (Cognitive Feature Engineering Engine), a reasoning control framework that enforces cognitive behaviors in how the LLM reasons during feature discovery. From a machine learning perspective, these cognitive behaviors act as structured inductive biases over the space of candidate features generated by the model. These behaviors have been exploited with success in ML models, and include backward chaining from outcomes, subgoal decomposition, verification against observability and leakage criteria, and explicit backtracking of rejected reasoning paths. In a controlled comparison, we show that enforcing cognitive behaviors yields features with higher empirical predictability than those under unconstrained vanilla LLM prompts. CoFEE achieves an average Success Rate Score that is 15.2% higher than the vanilla approach, while generating 29% fewer features and reducing costs by 53.3%. Using held-out feature evaluation, we assess whether cognitively induced features generalize beyond the data used for discovery. Our results indicate that, in our evaluated setting, reasoning control is associated with improvements in quality and efficiency of LLM-based feature discovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CoFEE adds named cognitive behaviors to LLM feature discovery and reports gains over plain prompts, but the gains are not isolated from general structured instructions.

The key takeaway is that CoFEE applies structured cognitive behaviors to LLM reasoning for feature discovery and reports measurable improvements over basic prompting. Those gains look real in the reported numbers, but the design leaves room for doubt about what exactly caused them.

What the paper brings is a named framework that turns ML-style reasoning tactics into prompts for LLMs. Backward chaining from the outcome, breaking down subgoals, checking for observability and leakage, and backtracking on bad paths are all pulled in as controls. This is a step beyond ad-hoc prompting because it draws directly from how we think about feature engineering in traditional ML. The results section shows an average 15.2 percent higher success rate score, 29 percent fewer features produced, and 53.3 percent lower costs. Using held-out data for evaluation is a good choice here, as it tests whether the features actually predict on new data rather than just fitting the discovery set.

The main weakness is that the comparison does not separate the specific behaviors from the general effect of giving the model more detailed instructions. The paper only shows CoFEE against vanilla LLM prompts. Without ablations that remove one behavior at a time or a control condition with non-specific but structured guidance, any organized prompting could be responsible for the better features and efficiency. The abstract does not give full experimental details on how the success rate was calculated or what statistical tests were used, which makes the claims harder to assess at this stage.

This work is aimed at practitioners who want to automate feature engineering with LLMs on messy data sources. Data scientists dealing with unstructured inputs could pick up the CoFEE approach and adapt it. It is not a theoretical advance but a practical one that could fit into existing workflows.

The paper deserves a serious referee. It has a clear idea, some empirical support, and addresses a real pain point in applied ML. Reviewers can push for the missing controls and more transparency on the setup.

Recommendation: Send it to peer review with feedback focused on strengthening the causal link between the behaviors and the outcomes.

Referee Report

2 major / 2 minor

Summary. The paper introduces CoFEE (Cognitive Feature Engineering Engine), a framework that enforces specific cognitive behaviors—backward chaining from outcomes, subgoal decomposition, verification against observability/leakage criteria, and explicit backtracking—in LLMs for discovering predictive features from unstructured data. These behaviors are positioned as inductive biases drawn from ML. In a controlled comparison, the work claims that CoFEE produces features with higher empirical predictability than unconstrained vanilla LLM prompts, reporting an average 15.2% higher Success Rate Score, 29% fewer features generated, and 53.3% lower costs, with generalization assessed via held-out feature evaluation.

Significance. If the gains are attributable to the specific cognitive behaviors, the result would demonstrate a practical method for injecting ML-style inductive biases into LLM reasoning for feature discovery, improving both quality and efficiency over naive prompting. This could influence prompt-engineering practices in automated ML pipelines and highlight the value of structured reasoning control for tasks requiring abstraction and leakage avoidance.

major comments (2)

[Results section] The central claim that the listed cognitive behaviors (backward chaining, subgoal decomposition, verification, backtracking) are the key inductive biases responsible for the 15.2% Success Rate Score lift, 29% feature reduction, and 53.3% cost reduction rests on a single head-to-head comparison against 'unconstrained vanilla LLM prompts.' No ablations that remove or vary individual behaviors, and no control condition using equivalently detailed but non-specific structured instructions (e.g., generic iterative refinement), are reported. This leaves open the possibility that any sufficiently detailed prompting explains the gains rather than the claimed ML-inspired behaviors.
[Evaluation / Methods] The abstract and reported results provide no details on experimental design, baseline prompt definitions, exact Success Rate Score calculation, statistical significance testing, dataset characteristics, or controls for potential biases in the held-out evaluation. These omissions are load-bearing for verifying whether the observed improvements reflect genuine generalization or artifacts of the test setup.
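The call for statistical significance testing can be made concrete with a paired sign-flip permutation test, which suits a design where both approaches are scored on the same tasks or runs. This is a generic sketch, not a test the paper reports: `paired_permutation_test` is a hypothetical helper, and the scores are placeholder values chosen only to exercise the code.

```python
import random


def paired_permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip test on the mean of the paired differences a[i] - b[i]."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, the sign of each difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Placeholder per-run scores, NOT results from the paper.
cofee_scores = [0.62, 0.58, 0.66, 0.61, 0.64, 0.60]
vanilla_scores = [0.55, 0.54, 0.57, 0.56, 0.53, 0.58]
p_value = paired_permutation_test(cofee_scores, vanilla_scores)
print(f"p = {p_value:.4f}")
```

With only a handful of paired runs, the smallest attainable p-value is bounded by the number of sign patterns, which is one reason a report should state the number of runs alongside the test result.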

minor comments (2)

[Methods] Clarify the precise definition and computation of the Success Rate Score metric, including how it is calculated on held-out data, in the methods or evaluation section.
[Abstract] The abstract would benefit from a brief statement of the data domains or tasks used in the experiments to contextualize the reported gains.
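Since the Success Rate Score is not defined in the material available here, the sketch below shows only one plausible ingredient of such a metric: the precision of a binary feature on held-out rows, read as the conditional success probability among cases exhibiting the feature, together with its lift over the base rate. The `precision_and_lift` helper and the example rows are assumptions and should not be taken as the paper's definition.

```python
def precision_and_lift(feature_flags, outcomes):
    """Precision P(success | feature present) and its lift over the base rate.

    Both inputs are equal-length sequences of 0/1 values from a held-out split.
    Returns None when the feature never fires or the split is empty.
    """
    if len(feature_flags) != len(outcomes):
        raise ValueError("inputs must have equal length")
    n = len(outcomes)
    n_with_feature = sum(feature_flags)
    if n == 0 or n_with_feature == 0:
        return None
    hits = sum(1 for f, y in zip(feature_flags, outcomes) if f and y)
    precision = hits / n_with_feature
    base_rate = sum(outcomes) / n
    lift = precision / base_rate if base_rate > 0 else None
    return precision, lift


# Placeholder held-out rows, NOT data from the paper.
held_out_flags = [1, 1, 0, 0, 1, 0, 0, 1]
held_out_outcomes = [1, 0, 0, 0, 1, 0, 1, 1]
result = precision_and_lift(held_out_flags, held_out_outcomes)
print(result)
```

Computing the metric only on rows that played no part in discovery is what lets a high precision count as evidence of generalization rather than of fitting the discovery set.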

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for strengthening the manuscript. We address each major comment below, committing to revisions where appropriate to enhance the evidence for our claims.

Referee: [Results section] The central claim that the listed cognitive behaviors (backward chaining, subgoal decomposition, verification, backtracking) are the key inductive biases responsible for the 15.2% Success Rate Score lift, 29% feature reduction, and 53.3% cost reduction rests on a single head-to-head comparison against 'unconstrained vanilla LLM prompts.' No ablations that remove or vary individual behaviors, and no control condition using equivalently detailed but non-specific structured instructions (e.g., generic iterative refinement), are reported. This leaves open the possibility that any sufficiently detailed prompting explains the gains rather than the claimed ML-inspired behaviors.

Authors: We agree that this is a valid point and that the current results do not fully isolate the effects of the specific cognitive behaviors. The comparison to vanilla prompting shows the benefit of the structured approach overall, but additional controls are needed to rule out that any detailed prompting would suffice. In the revised manuscript, we will add ablation studies that disable individual behaviors one at a time and include a baseline with generic iterative refinement instructions of similar detail. These experiments will be conducted on the same datasets to quantify the unique contribution of the ML-inspired inductive biases.

revision: yes

Referee: [Evaluation / Methods] The abstract and reported results provide no details on experimental design, baseline prompt definitions, exact Success Rate Score calculation, statistical significance testing, dataset characteristics, or controls for potential biases in the held-out evaluation. These omissions are load-bearing for verifying whether the observed improvements reflect genuine generalization or artifacts of the test setup.

Authors: We acknowledge the need for greater transparency in the reporting of our experimental setup. While some details are present in the full manuscript, they are insufficiently detailed. We will revise the Methods and Evaluation sections to provide complete information on the experimental design, including the precise baseline prompt templates, the mathematical definition of the Success Rate Score, results of statistical significance tests, full descriptions of the datasets used, and the methodology for held-out evaluation including any measures taken to prevent data leakage or bias. This will allow readers to fully assess the validity of the generalization claims.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central claim rests on an empirical head-to-head comparison of CoFEE (with enforced cognitive behaviors) versus unconstrained vanilla LLM prompts, measured via Success Rate Score on held-out data, feature count, and cost. No equations, self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the reported 15.2% lift and efficiency gains are presented as observed outcomes rather than derived by construction from the method's inputs or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based on the abstract, the central claim rests on the assumption that LLMs can follow and benefit from enforced cognitive behaviors as inductive biases. No explicit free parameters are described, and the framework itself is the primary new element introduced.

axioms (1)

domain assumption: LLMs can reliably execute structured cognitive behaviors, such as backward chaining and leakage verification, when prompted
The framework depends on this capability to induce the desired reasoning patterns during feature generation.

invented entities (1)

CoFEE framework (no independent evidence)
purpose: To enforce cognitive behaviors as structured inductive biases for LLM feature discovery
Newly proposed system whose effectiveness is tested in the work.

pith-pipeline@v0.9.0 · 5619 in / 1196 out tokens · 50681 ms · 2026-05-09T22:19:09.335371+00:00 · methodology

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

[1] S. Xiong, Y. Ihlamur, F. Alican, and A. O. Yin, “GPTree: Towards explainable decision-making via LLM-powered decision trees,” 2024.

[2] J. Heaton, “An empirical analysis of feature engineering for predictive modeling,” 2016.

[3] Y. Wang, S. Wu, Y. Zhang, W. Wang, Z. Liu, J. Luo, and H. Fei, “Multimodal chain-of-thought reasoning: A comprehensive survey,” 2025.

[4] J. Wu, M. Feng, S. Zhang, F. Lv, R. Jin, F. Che, Z. Wen, and J. Tao, “Boosting multimodal reasoning with automated structured thinking,” 2025.

[5] K. Gandhi, A. Chakravarthy, A. Singh, N. Lile, and N. D. Goodman, “Cognitive behaviors that enable self-improving reasoners: Four habits of highly effective STaRs,” 2025.

[6] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” 2022.

[7] Y. Zhuang, Q. Liu, Z. A. Pardos, P. C. Kyllonen, J. Zu, Z. Huang, S. Wang, and E. Chen, “Position: AI evaluation should learn from how we test humans,” 2024.

[8] R. Chen, J. Ternasky, A. S. Kwesi, B. Griffin, A. O. Yin, Z. Salifu, K. Amoaba, X. Mu, F. Alican, and Y. Ihlamur, “VCBench: Benchmarking LLMs in venture capital,” 2025.

IX. APPENDIX. Two tables included comparing feature quality metrics for the two discovery approaches. Precision denotes the conditional success probability among founders exhibiting the ...
