CoFEE: Reasoning Control for LLM-Based Feature Discovery
Pith reviewed 2026-05-09 22:19 UTC · model grok-4.3
The pith
Enforcing cognitive behaviors in LLMs produces more predictive features with lower cost than free-form prompting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoFEE enforces cognitive behaviors such as backward chaining from outcomes, subgoal decomposition, verification against observability and leakage criteria, and explicit backtracking of rejected paths. In controlled comparisons against vanilla LLM prompts, this produces features with a 15.2 percent higher average Success Rate Score, generates 29 percent fewer features, and reduces costs by 53.3 percent. Held-out feature evaluation indicates that the resulting features generalize beyond the discovery data.
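The three headline figures can be combined into a quantity the summary does not state: cost per generated feature. The sketch below is plain arithmetic on the reported percentages. It assumes, and this is an assumption rather than something the source confirms, that the 29 percent and 53.3 percent reductions are both measured against the same vanilla baseline run. The function name is hypothetical.

```python
# Reported relative changes versus the vanilla baseline (taken from the summary).
FEATURE_REDUCTION = 0.29   # 29 percent fewer features
COST_REDUCTION = 0.533     # 53.3 percent lower cost


def relative_cost_per_feature(feature_reduction: float, cost_reduction: float) -> float:
    """Return CoFEE cost per feature as a fraction of vanilla cost per feature.

    Assumes both reductions are measured against the same baseline run.
    """
    remaining_features = 1.0 - feature_reduction
    remaining_cost = 1.0 - cost_reduction
    return remaining_cost / remaining_features


ratio = relative_cost_per_feature(FEATURE_REDUCTION, COST_REDUCTION)
print(f"cost per feature relative to vanilla: {ratio:.3f}")
```

Under that assumption the ratio comes out near 0.66, so the cost saving is larger than the drop in feature count alone would explain: each generated feature is also cheaper to produce.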
What carries the argument
CoFEE (Cognitive Feature Engineering Engine), a framework that injects cognitive behaviors as structured inductive biases to guide LLM reasoning over candidate features.
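To make the idea of reasoning control concrete, here is a minimal sketch of a discovery loop with subgoal decomposition, verification, and recorded backtracking. It is not the authors' implementation. Every name in it (`Candidate`, `Trace`, `verify`, `discover`, `stub_propose`, and the example feature names) is hypothetical, the LLM call is replaced by a hard-coded stub, and the list of subgoals is assumed to have been produced already by working backward from the target outcome.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """A proposed feature. All fields are hypothetical, for illustration only."""
    name: str
    subgoal: str                   # sub-question this feature is meant to answer
    observable_at_decision: bool   # is the value known when the prediction is made?
    uses_post_outcome_data: bool   # is it computed from information after the outcome?


@dataclass
class Trace:
    """Accepted features plus an explicit record of rejected paths."""
    accepted: List[Candidate] = field(default_factory=list)
    rejected: List[Tuple[Candidate, str]] = field(default_factory=list)


def verify(candidate: Candidate) -> Optional[str]:
    """Return a rejection reason, or None if the candidate passes both checks."""
    if not candidate.observable_at_decision:
        return "not observable at decision time"
    if candidate.uses_post_outcome_data:
        return "leaks post-outcome information"
    return None


def discover(subgoals: Iterable[str],
             propose: Callable[[str], Iterable[Candidate]]) -> Trace:
    """Walk the subgoals, verify each proposal, and log every rejection."""
    trace = Trace()
    for subgoal in subgoals:                 # subgoal decomposition
        for candidate in propose(subgoal):   # stands in for an LLM call
            reason = verify(candidate)       # verification step
            if reason is None:
                trace.accepted.append(candidate)
            else:
                # Backtracking: drop the path but keep why it was dropped.
                trace.rejected.append((candidate, reason))
    return trace


def stub_propose(subgoal: str) -> List[Candidate]:
    """Hard-coded stand-in for the model; a real system would query an LLM."""
    table = {
        "prior track record": [
            Candidate("prior_activity_count", subgoal, True, False),
            Candidate("later_press_mentions", subgoal, True, True),
        ],
        "private intent": [
            Candidate("unrecorded_motivation", subgoal, False, False),
        ],
    }
    return table.get(subgoal, [])


trace = discover(["prior track record", "private intent"], stub_propose)
print([c.name for c in trace.accepted])
print([(c.name, why) for c, why in trace.rejected])
```

Keeping the rejected candidates together with their reasons is the design choice that matters here: it turns backtracking into an auditable record instead of a silent filter.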
If this is right
- Features generated under cognitive control score higher on the paper's predictability metric, the Success Rate Score.
- Fewer candidate features need to be proposed and tested, lowering overall complexity.
- Resource costs drop because fewer low-value features are generated and evaluated.
- Generalization improves when reasoning paths are verified against leakage and observability rules.
Where Pith is reading between the lines
- The same style of reasoning control could be applied to other LLM tasks that require abstraction or causal separation.
- Embedding CoFEE-style constraints inside automated machine-learning pipelines might reduce the need for manual feature review.
- Testing the approach on data types beyond those evaluated here would show whether the efficiency and quality gains are domain-specific.
Load-bearing premise
The observed gains come primarily from the enforced cognitive behaviors themselves rather than from other differences in prompt wording or evaluation procedure.
What would settle it
An experiment that keeps every other implementation detail identical but removes the cognitive behavior constraints, then measures whether success rates, feature count, and costs remain unchanged.
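One way to lay out such an experiment is as a fixed set of conditions that differ only in which behaviors are switched on. The sketch below enumerates them; the leave-one-out conditions go one step beyond the all-or-nothing test described above. The behavior identifiers and the function name are hypothetical labels, and the source does not say how CoFEE toggles behaviors internally.

```python
from typing import Dict, FrozenSet, Iterable

# Hypothetical identifiers for the four behaviors named in the paper.
BEHAVIORS = (
    "backward_chaining",
    "subgoal_decomposition",
    "verification",
    "backtracking",
)


def ablation_conditions(behaviors: Iterable[str] = BEHAVIORS) -> Dict[str, FrozenSet[str]]:
    """Map each condition name to the set of behaviors enabled in it."""
    all_on = frozenset(behaviors)
    conditions = {
        "all_behaviors": all_on,        # the full constrained setting
        "no_behaviors": frozenset(),    # same pipeline, constraints removed
    }
    # Leave-one-out conditions isolate the contribution of each behavior.
    for behavior in sorted(all_on):
        conditions[f"without_{behavior}"] = all_on - {behavior}
    return conditions


for name, enabled in ablation_conditions().items():
    print(f"{name}: {sorted(enabled)}")
```

Everything outside the enabled set (model, data, prompt length, evaluation procedure) would need to stay fixed across conditions for the comparison to be informative.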
Original abstract
Feature discovery from complex unstructured data is fundamentally a reasoning problem: it requires identifying abstractions that are predictive of a target outcome while avoiding leakage, proxies, and post-outcome signals. With the introduction of ever-improving Large Language Models (LLMs), our method provides a structured method for addressing this challenge. LLMs are well suited for this task by being able to process large amounts of information, but unconstrained feature generation can lead to weak features. In this work, we study reasoning control in LLMs by inducing cognitive behaviors for improving feature discovery. We introduce CoFEE (Cognitive Feature Engineering Engine), a reasoning control framework that enforces cognitive behaviors in how the LLM reasons during feature discovery. From a machine learning perspective, these cognitive behaviors act as structured inductive biases over the space of candidate features generated by the model. These behaviors have been exploited with success in ML models, and include backward chaining from outcomes, subgoal decomposition, verification against observability and leakage criteria, and explicit backtracking of rejected reasoning paths. In a controlled comparison, we show that enforcing cognitive behaviors yields features with higher empirical predictability than those under unconstrained vanilla LLM prompts. CoFEE achieves an average Success Rate Score that is 15.2% higher than the vanilla approach, while generating 29% fewer features and reducing costs by 53.3%. Using held-out feature evaluation, we assess whether cognitively induced features generalize beyond the data used for discovery. Our results indicate that, in our evaluated setting, reasoning control is associated with improvements in quality and efficiency of LLM-based feature discovery.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CoFEE (Cognitive Feature Engineering Engine), a framework that enforces specific cognitive behaviors—backward chaining from outcomes, subgoal decomposition, verification against observability/leakage criteria, and explicit backtracking—in LLMs for discovering predictive features from unstructured data. These behaviors are positioned as inductive biases drawn from ML. In a controlled comparison, the work claims that CoFEE produces features with higher empirical predictability than unconstrained vanilla LLM prompts, reporting an average 15.2% higher Success Rate Score, 29% fewer features generated, and 53.3% lower costs, with generalization assessed via held-out feature evaluation.
Significance. If the gains are attributable to the specific cognitive behaviors, the result would demonstrate a practical method for injecting ML-style inductive biases into LLM reasoning for feature discovery, improving both quality and efficiency over naive prompting. This could influence prompt-engineering practices in automated ML pipelines and highlight the value of structured reasoning control for tasks requiring abstraction and leakage avoidance.
major comments (2)
- [Results section] The central claim that the listed cognitive behaviors (backward chaining, subgoal decomposition, verification, backtracking) are the key inductive biases responsible for the 15.2% Success Rate Score lift, 29% feature reduction, and 53.3% cost reduction rests on a single head-to-head comparison against 'unconstrained vanilla LLM prompts.' No ablations that remove or vary individual behaviors, and no control condition using equivalently detailed but non-specific structured instructions (e.g., generic iterative refinement), are reported. This leaves open the possibility that any sufficiently detailed prompting explains the gains rather than the claimed ML-inspired behaviors.
- [Evaluation / Methods] The abstract and reported results provide no details on experimental design, baseline prompt definitions, exact Success Rate Score calculation, statistical significance testing, dataset characteristics, or controls for potential biases in the held-out evaluation. These omissions are load-bearing for verifying whether the observed improvements reflect genuine generalization or artifacts of the test setup.
minor comments (2)
- [Methods] Clarify the precise definition and computation of the Success Rate Score metric, including how it is calculated on held-out data, in the methods or evaluation section.
- [Abstract] The abstract would benefit from a brief statement of the data domains or tasks used in the experiments to contextualize the reported gains.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important areas for strengthening the manuscript. We address each major comment below, committing to revisions where appropriate to enhance the evidence for our claims.
- Referee: [Results section] The central claim that the listed cognitive behaviors (backward chaining, subgoal decomposition, verification, backtracking) are the key inductive biases responsible for the 15.2% Success Rate Score lift, 29% feature reduction, and 53.3% cost reduction rests on a single head-to-head comparison against 'unconstrained vanilla LLM prompts.' No ablations that remove or vary individual behaviors, and no control condition using equivalently detailed but non-specific structured instructions (e.g., generic iterative refinement), are reported. This leaves open the possibility that any sufficiently detailed prompting explains the gains rather than the claimed ML-inspired behaviors.
Authors: We agree that this is a valid point and that the current results do not fully isolate the effects of the specific cognitive behaviors. The comparison to vanilla prompting shows the benefit of the structured approach overall, but additional controls are needed to rule out that any detailed prompting would suffice. In the revised manuscript, we will add ablation studies that disable individual behaviors one at a time and include a baseline with generic iterative refinement instructions of similar detail. These experiments will be conducted on the same datasets to quantify the unique contribution of the ML-inspired inductive biases. revision: yes
- Referee: [Evaluation / Methods] The abstract and reported results provide no details on experimental design, baseline prompt definitions, exact Success Rate Score calculation, statistical significance testing, dataset characteristics, or controls for potential biases in the held-out evaluation. These omissions are load-bearing for verifying whether the observed improvements reflect genuine generalization or artifacts of the test setup.
Authors: We acknowledge the need for greater transparency in the reporting of our experimental setup. While some details are present in the full manuscript, they are insufficiently detailed. We will revise the Methods and Evaluation sections to provide complete information on the experimental design, including the precise baseline prompt templates, the mathematical definition of the Success Rate Score, results of statistical significance tests, full descriptions of the datasets used, and the methodology for held-out evaluation including any measures taken to prevent data leakage or bias. This will allow readers to fully assess the validity of the generalization claims. revision: yes
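As one illustration of the kind of significance test at issue, the sketch below computes a paired percentile-bootstrap confidence interval for the mean difference between two sets of per-task scores. It is not the authors' procedure. The Success Rate Score is not defined in the material reviewed here, so scores are treated as opaque numbers, and the function name, parameters, and example values are all hypothetical.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap_ci(treated: Sequence[float],
                        baseline: Sequence[float],
                        n_resamples: int = 2000,
                        alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for paired scores.

    Pairs are resampled with replacement; the bounds are percentiles of the
    resampled mean differences.
    """
    if len(treated) != len(baseline) or not treated:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [t - b for t, b in zip(treated, baseline)]
    rng = random.Random(seed)      # fixed seed keeps the sketch reproducible
    n = len(diffs)
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Made-up illustrative scores, NOT results from the paper.
constrained_scores = [0.6, 0.7, 0.65, 0.8]
vanilla_scores = [0.5, 0.6, 0.6, 0.7]

point, low, high = paired_bootstrap_ci(constrained_scores, vanilla_scores)
print(f"mean difference {point:.4f}, 95% interval [{low:.4f}, {high:.4f}]")
```

An interval that excludes zero would support a real difference between the two conditions on the tasks sampled; with only a handful of tasks, as in this toy input, the interval says little about other datasets.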
Circularity Check
No significant circularity detected
The paper's central claim rests on an empirical head-to-head comparison of CoFEE (with enforced cognitive behaviors) versus unconstrained vanilla LLM prompts, measured via Success Rate Score on held-out data, feature count, and cost. No equations, self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the reported 15.2% lift and efficiency gains are presented as observed outcomes rather than derived by construction from the method's inputs or prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can reliably execute structured cognitive behaviors such as backward chaining and leakage verification when prompted.
invented entities (1)
- CoFEE framework: no independent evidence