CoFEH: LLM-driven Feature Engineering Empowered by Collaborative Bayesian Hyperparameter Optimization

Beicheng Xu; Bin Cui; Keyao Ding; Wei Liu; Yupeng Lu

arxiv: 2602.09851 · v2 · pith:HD67Y2Q6new · submitted 2026-02-10 · 💻 cs.LG

Pith reviewed 2026-05-22 10:50 UTC · model grok-4.3

classification 💻 cs.LG

keywords: feature engineering, hyperparameter optimization, large language models, Bayesian optimization, AutoML, collaborative framework, tree of thought

The pith

CoFEH interleaves LLM feature engineering with Bayesian hyperparameter optimization through mutual context sharing to improve joint AutoML outcomes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CoFEH as a way to overcome the separation of feature engineering and hyperparameter optimization in automated machine learning. Traditional approaches are limited by fixed search spaces without semantic understanding, and prior LLM methods often produce features independently before tuning models in sequence. CoFEH instead runs an LLM module using Tree of Thought reasoning to create flexible feature pipelines while a Bayesian optimizer handles model parameters, with a selector deciding the order of steps and a mechanism passing information between the two so each informs the other. If successful, this joint process finds combinations that sequential workflows miss and delivers stronger predictive models across tasks. A reader would care because better automated pipelines reduce manual trial-and-error in building effective machine learning systems.

Core claim

CoFEH is a collaborative framework that interleaves an LLM-driven feature engineering optimizer powered by Tree of Thought to explore flexible pipelines, a Bayesian optimization module to tune downstream model hyperparameters, and a dynamic selector that adaptively chooses which module to run next, all supported by a mutual conditioning mechanism that shares context so the LLM and Bayesian components make mutually informed decisions.

What carries the argument

The mutual conditioning mechanism, which shares context between the LLM-based feature engineering optimizer and the Bayesian hyperparameter optimization module to enable adaptive interleaving and capture FE-HPO interactions.
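The interleaving idea can be sketched as a loop in which both optimizers read from and write to one shared context. This is a minimal sketch under stated assumptions, not the authors' implementation: `SharedContext`, `toy_objective`, `propose_fe`, `propose_hpo`, and the operator names are all hypothetical, random proposals stand in for the LLM and the Bayesian optimizer, and a simple alternating rule stands in for the dynamic selector.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SharedContext:
    """State that both optimizers read before proposing and update after evaluating."""
    pipeline: tuple = ()  # current best FE pipeline (operator names)
    params: dict = field(default_factory=lambda: {"depth": 3})
    best_score: float = float("-inf")
    history: list = field(default_factory=list)  # (module, pipeline, params, score)


OPERATORS = ["log", "square", "ratio", "bin"]  # hypothetical operator names


def toy_objective(pipeline, params):
    """Toy stand-in for a validation score, built to contain an FE-HPO interaction."""
    n_ops = len(set(pipeline))
    fe_gain = 0.1 * n_ops
    # The best depth shifts with the number of distinct engineered operators.
    hp_gain = -0.05 * abs(params["depth"] - (2 + n_ops))
    return fe_gain + hp_gain


def propose_fe(ctx, rng):
    """Stand-in for the LLM step. A real version would put ctx.params and
    ctx.history into the prompt; this stub only extends the current pipeline."""
    return ctx.pipeline + (rng.choice(OPERATORS),)


def propose_hpo(ctx, rng):
    """Stand-in for the BO step. A real version would condition its surrogate on
    ctx.pipeline; this stub only perturbs the current depth."""
    return {"depth": max(1, ctx.params["depth"] + rng.choice([-1, 1]))}


def run(budget, seed=0):
    rng = random.Random(seed)
    ctx = SharedContext()
    ctx.best_score = toy_objective(ctx.pipeline, ctx.params)
    for step in range(budget):
        module = "FE" if step % 2 == 0 else "HPO"  # placeholder for the selector
        if module == "FE":
            pipeline, params = propose_fe(ctx, rng), ctx.params
        else:
            pipeline, params = ctx.pipeline, propose_hpo(ctx, rng)
        score = toy_objective(pipeline, params)
        ctx.history.append((module, pipeline, params, score))
        if score > ctx.best_score:  # keep the incumbent visible to both modules
            ctx.pipeline, ctx.params, ctx.best_score = pipeline, params, score
    return ctx
```

The point of the design is that each proposal is evaluated against the other module's current incumbent, so an improvement found by one side changes what the other side sees on its next turn.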

If this is right

The joint workflow captures interactions that greedy FE-then-HPO pipelines overlook, leading to higher final model performance.
The LLM component generates unbounded operators informed by semantic reasoning while the Bayesian module tunes parameters on the resulting features.
The dynamic selector allows the system to switch between feature engineering and hyperparameter steps as needed during a single run.
Experiments demonstrate outperformance over both traditional AutoML tools and prior LLM-only feature engineering methods in standalone and combined settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar mutual-conditioning ideas could be applied to couple feature engineering with model architecture search rather than just hyperparameter tuning.
On datasets where domain knowledge is scarce, the LLM's reasoning step may reduce reliance on human-designed feature transformations.
Scaling the approach to larger LLMs or different Bayesian acquisition functions could be tested to check whether the performance edge grows or saturates.

Load-bearing premise

That sharing context between the LLM and Bayesian modules is enough to let them make decisions that reliably exploit interactions between feature choices and hyperparameter settings.

What would settle it

A controlled comparison on the same benchmark tasks between full CoFEH and an ablated version that runs feature engineering and hyperparameter optimization without any context sharing, measuring whether accuracy or efficiency gains disappear.
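One way to read such a comparison is as a paired test over benchmark tasks. The sketch below is an assumption about how the analysis could be done, not something the paper reports: `paired_bootstrap_ci` is a hypothetical helper that takes per-task scores for the full and ablated systems, in the same task order, and returns the mean paired difference with a percentile bootstrap interval.

```python
import random
from statistics import mean


def paired_bootstrap_ci(full, ablated, n_boot=2000, alpha=0.05, seed=0):
    """Mean paired difference (full minus ablated) with a percentile bootstrap CI.

    full and ablated hold one score per task, in the same task order.
    """
    if len(full) != len(ablated) or not full:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [f - a for f, a in zip(full, ablated)]
    rng = random.Random(seed)
    # Resample tasks with replacement and recompute the mean difference each time.
    boot = sorted(mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot))
    lo = boot[int((alpha / 2) * n_boot)]
    hi = boot[int((1 - alpha / 2) * n_boot) - 1]
    return mean(diffs), lo, hi
```

If the interval for the difference includes zero once context sharing is removed, that would suggest the gains do not depend on the mutual conditioning mechanism.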

Original abstract

Feature Engineering (FE) is pivotal in automated machine learning (AutoML) but remains a bottleneck for traditional methods, which operate within rigid search spaces and lack domain awareness. While Large Language Models (LLMs) offer a promising alternative to generate unbounded operators with semantic reasoning, existing methods focus on isolated subtasks such as feature generation, falling short of free-form FE pipelines. Moreover, they are rarely coupled with hyperparameter optimization (HPO) of the downstream ML model, leading to greedy "FE-then-HPO" workflows that cannot capture strong FE-HPO interactions. In this paper, we present CoFEH, a collaborative framework that interleaves LLM-based FE and Bayesian HPO for robust end-to-end AutoML. CoFEH uses an LLM-driven FE optimizer powered by Tree of Thought (TOT) to explore flexible FE pipelines, a Bayesian optimization (BO) module to solve HPO, and a dynamic optimizer selector that adaptively interleaves FE and HPO steps. Crucially, we introduce a mutual conditioning mechanism that shares context between LLM and BO, enabling mutually informed decisions. Experiments show that CoFEH outperforms both traditional and LLM-based baselines in both standalone FE and joint FE+HPO settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CoFEH interleaves LLM feature engineering with Bayesian HPO via dynamic selection and mutual conditioning, but gains may trace to unmatched compute budgets rather than the cross-module feedback.

The main takeaway is that this paper introduces CoFEH, which interleaves LLM-based feature engineering using Tree of Thought with Bayesian hyperparameter optimization through a dynamic selector and a mutual conditioning mechanism. The aim is to handle the interactions between features and hyperparameters that sequential methods miss.

What is new here is the collaborative interleaving framework. Instead of generating features first and then tuning hyperparameters separately, the system shares context so the LLM can consider current hyperparameter settings and the optimizer can account for feature changes. The dynamic selector decides adaptively when to run FE or HPO steps. This extends prior work on isolated LLM feature generation and greedy FE-then-HPO pipelines.

The paper does a good job explaining the limitations of rigid search spaces in traditional AutoML and how LLMs bring semantic reasoning to create more flexible pipelines. The joint optimization focus is a reasonable direction given how often features and model parameters depend on each other.

Where it gets soft is the experimental validation. The claims of outperformance in joint settings rely on the mutual conditioning capturing strong interactions, but if the total number of LLM calls and BO evaluations is not matched across CoFEH and the baselines, the gains might come from extra budget rather than from informed decisions. The abstract does not detail the protocols or ablations for this, so the full paper needs to demonstrate that the improvements do not simply come from running more steps. Soundness would be stronger with clear budget controls and error bars.

This paper targets people working on AutoML systems that incorporate LLMs for pipeline automation. Readers interested in practical tools for complex datasets could pick up useful ideas from the framework, particularly the interleaving approach. It shows clear thinking on the problem even if the results need more verification.

It deserves a serious referee to examine the experimental design and confirm whether the mutual conditioning adds value beyond additional compute.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes CoFEH, a collaborative framework that interleaves an LLM-driven feature engineering optimizer (powered by Tree of Thought for flexible pipelines) with a Bayesian optimization module for hyperparameter tuning. It introduces a mutual conditioning mechanism to share context between the LLM and BO components, along with a dynamic optimizer selector that adaptively interleaves FE and HPO steps. The central claim, supported by experiments, is that CoFEH outperforms both traditional and LLM-based baselines in standalone FE tasks and in joint FE+HPO settings by capturing strong interactions between feature engineering and model hyperparameters.

Significance. If the results hold under properly controlled budgets, the work could advance AutoML by demonstrating how semantic reasoning from LLMs can be productively coupled with optimization routines through mutual conditioning, moving beyond rigid search spaces and sequential FE-then-HPO pipelines. The emphasis on interaction capture via shared context is a potentially valuable direction for end-to-end automation.

major comments (1)

[Section 4 and experimental tables] The claim that CoFEH outperforms baselines in joint FE+HPO settings rests on the mutual conditioning mechanism. If the protocol does not fix the total number of LLM calls and BO evaluations across CoFEH and the 'FE-then-HPO' baselines (or if the dynamic selector performs more total steps), observed gains could arise from unequal compute budgets rather than informed cross-module decisions. Per-method budgets must be reported and an ablation with matched total evaluations included to substantiate the central claim.
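The budget control requested here could be enforced mechanically. The following is a minimal sketch, assuming each method routes every LLM call and every model evaluation through a counter; `BudgetLedger`, `BudgetExceeded`, and `budgets_match` are hypothetical names and do not come from the manuscript.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a method tries to spend past its allotted budget."""


@dataclass
class BudgetLedger:
    """Counts LLM calls and model evaluations for one method under fixed caps."""
    max_llm_calls: int
    max_model_evals: int
    llm_calls: int = 0
    model_evals: int = 0

    def charge_llm(self, n: int = 1) -> None:
        if self.llm_calls + n > self.max_llm_calls:
            raise BudgetExceeded("LLM call budget exhausted")
        self.llm_calls += n

    def charge_eval(self, n: int = 1) -> None:
        if self.model_evals + n > self.max_model_evals:
            raise BudgetExceeded("model evaluation budget exhausted")
        self.model_evals += n

    def report(self) -> dict:
        return {"llm_calls": self.llm_calls, "model_evals": self.model_evals}


def budgets_match(reports: dict) -> bool:
    """True when every method consumed identical counts of both resources."""
    counts = list(reports.values())
    return all(c == counts[0] for c in counts)
```

Publishing each method's `report()` next to its score would make a compute mismatch between methods visible at a glance.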

minor comments (1)

[Abstract and Section 3] The abstract and method description introduce the 'dynamic optimizer selector' without specifying its decision criteria or update rule; this should be formalized with pseudocode or equations for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive feedback. We address the major comment on the experimental budgets and controls in our point-by-point response below.

Referee: [Section 4 and experimental tables] The claim that CoFEH outperforms baselines in joint FE+HPO settings rests on the mutual conditioning mechanism. If the protocol does not fix the total number of LLM calls and BO evaluations across CoFEH and the 'FE-then-HPO' baselines (or if the dynamic selector performs more total steps), observed gains could arise from unequal compute budgets rather than informed cross-module decisions. Per-method budgets must be reported and an ablation with matched total evaluations included to substantiate the central claim.

Authors: We appreciate the referee pointing out this potential confound in our experimental setup. The manuscript describes a fixed overall budget for the AutoML process, but we agree that explicit reporting of the number of LLM calls and BO evaluations per method is necessary for clarity. In the revised version, we will add detailed tables or text in Section 4 reporting the exact counts for CoFEH and the baselines. Moreover, we will include an additional ablation study that enforces a strictly matched total number of evaluations (e.g., same number of LLM invocations and BO queries) across CoFEH and the sequential FE-then-HPO approach. This will allow us to isolate the benefit of the mutual conditioning and dynamic selector. We believe this revision will strengthen the evidence for our central claim.

Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; claims rest on experiments

The paper introduces an empirical framework CoFEH that interleaves LLM-driven feature engineering (via Tree of Thought) with Bayesian optimization for HPO, plus a mutual conditioning mechanism for context sharing. All central claims are grounded in experimental comparisons against traditional and LLM-based baselines in both standalone FE and joint FE+HPO settings. No equations, mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the design choices are presented as architectural proposals whose value is assessed via reported performance rather than reducing to definitional equivalence or prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review limits visibility into specific parameters or assumptions; the framework implicitly relies on LLM semantic reasoning capabilities and standard Bayesian optimization assumptions without detailing free parameters or invented entities.

pith-pipeline@v0.9.0 · 5755 in / 985 out tokens · 37088 ms · 2026-05-22T10:50:50.297888+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

CoFEH uses an LLM-driven FE optimizer powered by Tree of Thought (TOT) to explore flexible FE pipelines, a Bayesian optimization (BO) module to solve HPO, and a dynamic optimizer selector that adaptively interleaves FE and HPO steps. Crucially, we introduce a mutual conditioning mechanism...
IndisputableMonolith/Foundation/DimensionForcing.lean reality_from_one_distinction unclear

We employ a Predictor Upper Confidence Bound (PUCB) policy to dynamically decide which optimizer to execute at each step
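The quoted passage names the policy but does not give its scoring rule. As an illustration only, the sketch below assumes the prior-weighted form popularized by AlphaGo-style search, score = Q + c * P * sqrt(N) / (1 + n), which may differ from the rule the paper actually uses. The arm names, the priors, and the constant `c` are hypothetical.

```python
import math


def pucb_scores(mean_reward, visits, prior, c=1.0):
    """Score each optimizer: exploitation term plus a prior-weighted exploration bonus.

    mean_reward, visits and prior are dicts keyed by optimizer name.
    """
    total = sum(visits.values())
    return {
        arm: mean_reward[arm] + c * prior[arm] * math.sqrt(total) / (1 + visits[arm])
        for arm in visits
    }


def select_optimizer(mean_reward, visits, prior, c=1.0):
    """Return the optimizer with the highest score (ties go to the first key)."""
    scores = pucb_scores(mean_reward, visits, prior, c)
    return max(scores, key=scores.get)
```

Under this assumed rule, an optimizer that has been run rarely gets a large bonus, so the selector returns to it even when the other optimizer currently has the better average reward.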

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.