
arxiv: 2510.25223 · v3 · submitted 2025-10-29 · 💻 cs.AI

FELA: A Multi-Agent Evolutionary System for Feature Engineering of Industrial Event Log Data

Pith reviewed 2026-05-18 03:57 UTC · model grok-4.3

classification 💻 cs.AI
keywords feature engineering · multi-agent systems · large language models · event log data · evolutionary algorithms · industrial data · automated machine learning

The pith

FELA uses multiple LLM agents to evolve explainable features from industrial event logs that improve model results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes FELA as a multi-agent system that lets large language models collaborate to build features for complex event log data from industrial systems. Agents propose ideas, write code for them, critique the results, and learn from feedback to improve over successive rounds through a shared knowledge base. A reader would care because event logs are rich but hard to turn into useful inputs for models due to their size and mixed structures, so successful automation could shrink the human time needed while keeping the created features understandable to experts in the domain.

Core claim

FELA integrates the reasoning and coding capabilities of large language models with an insight-guided self-evolution paradigm in which specialized agents generate, validate, and implement novel feature ideas, an Evaluation Agent summarizes feedback to update a hierarchical knowledge base and dual-memory system, and an agentic evolution algorithm balances exploration and exploitation to enable continual improvement on heterogeneous industrial event logs.
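
The loop in this claim can be illustrated with a minimal sketch. It is not the paper's implementation: the `Experience` record, the `run_evolution` function, the zero reward for critic-rejected ideas, and all `toy_*` stand-ins are assumptions made for illustration, with plain Python callables in place of LLM agents.

```python
from dataclasses import dataclass
from itertools import accumulate
from typing import Callable, List, Optional


@dataclass
class Experience:
    """One round of the loop, as stored in a hypothetical knowledge base."""
    idea: str
    reward: float  # stand-in for a validation metric such as AUC
    note: str


def run_evolution(
    propose: Callable[[List[Experience]], str],
    implement: Callable[[str], str],
    critique: Callable[[str, str], bool],
    evaluate: Callable[[str], float],
    rounds: int,
    kb: Optional[List[Experience]] = None,
) -> List[Experience]:
    """Run propose -> implement -> critique -> evaluate, logging every round."""
    kb = [] if kb is None else kb
    for _ in range(rounds):
        idea = propose(kb)            # Idea Agent role: reads past feedback
        code = implement(idea)        # Code Agent role: idea -> feature code
        if not critique(idea, code):  # Critic Agent role: reject invalid work
            # Assumption: rejected ideas are logged with zero reward.
            kb.append(Experience(idea, 0.0, "rejected by critic"))
            continue
        # Evaluation Agent role: score the feature and record the outcome.
        kb.append(Experience(idea, evaluate(code), "evaluated"))
    return kb


def best_so_far(kb: List[Experience]) -> List[float]:
    """Cumulative maximum reward after each round."""
    return list(accumulate((e.reward for e in kb), max))


# Deterministic toy stand-ins for the four agent roles (all hypothetical).
TOY_SCORES = [0.60, 0.72, 0.0, 0.68, 0.75]


def toy_propose(kb: List[Experience]) -> str:
    return f"idea-{len(kb)}"


def toy_implement(idea: str) -> str:
    return f"pseudocode for {idea}"


def toy_critique(idea: str, code: str) -> bool:
    return not idea.endswith("-2")  # pretend the third idea is invalid


def toy_evaluate(code: str) -> float:
    index = int(code.rsplit("-", 1)[1])
    return TOY_SCORES[index % len(TOY_SCORES)]


history = run_evolution(toy_propose, toy_implement, toy_critique, toy_evaluate, rounds=5)
print(best_so_far(history))
```

The cumulative maximum is the same kind of quantity the ablation figure caption refers to ("cumulative max AUC"); here it is computed over toy scores only.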

What carries the argument

The collaboration of the Idea Agents, Code Agents, Critic Agents, and Evaluation Agent, together with the agentic evolution algorithm that combines reinforcement learning and genetic algorithm principles to drive self-evolution across the space of feature ideas.
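
One way to picture the exploration and exploitation balance is a bandit-style selection over ideas plus a variation step. The sketch below uses the standard UCB1 rule, chosen because the knowledge-base figure caption mentions UCB scores; the paper's actual scoring formula is not given on this page, so the `IdeaStats` class, the constant `c`, and the `spawn_variant` helper are assumptions.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class IdeaStats:
    """Running reward statistics for one feature idea (hypothetical structure)."""
    name: str
    pulls: int = 0
    total_reward: float = 0.0

    @property
    def mean(self) -> float:
        return self.total_reward / self.pulls if self.pulls else 0.0


def ucb_score(stats: IdeaStats, total_pulls: int, c: float = 1.4) -> float:
    """Standard UCB1: mean reward plus an exploration bonus."""
    if stats.pulls == 0:
        return math.inf  # untried ideas are always explored first
    return stats.mean + c * math.sqrt(math.log(total_pulls) / stats.pulls)


def select_idea(ideas: List[IdeaStats], c: float = 1.4) -> IdeaStats:
    """Pick the idea with the highest UCB score."""
    total = sum(s.pulls for s in ideas)
    return max(ideas, key=lambda s: ucb_score(s, total, c))


def update(stats: IdeaStats, reward: float) -> None:
    """Record the reward observed after evaluating a feature for this idea."""
    stats.pulls += 1
    stats.total_reward += reward


def spawn_variant(parent: IdeaStats, tag: str) -> IdeaStats:
    """GA-flavoured variation: derive a fresh, untried child idea from a parent."""
    return IdeaStats(name=f"{parent.name}/{tag}")


ideas = [IdeaStats("a", 10, 7.0), IdeaStats("b", 2, 1.2), IdeaStats("c")]
first_pick = select_idea(ideas)          # the untried idea "c"
update(first_pick, 0.1)
greedy_pick = select_idea(ideas, c=0.0)  # pure exploitation picks the best mean
child = spawn_variant(greedy_pick, "mutated")
```

Setting `c` to zero removes the exploration bonus, which makes the trade-off easy to inspect: larger values favour rarely tried ideas, smaller values favour ideas with the best observed mean reward.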

If this is right

  • FELA produces explainable and domain-relevant features from complex event logs.
  • Model performance improves on real industrial datasets with less manual feature work.
  • The system handles large scale, high dimensionality, diverse data types, and temporal or relational structures.
  • An agentic evolution process allows ongoing adaptation without rigid predefined operations.
  • The overall setup offers a general framework for automated and interpretable feature engineering in real-world environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same agent collaboration pattern could be tried on event data from non-industrial sources such as web analytics or sensor streams.
  • Over many evolution cycles the system might surface feature combinations that human engineers overlook because they cross multiple data types.
  • Embedding FELA outputs directly into existing machine learning pipelines would test how far the reduction in manual effort can extend.

Load-bearing premise

The multi-agent LLM system can reliably produce novel, valid, and superior features through agent collaboration and self-evolution without requiring substantial human oversight or post-hoc fixes.

What would settle it

A side-by-side test on held-out real industrial event log datasets in which models using features produced by FELA show no gain in predictive accuracy or interpretability compared with features created by human experts or standard automatic methods.
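
A side-by-side comparison of this kind could be scored roughly as follows. This is a sketch under stated assumptions: AUC is taken as the accuracy metric (the page's figure captions mention AUC), the pairwise AUC estimate and the paired bootstrap are standard techniques rather than the paper's protocol, and the function names and toy scores are invented for illustration.

```python
import random
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_gain(labels: Sequence[int], baseline: Sequence[float],
             candidate: Sequence[float]) -> float:
    """AUC of the candidate feature set minus AUC of the baseline feature set."""
    return auc(labels, candidate) - auc(labels, baseline)


def bootstrap_gain_interval(labels: Sequence[int], baseline: Sequence[float],
                            candidate: Sequence[float], n_boot: int = 1000,
                            alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Paired bootstrap percentile interval for the AUC gain."""
    auc_gain(labels, baseline, candidate)  # fail fast on single-class labels
    rng = random.Random(seed)
    n = len(labels)
    gains: List[float] = []
    while len(gains) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        gains.append(auc_gain(ys, [baseline[i] for i in idx],
                              [candidate[i] for i in idx]))
    gains.sort()
    low = gains[int((alpha / 2) * n_boot)]
    high = gains[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


# Toy held-out predictions from two models (values are made up).
y_true = [0, 0, 1, 1]
expert_scores = [0.1, 0.6, 0.4, 0.8]  # model on hand-built features
fela_scores = [0.1, 0.3, 0.4, 0.8]    # model on generated features
gain = auc_gain(y_true, expert_scores, fela_scores)
interval = bootstrap_gain_interval(y_true, expert_scores, fela_scores, n_boot=200)
```

An interval that includes zero would be consistent with the "no gain" outcome described above; interpretability would need a separate, human-judged protocol that this sketch does not cover.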

Figures

Figures reproduced from arXiv: 2510.25223 by Dong Fang, Haoyu Wang, Kun Ouyang.

Figure 1. Human data scientists explore existing ideas to derive features, …
Figure 2. System overview of FELA. The red arrows illustrate the core self-evolution loop of the system. The idea agents propose new feature concepts or experimental ideas, which are then translated into executable feature engineering code by the code agents. The generated code is executed and evaluated by the evaluate agent using real event log data to produce corresponding performance rewards. The resulting experi…
Figure 3. An illustration of the knowledge base. It contains rich information including idea insights, UCB scores, associated features and corresponding pseudocode, etc.
  […] distinct interpretations correspond to different features of the same high-level insight. Together, the ideas and their features form a two-layer hierarchical structure, as illustrated in […]
Figure 4. Thinking paradigm of our agents. All agents follow this reasoning framework unless otherwise stated.
  […] and (iii) creative, i.e., distinct from existing implementations under the same idea. Formally, $d_{i,j+1} = A_{\text{feature}}(M_s, M_l, I_i, H)$ (3). The feature implementation $d$ is represented as a tuple (reason, summary, pseudocode). The reason field allows the idea agent to explicitly articulate the rationale behind propo…
Figure 5. Long Term Memory. Long term memory is updated in an adaptive …
Figure 6. Short Term Memory. The related ideas are retrieved using RAG.
Figure 7. AUC comparison across different classifiers.
Figure 8. Case study of the generated feature engineering code.
Figure 11. Ablation study in FELA. The cumulative max AUC during the …
Figure 10. A snapshot of the knowledge database during evolution. It is a super …
Original abstract

Event log data, recording fine-grained user actions and system events, represent one of the most valuable assets for modern digital services. However, the complexity and heterogeneity of industrial event logs--characterized by large scale, high dimensionality, diverse data types, and intricate temporal or relational structures--make feature engineering extremely challenging. Existing automatic feature engineering approaches, such as AutoML or genetic methods, often suffer from limited explainability, rigid predefined operations, and poor adaptability to complicated heterogeneous data. In this paper, we propose FELA (Feature Engineering LLM Agents), a multi-agent evolutionary system that autonomously extracts meaningful and high-performing features from complex industrial event log data. FELA integrates the reasoning and coding capabilities of large language models (LLMs) with an insight-guided self-evolution paradigm. Specifically, FELA employs specialized agents--Idea Agents, Code Agents, and Critic Agents--to collaboratively generate, validate, and implement novel feature ideas. An Evaluation Agent summarizes feedback and updates a hierarchical knowledge base and dual-memory system to enable continual improvement. Moreover, FELA introduces an agentic evolution algorithm, combining reinforcement learning and genetic algorithm principles to balance exploration and exploitation across the idea space. Extensive experiments on real industrial datasets demonstrate that FELA can generate explainable, domain-relevant features that significantly improve model performance while reducing manual effort. Our results highlight the potential of LLM-based multi-agent systems as a general framework for automated, interpretable, and adaptive feature engineering in complex real-world environments.
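
The hierarchical knowledge base and memory retrieval mentioned in the abstract and figure captions could be represented along these lines. The field names follow the caption text (idea insights, UCB scores, and a feature tuple of reason, summary, and pseudocode), but the class names are assumptions, and the token-overlap similarity is only a simple stand-in for whatever embedding-based retrieval the system uses.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class FeatureImpl:
    """Lower layer: one concrete feature derived from an idea."""
    reason: str      # why the agent proposed this feature
    summary: str     # short description of the feature
    pseudocode: str  # outline handed to the code-writing step


@dataclass
class Idea:
    """Upper layer: a high-level insight with its feature implementations."""
    insight: str
    ucb_score: float = 0.0
    features: List[FeatureImpl] = field(default_factory=list)


def _tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def retrieve_related(ideas: List[Idea], query: str, k: int = 2) -> List[Idea]:
    """Return the k ideas whose insight text overlaps most with the query.

    Jaccard overlap on whitespace tokens is an assumption made to keep the
    sketch dependency-free; a real system would likely compare embeddings.
    """
    q = _tokens(query)

    def similarity(idea: Idea) -> float:
        t = _tokens(idea.insight)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(ideas, key=similarity, reverse=True)[:k]


knowledge_base = [
    Idea("user purchase frequency over time"),
    Idea("device type diversity"),
    Idea("time since last purchase"),
]
knowledge_base[2].features.append(
    FeatureImpl(
        reason="recent buyers may behave differently",
        summary="days between the last purchase event and the label date",
        pseudocode="max(purchase_ts) per user; label_ts - max_ts in days",
    )
)
related = retrieve_related(knowledge_base, "purchase time gap", k=2)
print([idea.insight for idea in related])
```

Keeping features nested under their parent idea mirrors the two-layer structure described in the Figure 3 caption, so several interpretations of one insight can be stored and scored together.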

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes FELA, a multi-agent evolutionary system that uses LLMs to autonomously perform feature engineering on complex industrial event log data. It employs Idea, Code, Critic, and Evaluation agents that collaborate to generate, validate, and refine features, supported by an insight-guided self-evolution paradigm with a hierarchical knowledge base, dual-memory system, and an agentic evolution algorithm combining reinforcement learning and genetic algorithms. The central claim is that this approach produces explainable, domain-relevant features that significantly improve downstream model performance while reducing manual effort, as demonstrated through extensive experiments on real industrial datasets.

Significance. If the empirical claims are substantiated with detailed metrics and protocols, the work could meaningfully advance automated feature engineering for heterogeneous, high-dimensional industrial data by demonstrating how multi-agent LLM systems can incorporate reasoning, self-evolution, and explainability beyond rigid AutoML or genetic baselines. The combination of RL+GA principles with agent feedback loops represents a potentially generalizable framework for adaptive, interpretable data preprocessing in real-world settings.

major comments (2)
  1. [Abstract] The assertion that 'extensive experiments on real industrial datasets demonstrate that FELA can generate explainable, domain-relevant features that significantly improve model performance' provides no quantitative metrics, baselines, statistical tests, dataset characteristics, or evaluation protocols. This absence directly undermines verification of the primary performance and effort-reduction claims.
  2. [System Description / Experiments] The self-evolution loop (via LLM summarization of Critic/Evaluation feedback into the hierarchical knowledge base and dual-memory) is presented as enabling reliable, low-oversight improvement. However, no data are reported on per-iteration invalid code rates, types of errors encountered (e.g., temporal aggregation or type mismatches on mixed industrial logs), or total human intervention hours required to produce executable features. These quantities are load-bearing for the 'reducing manual effort' component of the central claim.
minor comments (2)
  1. [Abstract] The abstract and system overview would benefit from a concise statement of the number of industrial datasets used and the specific downstream tasks (e.g., prediction targets) to allow readers to gauge scope.
  2. [Methods] Notation for the dual-memory system and hierarchical knowledge base should be introduced with explicit definitions or pseudocode early in the methods to improve readability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments that help strengthen the empirical presentation of our work. We address each major comment point by point below, indicating where revisions will be made.

Point-by-point responses
  1. Referee: [Abstract] The assertion that 'extensive experiments on real industrial datasets demonstrate that FELA can generate explainable, domain-relevant features that significantly improve model performance' provides no quantitative metrics, baselines, statistical tests, dataset characteristics, or evaluation protocols. This absence directly undermines verification of the primary performance and effort-reduction claims.

    Authors: We agree that the abstract would be more informative with explicit quantitative support. The detailed metrics, baselines (including AutoML and genetic methods), statistical tests, dataset characteristics, and evaluation protocols are provided in the Experiments section. In the revised manuscript we will update the abstract to concisely include key results such as average performance gains over baselines and the number of real industrial datasets evaluated.
    Revision: yes

  2. Referee: [System Description / Experiments] The self-evolution loop (via LLM summarization of Critic/Evaluation feedback into the hierarchical knowledge base and dual-memory) is presented as enabling reliable, low-oversight improvement. However, no data are reported on per-iteration invalid code rates, types of errors encountered (e.g., temporal aggregation or type mismatches on mixed industrial logs), or total human intervention hours required to produce executable features. These quantities are load-bearing for the 'reducing manual effort' component of the central claim.

    Authors: The self-evolution mechanism is designed to reduce oversight through iterative agent feedback and knowledge base updates. While the manuscript describes this process, we did not report granular per-iteration invalid code rates or exact human intervention hours. We will add a new paragraph in the Experiments section discussing observed feature generation success rates, common error categories encountered (such as temporal and type mismatches), and a qualitative account of the limited human review steps required. However, systematic logging of per-iteration invalid rates and precise total human hours was not performed in the original experiments.
    Revision: partial

standing simulated objections not resolved
  • Exact per-iteration invalid code rates and total human intervention hours, as these were not systematically recorded during the original experimental runs.
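
For context, the kind of logging the referee asks for is cheap to add. The sketch below shows one hypothetical way to record per-iteration invalid code rates and error types; the `RunLog` class, the `try_feature` helper, and the toy feature functions are all assumptions, and human intervention hours would still need separate, manual tracking.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Optional


def try_feature(fn: Callable[[dict], object], sample_row: dict) -> Optional[str]:
    """Run generated feature code on one row; return the error type or None."""
    try:
        fn(sample_row)
    except Exception as exc:  # broad on purpose: every failure gets classified
        return type(exc).__name__
    return None


class RunLog:
    """Counts attempts and failures per evolution iteration."""

    def __init__(self) -> None:
        self._attempts: Dict[int, int] = defaultdict(int)
        self._errors: Dict[int, Counter] = defaultdict(Counter)

    def record(self, iteration: int, error_type: Optional[str]) -> None:
        self._attempts[iteration] += 1
        if error_type is not None:
            self._errors[iteration][error_type] += 1

    def invalid_rate(self, iteration: int) -> float:
        attempts = self._attempts.get(iteration, 0)
        if attempts == 0:
            raise ValueError(f"no attempts recorded for iteration {iteration}")
        failed = sum(self._errors.get(iteration, Counter()).values())
        return failed / attempts

    def error_counts(self, iteration: int) -> Dict[str, int]:
        return dict(self._errors.get(iteration, Counter()))


# Toy event-log row and toy "generated" features (all made up).
row = {"ts": "2025-01-01", "count": 3}
generated = {
    0: [lambda r: r["count"] * 2,   # valid
        lambda r: r["ts"] + 1,      # type mismatch: str + int
        lambda r: r["missing"]],    # unknown column
    1: [lambda r: r["count"] + 1,
        lambda r: len(r["ts"])],
}

log = RunLog()
for iteration, features in generated.items():
    for feature_fn in features:
        log.record(iteration, try_feature(feature_fn, row))
```

Reporting `invalid_rate` per iteration alongside `error_counts` would show whether failures shrink as the knowledge base accumulates feedback.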

Circularity Check

0 steps flagged

No circularity: engineering framework evaluated on external data

Full rationale

The paper presents FELA as a multi-agent LLM system with Idea/Code/Critic/Evaluation agents, a hierarchical knowledge base, dual-memory updates, and an RL+GA evolution loop. Performance claims rest on experiments with real industrial datasets rather than any self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central claim to its own inputs. No derivation chain exists that collapses by construction; the system description and empirical results are independent of the target claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests primarily on the unverified assumption that current LLMs can perform reliable feature ideation and coding for heterogeneous logs; no explicit free parameters or invented physical entities are introduced in the abstract.

axioms (1)
  • domain assumption: Large language models possess sufficient reasoning and coding capabilities to generate and implement effective feature engineering ideas for complex event data.
    The entire agent collaboration and evolution process depends on this capability without stated limitations or fallback mechanisms.
invented entities (1)
  • Insight-guided self-evolution paradigm with dual-memory system and hierarchical knowledge base (no independent evidence)
    purpose: To enable continual improvement of feature ideas across iterations via feedback summarization.
    This component is introduced as part of FELA without external evidence or falsifiable predictions provided in the abstract.

pith-pipeline@v0.9.0 · 5795 in / 1424 out tokens · 40826 ms · 2026-05-18T03:57:47.359651+00:00 · methodology

