pith. sign in

arxiv: 2606.29082 · v1 · pith:WZQAJKA4new · submitted 2026-06-27 · 💻 cs.CL · cs.LG

Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks

Pith reviewed 2026-06-30 09:12 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords evolution fine-tuninglarge language modelsevolutionary searchoptimization taskscross-task generalizationfine-tuningdiscovery agents
0
0 comments X

The pith

Evolution fine-tuning teaches language models reusable strategies for solving new optimization tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Evolution Fine-Tuning to convert trajectories from evolutionary search into training data that lets LLMs learn general skills for iterating on solutions. These skills are meant to transfer across tasks instead of being rebuilt from scratch for each new problem. A dataset spanning 371 tasks in 10 domains is used to fine-tune models from 2B to 9B parameters. The resulting models show average gains of 10.22 percent on 22 held-out tasks and reach competitive results on specific open problems when combined with reinforcement learning at test time.

Core claim

Evolution Fine-Tuning turns evolutionary search trajectories collected across 371 tasks into supervised signals that allow language models to internalize transferable strategies for mutation, backtracking, and iteration, producing measurable gains on unseen optimization problems from the same distribution.

What carries the argument

Evolution Fine-Tuning (EFT), a mid-training procedure that converts full evolutionary search trajectories into next-token prediction targets so the model learns cross-task evolution behavior.

If this is right

  • Fine-tuned models improve by 10.22 percent on average over their untuned base versions across 22 held-out tasks.
  • Paired with test-time reinforcement learning, the models reach state-of-the-art results on two circle-packing problems.
  • The same models outperform their base counterparts on the Erdős minimum-overlap problem.
  • Language models can function as reusable discovery agents that accumulate evolution experience rather than resetting for each new task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach suggests that meta-level search policies can be separated from domain-specific knowledge and stored in model weights.
  • Similar trajectory-based fine-tuning could be applied to other iterative methods such as gradient-based or Monte-Carlo search to test whether the benefit is specific to evolutionary scaffolds.
  • If the dataset size grows, the same method might close gaps on long-standing open conjectures that currently require human-designed scaffolds.

Load-bearing premise

Trajectories produced by standard evolutionary scaffolds contain general signals about effective search steps that a model can extract and apply to entirely different tasks rather than memorizing task-specific patterns.

What would settle it

No average performance lift on a fresh collection of 20+ held-out optimization tasks drawn from domains outside the original 10 would falsify the claim that the fine-tuning produces transferable evolution skills.

read the original abstract

Would experience designing faster GPU kernels also help close in on a long-standing open mathematical conjecture? Large Language Models (LLMs) integrated into evolutionary search have recently produced state-of-the-art solutions on optimization tasks, including open mathematical conjectures, GPU kernel design, scientific law discovery, and combinatorial puzzles. To achieve this, prior work applied search scaffolds to one target task at a time, so every new problem is approached from scratch and the experience accumulated during search is discarded once the model finishes its attempt. This leaves the capability of iteratively evolving a solution (e.g., knowing which part to mutate and how, deciding when to backtrack) entirely in the scaffold rather than in the model itself. Whether the model itself could acquire this capability and reuse it across different tasks has been largely unexamined. To address this, we introduce Evolution Fine-Tuning (EFT), a mid-training paradigm that teaches LLMs to evolve solutions across tasks by converting evolutionary search trajectories into supervision. We construct Finch Collection, a 156K-trajectory dataset spanning 10 domains and 371 optimization tasks, and fine-tune open-source LLMs from 2B to 9B parameters. Empirically, EFT confers cross-task generalization: across 22 held-out tasks, our models surpass their base counterparts by 10.22% on average. Furthermore, when paired with test-time RL, our model matches state-of-the-art performance on two circle-packing tasks and outperforms its base-model counterpart on the Erd\H{o}s minimum-overlap problem. EFT thus serves as a "practice phase" for general-purpose discovery agents that do not solve new problems from scratch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Evolution Fine-Tuning (EFT), a mid-training method that converts 156K evolutionary search trajectories spanning 10 domains and 371 tasks (Finch Collection) into supervision for fine-tuning LLMs (2B–9B parameters). The central claim is that this teaches reusable evolutionary skills (mutation, backtracking, iteration) enabling cross-task generalization, with models surpassing base counterparts by 10.22% on average across 22 held-out tasks; when combined with test-time RL the fine-tuned models match SOTA on two circle-packing tasks and outperform the base model on the Erdős minimum-overlap problem.

Significance. If the central claim holds, the work would be significant for shifting iterative search capabilities from external scaffolds into the model itself, supporting more general-purpose discovery agents. The construction of a multi-domain trajectory dataset at this scale is a concrete contribution that could enable further research on strategy transfer.

major comments (2)
  1. [Abstract] Abstract: the 10.22% average gain on 22 held-out tasks is presented without any reported controls for domain overlap between the 10 training domains and the held-out tasks, statistical significance testing, or ablations that remove task-specific signals while retaining general evolutionary operators. This information is required to distinguish internalization of transferable strategies from memorization of patterns within the same domains.
  2. [Abstract and §4] Abstract and §4 (empirical evaluation): no description is given of performance measurement protocols, data exclusion rules, or how trajectories were filtered, making it impossible to assess whether the reported gains on held-out tasks and the circle-packing/Erdős results are robust or could arise from scaffold-specific artifacts.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief statement of the base model sizes and the exact held-out task domains to allow readers to gauge the degree of domain shift.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in our empirical claims and protocols. We address each major comment below and will revise the manuscript to incorporate the requested details.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the 10.22% average gain on 22 held-out tasks is presented without any reported controls for domain overlap between the 10 training domains and the held-out tasks, statistical significance testing, or ablations that remove task-specific signals while retaining general evolutionary operators. This information is required to distinguish internalization of transferable strategies from memorization of patterns within the same domains.

    Authors: We agree the abstract omits these elements. The full manuscript (§3.2) selects the 22 held-out tasks from domains disjoint from the 10 training domains to reduce overlap, but we acknowledge this is not explicitly controlled or ablated in the reported results. In revision we will add to §4: (i) explicit documentation of domain disjointness, (ii) statistical significance testing (paired t-tests with p-values) on the 10.22% average improvement, and (iii) an ablation that retains general evolutionary operators while removing task-specific signals. These additions will directly address the distinction between strategy transfer and memorization. revision: yes

  2. Referee: [Abstract and §4] Abstract and §4 (empirical evaluation): no description is given of performance measurement protocols, data exclusion rules, or how trajectories were filtered, making it impossible to assess whether the reported gains on held-out tasks and the circle-packing/Erdős results are robust or could arise from scaffold-specific artifacts.

    Authors: The referee is correct that §4 lacks a consolidated description of these protocols. In the revision we will expand §4 with a dedicated subsection detailing: (i) performance measurement (success rate defined by objective improvement thresholds), (ii) data exclusion rules (e.g., discarding trajectories with syntax errors or non-convergent runs), and (iii) trajectory filtering criteria (minimum length, valid mutation rate, and convergence checks). This will enable assessment of robustness independent of scaffold artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: standard trajectory-supervised fine-tuning with held-out task evaluation

full rationale

The paper generates a 156K-trajectory dataset from evolutionary search scaffolds across 371 tasks in 10 domains, fine-tunes LLMs on this data, and reports average gains on 22 explicitly held-out tasks. This is a conventional train/test split in supervised learning; the held-out performance metric is not defined in terms of the training trajectories or scaffolds by construction. No equations, self-citations, or ansatzes are presented that reduce the central cross-task generalization claim to a tautology or fitted input. The load-bearing assumption (transferable strategies vs. task-specific patterns) is an empirical question tested by the held-out split rather than presupposed by the method itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that evolutionary trajectories encode generalizable discovery heuristics.

pith-pipeline@v0.9.1-grok · 5857 in / 1155 out tokens · 25098 ms · 2026-06-30T09:12:53.677719+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

64 extracted references · 36 canonical work pages · 20 internal anchors

  1. [1]

    Optimization by simulated annealing

    Scott Kirkpatrick, C Daniel Gelatt Jr, and Mario P Vecchi. Optimization by simulated annealing. science, 220(4598):671–680, 1983

  2. [2]

    Some remarks on number theory.Riveon Lematematika, 9:45–48, 1955

    Paul Erdős. Some remarks on number theory.Riveon Lematematika, 9:45–48, 1955

  3. [3]

    A new bound for erdős’ minimum overlap problem.Acta Arithmetica, 208: 235–255, 2023

    Ethan Patrick White. A new bound for erdős’ minimum overlap problem.Acta Arithmetica, 208: 235–255, 2023

  4. [4]

    KernelBench: Can LLMs Write Efficient GPU Kernels?

    Anne Ouyang, Simon Guo, Simran Arora, Alex L Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. Kernelbench: Can llms write efficient gpu kernels?arXiv preprint arXiv:2502.10517, 2025

  5. [5]

    Llm-srbench: A new benchmark for scientific equation discovery with large language models.arXiv preprint arXiv:2504.10415, 2025

    Parshin Shojaee, Ngoc-Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa D Doan, and Chandan K Reddy. Llm-srbench: A new benchmark for scientific equation discovery with large language models.arXiv preprint arXiv:2504.10415, 2025

  6. [6]

    Mathematicaldiscoveriesfromprogramsearchwithlargelanguagemodels.Nature,625(7995):468–475, 2024

    BernardinoRomera-Paredes,MohammadaminBarekatain,AlexanderNovikov,MatejBalog,MPawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematicaldiscoveriesfromprogramsearchwithlargelanguagemodels.Nature,625(7995):468–475, 2024

  7. [7]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wag- ner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025

  8. [8]

    ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

    Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin. Shinkaevolve: Towards open-ended and sample-efficient program evolution.arXiv preprint arXiv:2509.19349, 2025

  9. [9]

    CodeEvolve: an open source evolutionary coding agent for algorithmic discovery and optimization

    Henrique Assumpção, Diego Ferreira, Leandro Campos, and Fabricio Murai. Codeevolve: An open source evolutionary coding agent for algorithm discovery and optimization.arXiv preprint arXiv:2510.14150, 2025

  10. [10]

    Pacevolve: Enabling long-horizon progress-aware consistent evolution.arXiv preprint arXiv:2601.10657, 2026

    Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Shuo Chen, Zhankui He, Noveen Sachdeva, Isabella Ye, Weili Wang, et al. Pacevolve: Enabling long-horizon progress-aware consistent evolution.arXiv preprint arXiv:2601.10657, 2026

  11. [11]

    Adaevolve: Adaptive llm driven zeroth-order optimization.arXiv preprint arXiv:2602.20133, 2026

    Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, et al. Adaevolve: Adaptive llm driven zeroth-order optimization.arXiv preprint arXiv:2602.20133, 2026

  12. [12]

    Evox: Meta-evolutionforautomated discovery.arXiv preprint arXiv:2602.23413, 2026

    Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, AshwinNaren,EthanBoneh,AudreyCheng,MelissaZPan,etal. Evox: Meta-evolutionforautomated discovery.arXiv preprint arXiv:2602.23413, 2026

  13. [13]

    ThetaEvolve: Test-time Learning on Open Problems

    Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, et al. Thetaevolve: Test-time learning on open problems.arXiv preprint arXiv:2511.23473, 2025

  14. [14]

    Learning to Discover at Test Time

    Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, et al. Learning to discover at test time.arXiv preprint arXiv:2601.16175, 2026

  15. [15]

    OpenEvolve: An open-source evolutionary coding agent

    Asankhaya Sharma. OpenEvolve: An open-source evolutionary coding agent. https://github.com/ algorithmicsuperintelligence/openevolve, 2025. GitHub repository

  16. [16]

    Qwen3.5: Acceleratingproductivitywithnativemultimodalagents, February2026

    QwenTeam. Qwen3.5: Acceleratingproductivitywithnativemultimodalagents, February2026. URL https://qwen.ai/blog?id=qwen3.5

  17. [17]

    Ale-bench: A benchmark for long-horizon objective-driven algorithm engineering.arXiv preprint arXiv:2506.09050, 2025

    Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. Ale-bench: A benchmark for long-horizon objective-driven algorithm engineering.arXiv preprint arXiv:2506.09050, 2025. 13 Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks

  18. [18]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    LakshyaAAgrawal, ShangyinTan, DilaraSoylu, NoahZiems, RishiKhare, KristaOpsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457, 2025

  19. [19]

    Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

    Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune. Darwin godel machine: Open-ended evolution of self-improving agents.arXiv preprint arXiv:2505.22954, 2025

  20. [20]

    Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, and Tatiana Shavrina

    Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, and Tatiana Shavrina. Hyperagents.arXiv preprint arXiv:2603.19461, 2026

  21. [21]

    CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

    Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, et al. Coral: Towards autonomous multi-agent evolution for open-ended discovery.arXiv preprint arXiv:2604.01658, 2026

  22. [22]

    Frontiercs: Evolving challenges for evolving intelligence

    Qiuyang Mang, Wenhao Chai, Zhifei Li, Huanzhi Mao, Shang Zhou, Alexander Du, Hanchen Li, Shu Liu, Edwin Chen, Yichuan Wang, et al. Frontiercs: Evolving challenges for evolving intelligence. arXiv preprint arXiv:2512.15699, 2025

  23. [23]

    Algotune: Can language models speed up general-purpose numerical programs?arXiv preprint arXiv:2507.15887, 2025

    Ori Press, Brandon Amos, Haoyu Zhao, Yikai Wu, Samuel K Ainsworth, Dominik Krupke, Patrick Kidger, Touqir Sajed, Bartolomeo Stellato, Jisun Park, et al. Algotune: Can language models speed up general-purpose numerical programs?arXiv preprint arXiv:2507.15887, 2025

  24. [24]

    GPU MODE

    GPU MODE. GPU MODE. https://www.gpumode.com/home, 2026. Accessed: 2026-05-03

  25. [25]

    Malte D Luecken, Scott Gigante, Daniel B Burkhardt, Robrecht Cannoodt, Daniel C Strobl, Nikolay S Markov,LukeZappia,GiovanniPalla,WesleyLewis,DanielDimitrov,etal.Definingandbenchmarking open problems in single-cell analysis.Nature Biotechnology, 43(7):1035–1040, 2025

  26. [26]

    Semi-autonomous mathematics discovery with gemini: A case study on the erd\h{o}s problems.arXiv preprint arXiv:2601.22401, 2026

    Tony Feng, Trieu Trinh, Garrett Bingham, Jiwon Kang, Shengtong Zhang, Sang-hyun Kim, Kevin Barreto, Carl Schildkraut, Junehyuk Jung, Jaehyeon Seo, et al. Semi-autonomous mathematics discovery with gemini: A case study on the erd\h{o}s problems.arXiv preprint arXiv:2601.22401, 2026

  27. [27]

    Erdos problems

    Thomas Bloom. Erdos problems. https://www.erdosproblems.com/, 2026. Accessed: 2026-05-03

  28. [28]

    KTO: Model Alignment as Prospect Theoretic Optimization

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306, 2024

  29. [29]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  30. [30]

    Llamafactory: Unified efficient fine-tuning of 100+ language models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. Llamafactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 3: system demonstrations), pages 400–410, 2024

  31. [31]

    Dimakis, Matei Zaharia, and Ion Stoica

    Shu Liu, Mert Cemri, Shubham Agarwal, Alexander Krentsel, Ashwin Naren, Qiuyang Mang, Zhifei Li, Akshat Gupta, Monishwaran Maheswaran, Audrey Cheng, Melissa Pan, Ethan Boneh, Kannan Ramchandran, Koushik Sen, Alexandros G. Dimakis, Matei Zaharia, and Ion Stoica. Skydiscover: A flexibleframeworkforai-drivenscientificandalgorithmicdiscovery,2026. URLhttps://...

  32. [32]

    Evaluation-driven Scaling for Scientific Discovery

    Haotian Ye, Haowei Lin, Jingyi Tang, Yizhen Luo, Caiyin Yang, Chang Su, Rahul Thapa, Rui Yang, Ruihua Liu, Zeyu Li, et al. Evaluation-driven scaling for scientific discovery.arXiv preprint arXiv:2604.19341, 2026

  33. [33]

    Kernelevolve: Scaling agentic kernel coding for heterogeneous ai accelerators at meta.arXiv preprint arXiv:2512.23236, 2025

    Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, et al. Kernelevolve: Scaling agentic kernel coding for heterogeneous ai accelerators at meta.arXiv preprint arXiv:2512.23236, 2025

  34. [34]

    Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

    He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai, Zixian Huang, Sheng Yuan, Qinxiu Cheng, Xinchen Xie, Yicheng Chen, et al. Kernel-smith: A unified recipe for evolutionary kernel optimization. arXiv preprint arXiv:2603.28342, 2026. 14 Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks

  35. [35]

    Meta-Harness: End-to-End Optimization of Model Harnesses

    Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses.arXiv preprint arXiv:2603.28052, 2026

  36. [36]

    Gso: Challenging software optimization tasks for evaluating swe-agents.arXiv preprint arXiv:2505.23671, 2025

    Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, and Ion Stoica. Gso: Challenging software optimization tasks for evaluating swe-agents.arXiv preprint arXiv:2505.23671, 2025

  37. [37]

    Autolab: Can models begin to participate in the loops that drive scientific and engineering progress?, 2026

    AutoLab Team. Autolab: Can models begin to participate in the loops that drive scientific and engineering progress?, 2026. URL https://github.com/autolabhq/autolab

  38. [38]

    Can language models discover scaling laws?arXiv preprint arXiv:2507.21184, 2025

    HaoweiLin, HaotianYe, WenzhengFeng, QuzheHuang, YujunLi, HubertLim, ZhengruiLi, Xiangyu Wang, Jianzhu Ma, Yitao Liang, et al. Can language models discover scaling laws?arXiv preprint arXiv:2507.21184, 2025

  39. [39]

    Theflancollection: Designingdataandmethodsforeffectiveinstruction tuning

    Shayne Longpre, LeHou, TuVu, AlbertWebson, HyungWonChung, YiTay, DennyZhou, Quoc VLe, BarretZoph,JasonWei,etal. Theflancollection: Designingdataandmethodsforeffectiveinstruction tuning. InInternational conference on machine learning, pages 22631–22648. PMLR, 2023

  40. [40]

    ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

    Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, et al. Scienceboard: Evaluating multimodal autonomous agents in realistic scientific workflows.arXiv preprint arXiv:2505.19897, 2025

  41. [41]

    Orion: Towards lab automation with computer-using agents.bioRxiv, pages 2026–06, 2026

    Chang Ma, Linh Trinh, Matt Bucci, Aviv Regev, and Hanchen Wang. Orion: Towards lab automation with computer-using agents.bioRxiv, pages 2026–06, 2026

  42. [42]

    Collavo: Crayonlargelanguage andvisionmodel

    Byung-KwanLee,BeomchanPark,ChaeWonKim,andYongManRo. Collavo: Crayonlargelanguage andvisionmodel. InFindingsoftheAssociationforComputationalLinguistics: ACL2024,pages1121–1138, 2024

  43. [43]

    Moai: Mixtureofallintelligence for large language and vision models

    Byung-KwanLee,BeomchanPark,ChaeWonKim,andYongManRo. Moai: Mixtureofallintelligence for large language and vision models. InEuropean Conference on Computer Vision, pages 273–302. Springer, 2024

  44. [44]

    Meteor: Mamba-basedtraversal of rationale for large language and vision models.Advances in Neural Information Processing Systems, 37:40278–40315, 2024

    Byung-KwanLee,ChaeWonKim,BeomchanPark,andYongManRo. Meteor: Mamba-basedtraversal of rationale for large language and vision models.Advances in Neural Information Processing Systems, 37:40278–40315, 2024

  45. [45]

    Phantom of latent for large language and vision models.arXiv preprint arXiv:2409.14713, 2024

    Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, and Yong Man Ro. Phantom of latent for large language and vision models.arXiv preprint arXiv:2409.14713, 2024

  46. [46]

    Trol: Traversal of layers for large language and vision models.arXiv preprint arXiv:2406.12246, 2024

    Byung-KwanLee,SangyunChung,ChaeWonKim,BeomchanPark,andYongManRo. Trol: Traversal of layers for large language and vision models.arXiv preprint arXiv:2406.12246, 2024

  47. [47]

    GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

    Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, and Yueh-Hua Wu. Genrecal: Generation after recalibration from large to small vision-language models.arXiv preprint arXiv:2506.15681, 2025

  48. [48]

    Vlsi: Verbalized layers-to-interactions from large to small vision language models

    Byung-Kwan Lee, Ryo Hachiuma, Yu-Chiang Frank Wang, Yong Man Ro, and Yueh-Hua Wu. Vlsi: Verbalized layers-to-interactions from large to small vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 29545–29557, 2025

  49. [49]

    Unified reinforce- ment and imitation learning for vision-language models.Advances in Neural Information Processing Systems, 38:156508–156534, 2026

    Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Frank Wang, and Yueh-Hua Wu. Unified reinforce- ment and imitation learning for vision-language models.Advances in Neural Information Processing Systems, 38:156508–156534, 2026

  50. [50]

    Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

    Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov, Yu-Chiang Frank Wang, and Byung-Kwan Lee. Agent explorative policy optimization for multimodal agentic reasoning.arXiv preprint arXiv:2605.28774, 2026

  51. [51]

    Masking teacher and reinforcing student for distilling vision-language models

    Byung-Kwan Lee, Yu-Chiang Frank Wang, and Ryo Hachiuma. Masking teacher and reinforcing student for distilling vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10126–10141, 2026. 15 Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks

  52. [52]

    Recursive think-answer process for llms and vlms

    Byung-Kwan Lee, Youngchae Chee, and Yong Man Ro. Recursive think-answer process for llms and vlms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9608–9621, 2026

  53. [53]

    SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

    Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, et al. Spatialclaw: Rethinking action interface for agentic spatial reasoning.arXiv preprint arXiv:2606.13673, 2026

  54. [54]

    Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

    Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, and Jeany Son. Hide to see: Reasoning-prefix masking for visual-anchored thinking in vlm distillation.arXiv preprint arXiv:2605.11651, 2026

  55. [55]

    Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

    Jiwan Kim, Kibum Kim, Wonjoong Kim, Byung-Kwan Lee, and Chanyoung Park. Why and when visual token pruning fails? a study on relevant visual information shift in mllms decoding.arXiv preprint arXiv:2604.12358, 2026

  56. [56]

    Dialogcc: An automated pipeline for creating high-quality multi-modal dialogue dataset

    Young-Jun Lee, Byungsoo Ko, Han-Gyu Kim, Jonghwan Hyeon, and Ho-Jin Choi. Dialogcc: An automated pipeline for creating high-quality multi-modal dialogue dataset. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1938–1963, 2024

  57. [57]

    Stark: Social long-term multi-modal conversation with persona commonsense knowledge

    Young-Jun Lee, Dokyong Lee, Junyoung Youn, Kyeong-Jin Oh, Byungsoo Ko, Jonghwan Hyeon, and Ho-Jin Choi. Stark: Social long-term multi-modal conversation with persona commonsense knowledge. InFindingsoftheAssociationforComputationalLinguistics: EMNLP2024,pages12137–12162, 2024

  58. [58]

    Thanos: Enhancing conversationalagentswithskill-of-mind-infusedlargelanguagemodel.arXivpreprintarXiv:2411.04496, 2024

    Young-Jun Lee, Dokyong Lee, Junyoung Youn, Kyeongjin Oh, and Ho-Jin Choi. Thanos: Enhancing conversationalagentswithskill-of-mind-infusedlargelanguagemodel.arXivpreprintarXiv:2411.04496, 2024

  59. [59]

    Large language models can share images, too! InFindings of the Association for Computational Linguistics: ACL 2024, pages 692–713, 2024

    Young-Jun Lee, Dokyong Lee, Joo-won Sung, Jonghwan Hyeon, and Ho-Jin Choi. Large language models can share images, too! InFindings of the Association for Computational Linguistics: ACL 2024, pages 692–713, 2024

  60. [60]

    Multiverse: A multi-turn conversation benchmarkforevaluatinglargevisionandlanguagemodels

    Young-Jun Lee, Byung-Kwan Lee, Jianshu Zhang, Yechan Hwang, Byungsoo Ko, Han-Gyu Kim, Dongyu Yao, Xuankun Rong, Eojin Joo, Seung-Ho Han, et al. Multiverse: A multi-turn conversation benchmarkforevaluatinglargevisionandlanguagemodels. InProceedingsoftheIEEE/CVFInternational Conference on Computer Vision, pages 708–719, 2025

  61. [61]

    Refinebench: Evaluating refinement capability of language models via checklists.arXiv preprint arXiv:2511.22173, 2025

    Young-Jun Lee, Seungone Kim, Byung-Kwan Lee, Minkyeong Moon, Yechan Hwang, Jong Myoung Kim, Graham Neubig, Sean Welleck, and Ho-Jin Choi. Refinebench: Evaluating refinement capability of language models via checklists.arXiv preprint arXiv:2511.22173, 2025

  62. [62]

    On the origin of species

    Charles Darwin. On the origin of species. InScientific Methodology in Nineteenth Century Britain, pages 133–181. Routledge, 2025

  63. [63]

    Unpredictable evolution in a 30-year study of darwin’s finches

    Peter R Grant and B Rosemary Grant. Unpredictable evolution in a 30-year study of darwin’s finches. science, 296(5568):707–711, 2002

  64. [64]

    C1 mismatch: reported X, computed Y

    Lev Vygotsky et al.Interaction between learning and development. Linköpings universitet Linköping, Sweden, 2011. 16 Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks A. Broader Impacts EFT democratizes LLM-driven discovery by transferring optimization capabilities from expensive proprietary models to small open-weight models, reduc...