OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation

Haochen Yang; Hong Qian; Ke Zhao; Mengyuan Ma; Xiangfeng Wang; Xingyu Lu

arxiv: 2605.29829 · v1 · pith:B7D5GI6Onew · submitted 2026-05-28 · 💻 cs.AI · cs.LG

OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation

Haochen Yang , Ke Zhao , Mengyuan Ma , Xingyu Lu , Xiangfeng Wang , Hong Qian This is my paper

Pith reviewed 2026-06-29 07:04 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords optimization modelinglarge language modelsskill learningproblem archetypescluster-based distillationgeneralizationnatural language to optimizationautomated solving

0 comments

The pith

OptSkills clusters optimization problems by archetypes and distills modeling trajectories into reusable skills to improve generalization across problem types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OptSkills as a system that uses large language models to formulate and solve optimization problems from natural language descriptions. Current approaches remain limited because they react to superficial wording changes, reuse experience only at the level of individual cases, and fail to adapt when problem types shift or appear for the first time. OptSkills instead groups problems according to their underlying archetypes, explores multiple modeling and solver choices inside each group, and converts successful sequences into reusable workflow-level skills. These skills are then refined or expanded when new trajectories become available. The resulting system reports a micro-averaged accuracy of 68.27 percent across diverse benchmarks and 26.91 percent on the challenging MIPLIB-NL set.

Core claim

OptSkills is an archetype-centric skill learning and reasoning agent system. It clusters problems by underlying archetypes rather than surface narratives, explores diverse modeling paradigms and solver configurations within each cluster, distills successful trajectories into reusable workflow-level skills, and refines or expands the skill library for out-of-distribution cases. This produces state-of-the-art micro-averaged accuracy of 68.27 percent on datasets with diverse problem types, 26.91 percent on MIPLIB-NL (outperforming DeepSeek-V3.2-Thinking by 4.53 percent), and 72.79 percent on the OOD NLCO benchmark after skill learning on Nano-CO.

What carries the argument

Archetype-centric clustering of problems followed by cluster-based distillation of successful modeling and solver trajectories into reusable workflow-level skills.

If this is right

Skills distilled from one cluster can be applied or refined when new problem instances appear inside or outside the original distribution.
Exploration of multiple modeling paradigms and solver configurations inside each archetype cluster improves in-distribution accuracy.
The skill library can be expanded incrementally without retraining the entire system when new trajectories are obtained.
Performance on large-scale high-dimensional benchmarks such as MIPLIB-NL increases because archetype-level patterns reduce sensitivity to narrative variation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same clustering-plus-distillation pattern could be tested on other structured reasoning domains where surface descriptions vary but core structures recur.
If archetype definitions are made explicit, the approach might allow systematic comparison of skill libraries across independent research groups.
The reported gains on MIPLIB-NL suggest the method may be especially useful for problems whose mathematical structure is stable but whose natural-language presentations differ widely.

Load-bearing premise

Clustering problems by underlying archetypes rather than surface narratives produces groups whose modeling and solver trajectories can be distilled into skills that generalize to shifted or emerging problem types.

What would settle it

A controlled experiment in which archetype-based clustering is replaced by surface-narrative clustering or no clustering and the accuracy on the OOD NLCO benchmark after Nano-CO skill learning falls below 72.79 percent or matches the non-archetype baseline.

Figures

Figures reproduced from arXiv: 2605.29829 by Haochen Yang, Hong Qian, Ke Zhao, Mengyuan Ma, Xiangfeng Wang, Xingyu Lu.

**Figure 1.** Figure 1: Embedding-space comparison on NANO-CO. Each point represents a problem instance, colored by manually verified problem-archetype labels. Top panels show t-SNE projections of raw problem-text embeddings and archetype embeddings; bottom panels report nearest-neighbor retrieval results in Hit@1 and MAP@5. Details are provided in Appendix F.1. problems has necessitated reliance on domain experts to formulate… view at source ↗

**Figure 2.** Figure 2: An overview of OPTSKILLS. OPTSKILLS consists of three phases. Phase I: (1) Problem Archetype Representation, (2) Solver Portfolio Rollout, (3) Trajectory Analysis, and (4) Cluster-Based Skill Distillation build the initial skill library M0. Phase II: for a new problem, skill selector routes it to (5) Skill Refinement or (6) Skill Expansion, yielding the learned skill library M∗ . Phase III: the selected sk… view at source ↗

**Figure 3.** Figure 3: Taxonomy of the synthetic data. The dataset contains 49 combinatorial optimization problem types grouped into seven families. Each leaf node represents one problem type. benchmark [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: NLCO task-category coverage by NANOCO exact or close counterparts. Regarding instance scale, the number of constraints in NANO-CO ranges from 1 to 5, with an average of 2.8, and the number of decision variables ranges from 9 to 310, with an average of 48. B Experiment Setup B.1 Configuration of OPTSKILLS Throughout all experiments, DeepSeek-V3.2 (nonthinking mode) serves as the uniform underlying LLM for… view at source ↗

**Figure 5.** Figure 5: An example of the optimization skill. E Baselines E.1 Reproduction of Chain-of-Experts Chain-of-Experts (CoE) is a multi-agent framework for solving operations research problems, where a conductor coordinates multiple specialized experts for terminology interpretation, mathematical modeling, programming, and code review. It constructs solutions through a forward expert-collaboration process and further re… view at source ↗

**Figure 6.** Figure 6: visualizes the archetype-level coverage of the learned skill library. The skill library learned from OPTMATH-TRAIN contains 56 skills and already spans 11 optimization categories, including Network Flow, Assignment & Allocation, Location & Covering, Quadratic & Pairwise Selection, Production & Inventory, Scheduling, Routing, Knap56 skills 17 10 10 4 4 3 3 Network Flow (17) Assignment & Allocation (10) Lo… view at source ↗

**Figure 8.** Figure 8: reports ARI and Pairwise F1 under different hyperparameter settings. Very small ϵ values lead to over-fragmented clusters and near-zero agreement with the problem-type labels. Performance improves in the moderate range of ϵ, with 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 DBSCAN eps 1 2 3 min_samples 0.00 0.00 0.01 0.07 0.25 0.33 0.36 0.26 0.21 0.22 0.23 0.24 0.23 0.04 0.… view at source ↗

**Figure 7.** Figure 7: The solving accuracy of OPTSKILLS (based on DeepSeek-V3.2) on OptiBench with different numbers of skills in the skill library. F.4 The Usage Statistics of New and Old Skills on NLCO Benchmark On the NANO-CO dataset, OPTSKILLS expands the initial set of 46 existing skills by acquiring 47 new ones [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 9.** Figure 9: The utilization of newly acquired and existing [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 10.** Figure 10: Case study of skill-guided optimization modeling. A comparison of the skill-guided and noskill trajectories on the same optimization instance. H Prompts We provide the detailed prompts below. H.1 Prompt of LLM baseline PROMPT You are a practical function-calling optimization research agent. Follow the 3-stage guidance and rules below to complete the task: ### Modeling and solving the problem step by step… view at source ↗

read the original abstract

Leveraging Large Language Models (LLMs) to automatically formulate and solve optimization problems from natural language has emerged as an efficient paradigm for automated optimization. However, existing methods still exhibit limited generalization: they are sensitive to superficial narrative variations, reuse experience mainly at the case level, and struggle to adapt to shifted or emerging problem types. We propose OptSkills, an archetype-centric skill learning and reasoning agent system for optimization modeling and solving. To improve robust generalization, our system clusters problems by their underlying archetypes rather than surface narratives. To improve in-distribution generalization, it explores diverse modeling paradigms and solver configurations within each cluster, then distills successful trajectories into reusable workflow-level skills. To improve out-of-distribution generalization, it refines existing skills or expands the skill library using newly obtained trajectories. Our system achieves a state-of-the-art micro-averaged accuracy of 68.27% on datasets encompassing diverse problem types and scenarios. In addition, on MIPLIB-NL, a highly challenging large-scale and high-dimensional benchmark, it achieves 26.91% accuracy, outperforming DeepSeek-V3.2-Thinking by 4.53%. After skill learning on Nano-CO, it reaches 72.79% on the OOD NLCO benchmark. Code and skills are available at https://github.com/fujiwaranoM0kou/OptSkills.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The abstract sketches an archetype clustering plus distillation pipeline for LLM optimization skills that targets generalization gaps, but without methods or results details the performance claims cannot be assessed.

read the letter

The main point to know is that OptSkills clusters problems by underlying archetypes, explores modeling options inside each cluster, distills the successful trajectories into reusable workflow skills, and updates the library for new problem types. This is positioned as an advance over case-level reuse that is sensitive to narrative changes.

The design is straightforward and directly responds to the stated limitations in current LLM optimization work. The reported numbers (68.27% micro-averaged, 26.91% on MIPLIB-NL beating one baseline by 4.53%, and 72.79% on an OOD set after Nano-CO training) are presented as evidence that the approach delivers both in-distribution and out-of-distribution gains.

The obvious limitation is that only the abstract is visible. There is no information on how archetypes are defined or clustered, what the distillation step actually does, the experimental protocol, baselines, dataset statistics, or any ablations. That absence makes it impossible to judge whether the accuracy figures reflect the method or other factors. The motivation is reasonable and the high-level pipeline is coherent, but the evidence is not yet inspectable.

This is aimed at people building LLM agents for operations research modeling. A reader already working in that niche could extract the clustering-plus-distillation idea for their own experiments.

It would be worth sending the full manuscript to referees if the complete version supplies the missing experimental controls and reproducible details, because the targeted problem is real even if the current write-up is too thin to evaluate.

Referee Report

2 major / 1 minor

Summary. The paper proposes OptSkills, an archetype-centric LLM-based agent system for optimization modeling and solving. Problems are clustered by underlying archetypes (rather than surface narratives); within each cluster the system explores modeling paradigms and solver configurations, distills successful trajectories into reusable workflow-level skills, and refines or expands the skill library for OOD cases. It reports a micro-averaged accuracy of 68.27% on diverse datasets, 26.91% on MIPLIB-NL (outperforming DeepSeek-V3.2-Thinking by 4.53%), and 72.79% on the OOD NLCO benchmark after skill learning on Nano-CO. Code and skills are released publicly.

Significance. If the reported accuracies and the archetype-clustering-plus-distillation pipeline are shown to be robust, the work could advance LLM-driven optimization by shifting from case-level reuse to reusable, archetype-derived skills that improve both in-distribution and out-of-distribution generalization. The public release of code and skills is a clear strength that supports reproducibility and community follow-up.

major comments (2)

[Abstract] Abstract: the manuscript supplies only an abstract containing accuracy figures and benchmark names but no experimental protocol, baseline details, dataset statistics, number of runs, error bars, or ablation results. Without these it is impossible to determine whether the reported numbers support the generalization claims.
[Abstract] Abstract: the central design claim—that clustering by archetypes (rather than surface narratives) yields clusters whose internal modeling and solver trajectories can be distilled into generalizable skills—is stated as motivation but is unsupported by any description of archetype definition, clustering algorithm, or distillation mechanics, leaving the core technical contribution unevaluable.

minor comments (1)

[Abstract] Abstract: the GitHub link for code and skills is a positive reproducibility signal but the abstract would benefit from a brief statement of the number of archetypes or size of the skill library to give readers context for the reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments. The concerns focus on the abstract's brevity, which is a valid observation given space limits. The full manuscript provides the requested details in the body, but we agree the abstract should be expanded for standalone evaluability. We respond point-by-point below.

read point-by-point responses

Referee: [Abstract] Abstract: the manuscript supplies only an abstract containing accuracy figures and benchmark names but no experimental protocol, baseline details, dataset statistics, number of runs, error bars, or ablation results. Without these it is impossible to determine whether the reported numbers support the generalization claims.

Authors: We agree the abstract is too concise on these points. The full paper (Section 4) specifies the evaluation protocol (including prompt templates and solver interfaces), baselines (DeepSeek-V3.2-Thinking and others), dataset statistics (problem counts and types across MIPLIB-NL, NLCO, etc.), multiple runs, and ablation results on clustering and distillation. We will revise the abstract to add one sentence summarizing the protocol, number of runs, and key ablation findings. revision: yes
Referee: [Abstract] Abstract: the central design claim—that clustering by archetypes (rather than surface narratives) yields clusters whose internal modeling and solver trajectories can be distilled into generalizable skills—is stated as motivation but is unsupported by any description of archetype definition, clustering algorithm, or distillation mechanics, leaving the core technical contribution unevaluable.

Authors: We agree the abstract states the claim at a high level without mechanics. The manuscript body (Sections 3.1–3.3) defines archetypes via structural features (linearity, constraint types, objective form), uses embedding-based clustering (with the specific algorithm and hyperparameters), and details distillation (trajectory filtering, workflow extraction via LLM summarization, and skill library update rules). We will revise the abstract to briefly name these components (e.g., 'archetype clustering via embeddings followed by workflow skill distillation'). revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims on external benchmarks

full rationale

The paper describes an archetype-clustering and skill-distillation pipeline for optimization modeling, then reports concrete accuracy figures (68.27% micro-averaged, 26.91% on MIPLIB-NL, 72.79% on OOD NLCO) against named external benchmarks. No equations, fitted parameters, or derivation steps are supplied that would make any reported performance equivalent to its own inputs by construction. The central claims rest on measured outcomes rather than self-referential definitions or load-bearing self-citations. This is the normal case of an empirical systems paper whose results are falsifiable against independent test sets.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the approach is described at the level of standard clustering and trajectory distillation applied to LLM outputs.

pith-pipeline@v0.9.1-grok · 5785 in / 1134 out tokens · 29764 ms · 2026-06-29T07:04:24.555125+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 5 canonical work pages · 4 internal anchors

[1]

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

OptiMUS: Scalable optimization modeling with (MI)LP solvers and large language models. In Advances in Forty-first International Conference on Machine Learning, Vienna, Austria. Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. 2026. EvoSkill: Auto- mated skill discovery for multi-agent systems. arXiv preprint arXiv:2603.02766. Y...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

In Forty-third International Conference on Machine Learning

Constructing industrial-scale optimization modeling benchmark. In Forty-third International Conference on Machine Learning. Kuo Liang, Yuhang Lu, Jianming Mao, Shuyi Sun, Chunwei Yang, Congcong Zeng, Xiao Jin, Hanzhang Qin, Ruihao Zhu, and Chung-Piaw Teo. 2026. Large- scale optimization model auto-formulation: Harness- ing llm flexibility via structured w...

work page arXiv 2026
[3]

Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

Trace2Skill: Distill trajectory-local lessons into transferable agent skills. arXiv preprint arXiv:2603.25158. OpenAI. 2026. Introducing gpt-5.4. Weichun Shi, Minghao Liu, Wanting Zhang, Langchen Shi, Fuqi Jia, Feifei Ma, and Jian Zhang

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

MIRROR: A Multi-Agent Framework with Iterative Adaptive Revision and Hierarchical Retrieval for Optimization Modeling in Operations Research

ConstraintLLM: A neuro-symbolic frame- work for industrial-level constraint programming. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15999–16019, Suzhou, China. Association for Com- putational Linguistics. Yifan Shi, Jialong Shi, Jiayi Wang, Ye Fan, and Jiany- ong Sun. 2026. MIRROR: A multi-agent framew...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

MemSkill: Learning and evolving mem- ory skills for self-evolving agents. arXiv preprint arXiv:2602.02474. Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Meng- long Zhang, Yihang Chen, Jinsong Li, Runyu Yang, Qiangbin Liu, Xinlei Yu, Jianmin Zhou, Na Wang, Chunyang Sun, and Jun Wang. 2026. Memento- Skills: Let agents...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

Preserve directed pair set P = [(i,j) for i in N for j in N if i != j] maximize sum(benefit[(i,j)] * y[(i,j)] for (i,j) in P)
[7]

Verify against the same objective Selected nodes: [1, 2, 3]Total benefit: 127(1,2)=24, (2,1)=22(1,3)=17, (3,1)=21 (2,3)=18, (3,2)=25 Final answer: 127
[8]

63.5counts each unordered pair twice

Correct solution, then self-doubt Selected nodes: [1, 2, 3] Objective value: 127 The agent treats this as double-counting: Total benefit ... 63.5counts each unordered pair twice
[9]

The skill stabilizes the semantic choice P={(i,j):i!=j}

Optimizes a different problem benefit = {(0,1):16, ..., (3,4):17}# only define for i < jSelected nodes: [0, 2, 4] Objective value: 65 Final answer: 65 INSIGHT Skill effect. The skill stabilizes the semantic choice P={(i,j):i!=j}. The correct set [1,2,3] yields 24+22+17+21+18+25=127. Failure mode. The no-skill run finds 127 but changes the task to unordere...
[11]

RESULT: {{ objective_value}}

**Solving Stage**: Generate the solver- based Python code and wrapped in < solver_code> tags. - you may use the python-based modeling language as follows: ortools , pyomo. - you may use the solver as follows: gurobi, clp, glpk, ipopt, highs, cbc, scip, mindtpy. You need to select an appropriate solver based on the type of the problem (linear or nonlinear,...
[13]

keywords

minimally edit the problem text into a scenario-neutral version. Task A: Keyword extraction - Extract high-signal structural keywords and place them into: -`variable` -`constraint` -`objective` - The goal is not only mathematical correctness, but also discriminative power: the extracted keywords should help distinguish different canonical optimization cla...
[14]

extract scenario-agnostic optimization keywords into three slots
[15]

keywords

minimally edit the problem text into a scenario-neutral version. Task A: Keyword extraction - Extract high-signal structural keywords and place them into: -`variable` -`constraint` -`objective` - The goal is not only mathematical correctness, but also discriminative power: the extracted keywords should help distinguish different canonical optimization cla...
[16]

**Modeling Stage**: First output a valid formulation in JSON format wrapped in <formulation> tags
[17]

Ensure tool arguments are valid JSON

**Solving Stage**: Call the`run_code` tool to execute solver code. Ensure tool arguments are valid JSON
[18]

### Core Rules - At most one tool call per turn

**Final Stage**: Output the final numeric answer wrapped in <answer> tags. ### Core Rules - At most one tool call per turn. - Solver code must print`RESULT:<number >`. - Never assume tool success without checking feedback. - Select exactly one workflow from the skill (parallel alternatives, not sequential steps). - Before any tool call, output: < formulat...
[19]

Infer problem family from objective/ constraint keywords
[20]

Prefer robust and standard backends when uncertain
[21]

Keep diversity across frameworks/ backends when possible
[22]

selected

Do not invent solver IDs; choose only from catalog. Input: <problem_description> {problem_description} </problem_description> <keywords> {keywords} </keywords> <top_k> {top_k} </top_k> <solver_catalog_json> {solver_catalog} </solver_catalog_json> Return JSON only: {{ "selected": [ {{ "solver_id": "string", "reason": "short reason" }} ] }} H.5 Skill Prompt...
[23]

Distill lessons, do not narrate trajectory
[24]

Keep workflow alternatives parallel ( not sequential)
[25]

Use solver-aware modeling templates
[26]

Make result parsing and failure handling explicit
[27]

Prefer concise technical clarity over decorative prose
[28]

Suitable code usage can be placed in every step with explicit comments
[29]

sets": [],

Keep it general,Use placeholders instead of specific values. The skill should apply to similar problems, not just this one, not just this scenario. The Skill Name, Skill Description, Workflow Names, Step Names should all be generalized and not scenario-specific. Hard format contract: - Keep this section order exactly: frontmatter -> # Workflow 1 (...) -> ...
[30]

Preserve the outer document skeleton exactly: - keep the full YAML frontmatter, - keep both opening and closing`---`, - keep all existing YAML keys, - keep key order unchanged, - keep the top-level section structure and heading order unchanged unless explicitly impossible to maintain validity
[31]

Prefer local edits over full rewrites

Within existing sections, you may rewrite, compress, deduplicate, clarify, and generalize content. Prefer local edits over full rewrites
[32]

Remove only weak, vague, repetitive, or redundant wording

Preserve all useful existing workflows. Remove only weak, vague, repetitive, or redundant wording
[33]

You may clarify or tighten them if supported by the trajectory

Do not remove explicit execution checks, solver/status verification steps, or parseable output contracts. You may clarify or tighten them if supported by the trajectory
[34]

Exclude scenario-specific story details from the analysis, such as names, one-time events, or incidental numeric values, unless they represent a reusable operational threshold or rule
[35]

Convert hardcoded task-specific values into placeholders such as`[ TARGET]`,`[QUERY]`,`[TIME_LIMIT]`, `[CAPACITY]`, or similar appropriate abstractions

Replace overly specific examples with reusable patterns when possible. Convert hardcoded task-specific values into placeholders such as`[ TARGET]`,`[QUERY]`,`[TIME_LIMIT]`, `[CAPACITY]`, or similar appropriate abstractions
[36]

Preserve skill-discriminative information. Do not remove or blur details that define this SKILL's distinctive optimization structure, including when relevant: - problem class, - decision variable pattern, - objective type, - key constraint structure, - solver or modeling assumptions
[37]

Consolidate overlapping explanations into clearer single statements

Merge duplicate or near-duplicate content only when meaning is preserved. Consolidate overlapping explanations into clearer single statements
[38]

For`positive`labels: - strengthen successful SOPs supported by the analysis, - clarify reusable step order, - make successful checks, thresholds, or decision criteria more explicit, - do not add unsupported new methods or techniques
[39]

For`negative`labels: - prioritize prerequisite checks, early failure detection, recovery steps, and fallback guidance, - do not ban a method entirely unless the analysis clearly shows a general failure mode rather than a one-off execution issue, - do not overwrite or weaken proven successful workflows unless the analysis reveals a general flaw in them
[40]

Do not introduce new solver strategies, modeling tricks, tools, or procedures unless they are clearly supported by either the existing SKILL or the trajectory
[41]

Remove verbosity that does not improve execution quality

Keep the content concise, actionable, and reusable. Remove verbosity that does not improve execution quality
[42]

Maintain consistent formatting and scannability within the existing structure: - clear steps, - consistent bullet style, - concise wording, - no unnecessary prose expansion
[43]

Perform a consistency audit across prose, formulation templates, and code examples: - ensure sign conventions, index ordering, objective sense, constraint direction, and variable semantics are mutually consistent, - if a contradiction exists, prefer the smallest edit that restores internal consistency, - do not preserve contradictory formulations merely t...

[1] [1]

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

OptiMUS: Scalable optimization modeling with (MI)LP solvers and large language models. In Advances in Forty-first International Conference on Machine Learning, Vienna, Austria. Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. 2026. EvoSkill: Auto- mated skill discovery for multi-agent systems. arXiv preprint arXiv:2603.02766. Y...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

In Forty-third International Conference on Machine Learning

Constructing industrial-scale optimization modeling benchmark. In Forty-third International Conference on Machine Learning. Kuo Liang, Yuhang Lu, Jianming Mao, Shuyi Sun, Chunwei Yang, Congcong Zeng, Xiao Jin, Hanzhang Qin, Ruihao Zhu, and Chung-Piaw Teo. 2026. Large- scale optimization model auto-formulation: Harness- ing llm flexibility via structured w...

work page arXiv 2026

[3] [3]

Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

Trace2Skill: Distill trajectory-local lessons into transferable agent skills. arXiv preprint arXiv:2603.25158. OpenAI. 2026. Introducing gpt-5.4. Weichun Shi, Minghao Liu, Wanting Zhang, Langchen Shi, Fuqi Jia, Feifei Ma, and Jian Zhang

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

MIRROR: A Multi-Agent Framework with Iterative Adaptive Revision and Hierarchical Retrieval for Optimization Modeling in Operations Research

ConstraintLLM: A neuro-symbolic frame- work for industrial-level constraint programming. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15999–16019, Suzhou, China. Association for Com- putational Linguistics. Yifan Shi, Jialong Shi, Jiayi Wang, Ye Fan, and Jiany- ong Sun. 2026. MIRROR: A multi-agent framew...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

MemSkill: Learning and evolving mem- ory skills for self-evolving agents. arXiv preprint arXiv:2602.02474. Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Meng- long Zhang, Yihang Chen, Jinsong Li, Runyu Yang, Qiangbin Liu, Xinlei Yu, Jianmin Zhou, Na Wang, Chunyang Sun, and Jun Wang. 2026. Memento- Skills: Let agents...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[6] [6]

Preserve directed pair set P = [(i,j) for i in N for j in N if i != j] maximize sum(benefit[(i,j)] * y[(i,j)] for (i,j) in P)

[7] [7]

Verify against the same objective Selected nodes: [1, 2, 3]Total benefit: 127(1,2)=24, (2,1)=22(1,3)=17, (3,1)=21 (2,3)=18, (3,2)=25 Final answer: 127

[8] [8]

63.5counts each unordered pair twice

Correct solution, then self-doubt Selected nodes: [1, 2, 3] Objective value: 127 The agent treats this as double-counting: Total benefit ... 63.5counts each unordered pair twice

[9] [9]

The skill stabilizes the semantic choice P={(i,j):i!=j}

Optimizes a different problem benefit = {(0,1):16, ..., (3,4):17}# only define for i < jSelected nodes: [0, 2, 4] Objective value: 65 Final answer: 65 INSIGHT Skill effect. The skill stabilizes the semantic choice P={(i,j):i!=j}. The correct set [1,2,3] yields 24+22+17+21+18+25=127. Failure mode. The no-skill run finds 127 but changes the task to unordere...

[10] [11]

RESULT: {{ objective_value}}

**Solving Stage**: Generate the solver- based Python code and wrapped in < solver_code> tags. - you may use the python-based modeling language as follows: ortools , pyomo. - you may use the solver as follows: gurobi, clp, glpk, ipopt, highs, cbc, scip, mindtpy. You need to select an appropriate solver based on the type of the problem (linear or nonlinear,...

[11] [13]

keywords

minimally edit the problem text into a scenario-neutral version. Task A: Keyword extraction - Extract high-signal structural keywords and place them into: -`variable` -`constraint` -`objective` - The goal is not only mathematical correctness, but also discriminative power: the extracted keywords should help distinguish different canonical optimization cla...

[12] [14]

extract scenario-agnostic optimization keywords into three slots

[13] [15]

keywords

minimally edit the problem text into a scenario-neutral version. Task A: Keyword extraction - Extract high-signal structural keywords and place them into: -`variable` -`constraint` -`objective` - The goal is not only mathematical correctness, but also discriminative power: the extracted keywords should help distinguish different canonical optimization cla...

[14] [16]

**Modeling Stage**: First output a valid formulation in JSON format wrapped in <formulation> tags

[15] [17]

Ensure tool arguments are valid JSON

**Solving Stage**: Call the`run_code` tool to execute solver code. Ensure tool arguments are valid JSON

[16] [18]

### Core Rules - At most one tool call per turn

**Final Stage**: Output the final numeric answer wrapped in <answer> tags. ### Core Rules - At most one tool call per turn. - Solver code must print`RESULT:<number >`. - Never assume tool success without checking feedback. - Select exactly one workflow from the skill (parallel alternatives, not sequential steps). - Before any tool call, output: < formulat...

[17] [19]

Infer problem family from objective/ constraint keywords

[18] [20]

Prefer robust and standard backends when uncertain

[19] [21]

Keep diversity across frameworks/ backends when possible

[20] [22]

selected

Do not invent solver IDs; choose only from catalog. Input: <problem_description> {problem_description} </problem_description> <keywords> {keywords} </keywords> <top_k> {top_k} </top_k> <solver_catalog_json> {solver_catalog} </solver_catalog_json> Return JSON only: {{ "selected": [ {{ "solver_id": "string", "reason": "short reason" }} ] }} H.5 Skill Prompt...

[21] [23]

Distill lessons, do not narrate trajectory

[22] [24]

Keep workflow alternatives parallel ( not sequential)

[23] [25]

Use solver-aware modeling templates

[24] [26]

Make result parsing and failure handling explicit

[25] [27]

Prefer concise technical clarity over decorative prose

[26] [28]

Suitable code usage can be placed in every step with explicit comments

[27] [29]

sets": [],

Keep it general,Use placeholders instead of specific values. The skill should apply to similar problems, not just this one, not just this scenario. The Skill Name, Skill Description, Workflow Names, Step Names should all be generalized and not scenario-specific. Hard format contract: - Keep this section order exactly: frontmatter -> # Workflow 1 (...) -> ...

[28] [30]

Preserve the outer document skeleton exactly: - keep the full YAML frontmatter, - keep both opening and closing`---`, - keep all existing YAML keys, - keep key order unchanged, - keep the top-level section structure and heading order unchanged unless explicitly impossible to maintain validity

[29] [31]

Prefer local edits over full rewrites

Within existing sections, you may rewrite, compress, deduplicate, clarify, and generalize content. Prefer local edits over full rewrites

[30] [32]

Remove only weak, vague, repetitive, or redundant wording

Preserve all useful existing workflows. Remove only weak, vague, repetitive, or redundant wording

[31] [33]

You may clarify or tighten them if supported by the trajectory

Do not remove explicit execution checks, solver/status verification steps, or parseable output contracts. You may clarify or tighten them if supported by the trajectory

[32] [34]

Exclude scenario-specific story details from the analysis, such as names, one-time events, or incidental numeric values, unless they represent a reusable operational threshold or rule

[33] [35]

Convert hardcoded task-specific values into placeholders such as`[ TARGET]`,`[QUERY]`,`[TIME_LIMIT]`, `[CAPACITY]`, or similar appropriate abstractions

Replace overly specific examples with reusable patterns when possible. Convert hardcoded task-specific values into placeholders such as`[ TARGET]`,`[QUERY]`,`[TIME_LIMIT]`, `[CAPACITY]`, or similar appropriate abstractions

[34] [36]

Preserve skill-discriminative information. Do not remove or blur details that define this SKILL's distinctive optimization structure, including when relevant: - problem class, - decision variable pattern, - objective type, - key constraint structure, - solver or modeling assumptions

[35] [37]

Consolidate overlapping explanations into clearer single statements

Merge duplicate or near-duplicate content only when meaning is preserved. Consolidate overlapping explanations into clearer single statements

[36] [38]

For`positive`labels: - strengthen successful SOPs supported by the analysis, - clarify reusable step order, - make successful checks, thresholds, or decision criteria more explicit, - do not add unsupported new methods or techniques

[37] [39]

For`negative`labels: - prioritize prerequisite checks, early failure detection, recovery steps, and fallback guidance, - do not ban a method entirely unless the analysis clearly shows a general failure mode rather than a one-off execution issue, - do not overwrite or weaken proven successful workflows unless the analysis reveals a general flaw in them

[38] [40]

Do not introduce new solver strategies, modeling tricks, tools, or procedures unless they are clearly supported by either the existing SKILL or the trajectory

[39] [41]

Remove verbosity that does not improve execution quality

Keep the content concise, actionable, and reusable. Remove verbosity that does not improve execution quality

[40] [42]

Maintain consistent formatting and scannability within the existing structure: - clear steps, - consistent bullet style, - concise wording, - no unnecessary prose expansion

[41] [43]

Perform a consistency audit across prose, formulation templates, and code examples: - ensure sign conventions, index ordering, objective sense, constraint direction, and variable semantics are mutually consistent, - if a contradiction exists, prefer the smallest edit that restores internal consistency, - do not preserve contradictory formulations merely t...