Learning Design Skills as Memory Policies for Agentic Photonic Inverse Design

Shengchao Chen; Sufen Ren; Ting Shu

arxiv: 2605.29421 · v1 · pith:QYZO4N6Cnew · submitted 2026-05-28 · 💻 cs.CL

Pith reviewed 2026-06-29 07:59 UTC · model grok-4.3

classification 💻 cs.CL

keywords: photonic crystal fiber · inverse design · memory policies · reinforcement learning · agent framework · skill bank · expert traces

The pith

SkillPCF formulates photonic crystal fiber inverse design as memory-policy learning to build reusable skills from expert traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper casts PCF inverse design as a memory-policy problem because candidate geometries must meet coupled optical targets under expensive electromagnetic simulations, and existing methods do not accumulate knowledge across trials. It introduces SkillPCF, a closed-loop agent that maintains a physics-guided memory skill bank, selects skills via reinforcement learning, and evolves them through simulator feedback. The approach is supported by a dataset of 479 expert interaction traces spanning dispersion engineering, loss optimization, and multi-objective tasks. Experiments across LLM backbones show improved quality-efficiency trade-offs under fixed simulation budgets. This matters because it shifts the process from repeated one-shot attempts to progressive refinement using stored design policies.
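
A toy sketch may help picture the closed loop described above. Everything in it is an assumption made for illustration: the `Skill` class, the `simulate` stand-in (a cheap quadratic rather than an electromagnetic solver), and the epsilon-greedy bandit update, which is a much simpler substitute for the paper's reinforcement-learned selection. It shows only the shape of the loop: pick a skill from the bank, apply it, spend one simulation call, and let the resulting reward update that skill's score.

```python
import random
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    step: float          # hypothetical: how far the skill nudges the design parameter
    value: float = 0.0   # running estimate of the skill's usefulness
    uses: int = 0


def simulate(x: float, target: float = 3.0) -> float:
    """Stand-in for an expensive solver: returns a loss (lower is better)."""
    return (x - target) ** 2


def run_closed_loop(budget: int, seed: int = 0, eps: float = 0.2):
    """Run a toy design loop that never exceeds `budget` simulation calls."""
    rng = random.Random(seed)
    bank = [Skill("increase", +0.5), Skill("decrease", -0.5), Skill("fine_tune", +0.1)]
    x = 0.0
    best = simulate(x)   # one call is spent on the starting design
    calls = 1
    while calls < budget:
        # Epsilon-greedy selection over the skill bank (illustrative only).
        if rng.random() < eps:
            skill = rng.choice(bank)
        else:
            skill = max(bank, key=lambda s: s.value)
        candidate = x + skill.step
        loss = simulate(candidate)
        calls += 1
        reward = best - loss                  # positive only if the design improved
        skill.uses += 1
        skill.value += (reward - skill.value) / skill.uses  # incremental mean
        if loss < best:                       # keep the candidate only when it helps
            x, best = candidate, loss
    return x, best, calls, bank
```

Calling `run_closed_loop(20)` spends exactly 20 simulated calls, so the fixed-budget constraint is enforced by the loop itself rather than by the caller.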

Core claim

SkillPCF, a closed-loop agent framework that combines a physics-guided memory skill bank, reinforcement-learned skill selection, and simulator-grounded skill evolution, achieves stronger design-quality and efficiency trade-offs under practical simulation budgets when trained on 479 expert interaction traces covering dispersion engineering, loss optimization, and multi-objective design.

What carries the argument

The physics-guided memory skill bank extracted from 479 expert interaction traces, which supplies reusable policies that reinforcement learning selects and evolves inside the closed-loop agent.

If this is right

The memory-skill paradigm enables accumulation of design knowledge rather than restarting each inverse-design task.
Reinforcement-learned selection and simulator-grounded evolution together produce measurable gains in quality-efficiency trade-offs.
The framework operates across multiple LLM backbones while outperforming one-shot parameter recommendation and surrogate-only pipelines.
Closed-loop interaction allows skills to adapt to coupled targets such as dispersion and loss under realistic simulation limits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same memory-policy structure could be tested on other inverse-design domains that rely on costly forward simulations.
Larger or more varied expert-trace collections would provide a direct test of whether the learned skills remain stable outside the original dataset.
Feeding fabrication outcomes back into the skill-evolution loop could close the loop between simulation and physical realization.

Load-bearing premise

The 479 expert interaction traces form a representative and unbiased basis for learning reusable skills, and reinforcement learning can reliably select and evolve those skills in the high-dimensional, multi-objective PCF design space.

What would settle it

If SkillPCF applied to the 479-trace dataset produces no measurable improvement in design quality per simulation call relative to the classical and LLM baselines, the central claim would be falsified.
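
One way to make this test concrete is a quality-per-call ratio. The sketch below is hypothetical: the function names, the choice of mean quality divided by mean calls, and the example inputs are illustrative and are not taken from the paper, which may define its metrics differently.

```python
def quality_per_call(qualities, calls):
    """Mean design quality divided by mean simulation calls per query."""
    if len(qualities) != len(calls) or not qualities:
        raise ValueError("need equal-length, non-empty inputs")
    mean_quality = sum(qualities) / len(qualities)
    mean_calls = sum(calls) / len(calls)
    return mean_quality / mean_calls


def improves(candidate: float, baseline: float, margin: float = 0.0) -> bool:
    """True if the candidate beats the baseline by more than a relative margin."""
    return candidate > baseline * (1.0 + margin)


# Illustrative numbers only: 1.0 marks a passed design, 0.0 a failed one.
agent = quality_per_call([1.0, 0.0, 1.0, 1.0], [1, 1, 2, 1])
baseline = quality_per_call([1.0, 0.0, 0.0, 1.0], [10, 10, 10, 10])
print(improves(agent, baseline))
```

Under this reading, the central claim would fail if `improves` returned `False` for every baseline once run-to-run variation is accounted for.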

Figures

Figures reproduced from arXiv: 2605.29421 by Shengchao Chen, Sufen Ren, Ting Shu.

**Figure 1.** Comparison of PCF inverse design paradigms. Left: Traditional numerical optimization requires domain expertise and exhaustive simulation with no knowledge retention. Middle: ML-based approaches enable fast one-shot prediction but lack iterative refinement and interpretability. Right: SkillPCF (ours) treats design as a multi-round interaction with a self-evolving memory agent, accumulating knowledge across …

**Figure 2.** SkillPCF framework overview. Design traces are decomposed into step-wise contexts. At each step t, the Controller encodes the design state (target specs, parameters, simulation outputs, and text span) together with retrieved memories, and selects Top-K skills from the Skill Bank. The Executor applies them to update the trace-specific Memory Bank. A Physics Environment supplies simulation-grounded rewards b…
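
The Top-K selection step named in the Figure 2 caption can be sketched as a simple scoring problem. This is an assumption-laden illustration: the dot-product scoring, the `top_k_skills` function, and the toy embeddings are all hypothetical, and the paper's Controller is a learned policy rather than a fixed similarity rule.

```python
import heapq


def top_k_skills(state_vec, skill_vecs, k):
    """Return the names of the k skills that score highest against the state.

    state_vec: list of floats encoding the design state (hypothetical encoding).
    skill_vecs: dict mapping skill name -> embedding of the same length.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scored = ((dot(state_vec, vec), name) for name, vec in skill_vecs.items())
    # nlargest returns the k best (score, name) pairs in descending order.
    return [name for _, name in heapq.nlargest(k, scored)]


# Toy skill bank with made-up names and embeddings.
bank = {
    "tune_pitch": [0.9, 0.1],
    "resize_core": [0.1, 0.9],
    "add_ring": [0.5, 0.5],
}
print(top_k_skills([1.0, 0.0], bank, k=2))
```

A learned controller would replace the dot product with a trained scoring network, but the selection interface (state in, K skill names out) could stay the same.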

**Figure 3.** PCFSkill dataset, which contains 479 interaction traces across 8 PCF families, comprising 2,507 spans (5.23 per trace; 393K tokens) with a 75.6% design success rate, plus 553 memory-dependent evaluation queries and 596 error logs for skill evolution. Evaluation Queries. We construct 553 memory-dependent queries with expert-annotated ground truth. Query types include design reasoning, trend prediction, visu…

**Figure 5.** Scaling under Qwen2.5-VL executor (3B/7B/32B/72B). … points, while the right panel reflects increased domain relevance through keyword frequency and operation usage. These results suggest that designer-triggered updates improve not only memory quantity but also its physical informativeness. Examples of refined and newly introduced skills are provided in Appendix D.

**Figure 6.** Memory dynamics during skill evolution. Left: memory growth across outer-loop. Right: keyword enrichment. 5.3. Case Studies. We conduct a case study on six challenging PCF design tasks (…

**Figure 8.** Hard-case stress test with real MEEP simulations across six PCF scenarios. Rows are design cases (H: hard, M: medium); columns are methods. Best viewed zoomed in and in color.

**Figure 9.** Zero-shot generalization to four novel PCF families never seen in training. Rows: ZS1 chiral-twisted hexagonal; ZS2 dual-core asymmetric coupler; ZS3 LMA Yb-doped; ZS4 SC-generation flat. Best viewed zoomed in and in color. … a learnable policy. By integrating skill selection, memory updates, and simulator verification, it transforms iterative traces into transferable optimization knowledge. Results show str…

**Figure 10.** Interactive SkillPCF design platform. Left: coupled physics verification and skill-evolution trajectory analysis, including parameter controls, synchronized structure/field, and round-level skill updates. Right: memory-augmented reasoning pipeline with design-context input, memory retrieval, generation, and verification feedback.

**Figure 11.** Efficiency comparison between SkillPCF and classical optimization methods.

**Figure 12.** Complex-family visual evidence for generated designs. Each row corresponds to one challenging PCF family (nested antiresonant, hybrid plasmonic, graded-index multiring). C. Implementation Details and Evaluation Metrics. C.1. Evaluation Metrics, Human Scoring, and LLM Judge Protocol. We employ a multi-level evaluation framework spanning text-based, reasoning, inverse design, physics, and human metrics. To ke…

**Figure 13.** Figure 13 presents a comprehensive comparison across six PCF design scenarios using real MEEP electromagnetic simulations. Each panel shows the mode field distribution (background), structure cross-section (inset), and key metrics. The cases span from challenging ultra-low-loss targets to moderate-difficulty standard designs.

**Figure 14.** Figure 14 visualizes the trade-off between design quality and computational efficiency. The Pareto frontier represents optimal operating points where no method can improve one metric without degrading the other. SkillPCF achieves the best efficiency-quality trade-off, requiring 100× fewer simulation calls than classical optimizers while maintaining competitive design quality. The Pareto frontier analysis shows that…
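
The Pareto frontier mentioned in the Figure 14 text can be computed with a short dominance check. The sketch below is illustrative only: the method names and the (calls, quality) values are made up and are not results from the paper. A method is kept if no other method is at least as good on both axes and strictly better on one.

```python
def pareto_front(points):
    """Return the non-dominated method names, sorted alphabetically.

    points: dict mapping method name -> (sim_calls, quality).
    Fewer simulation calls and higher quality are both better.
    """
    front = []
    for name, (calls, quality) in points.items():
        dominated = any(
            (c <= calls and q >= quality) and (c < calls or q > quality)
            for other, (c, q) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical operating points: (simulation calls per query, design quality).
methods = {
    "agent": (1, 0.90),
    "optimizer": (100, 0.95),
    "surrogate": (50, 0.80),
    "retrieval": (2, 0.85),
}
print(pareto_front(methods))  # ['agent', 'optimizer']
```

In this toy data the cheap method and the highest-quality method both sit on the frontier, which is why a frontier plot alone does not rank them; a budget or a quality floor is needed to choose between them.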

**Figure 15.** Zero-shot generalization to novel PCF structures never seen during training. Each row represents a new structural family; baseline methods are predominantly unsuccessful with occasional isolated passes, while SkillPCF consistently transfers learned physics principles. • ZS4 (SC-Generation Flat): Dual zero-dispersion wavelengths for supercontinuum generation that requires multiwavelength dispersion engine…

**Figure 16.** PCF family specialization analysis. Our SkillPCF (first column) maintains the highest success rate across all structural families. Cell values show design success rate (%). Family difficulty is annotated on the right colorbar. Future Directions. A natural research direction is to ask how broadly the memory-policy view of simulator-grounded design generalizes beyond fibers, whether the same closed-loop int…

Original abstract

Photonic crystal fiber (PCF) inverse design remains challenging because candidate geometries must satisfy coupled optical targets under expensive electromagnetic simulation. Existing pipelines improve surrogate prediction or one-shot parameter recommendation, but they do not accumulate reusable design knowledge across iterative trials. We formulate PCF inverse design as a memory-policy learning problem and propose SkillPCF, a closed-loop agent framework that combines a physics-guided memory skill bank, reinforcement-learned skill selection, and simulator-grounded skill evolution. We further construct a real-world dataset with 479 expert interaction traces (2,507 spans) and 553 memory-dependent evaluation queries covering dispersion engineering, loss optimization, and multi-objective design. Experiments across multiple LLM backbones and classical baselines show that SkillPCF achieves stronger design-quality and efficiency trade-offs under practical simulation budgets, demonstrating the effectiveness of our proposed memory-skill learning paradigm for physics-aware PCF inverse design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SkillPCF frames PCF inverse design as learning reusable memory policies from 479 expert traces, but the evidence for reliable skill transfer in high-dimensional spaces is still thin.

The paper's main contribution is recasting inverse design as a memory-policy learning task. Instead of one-shot surrogates or direct optimization, SkillPCF builds a physics-guided skill bank from expert traces, uses reinforcement learning to pick and evolve skills, and closes the loop with simulator feedback. They also release a dataset of 479 traces (2,507 spans) plus 553 evaluation queries on dispersion, loss, and multi-objective PCF problems.

That dataset and the closed-loop setup are the parts that stand out. Collecting real expert interaction traces is non-trivial, and testing across LLM backbones plus classical baselines gives a practical check on whether the memory approach improves quality-efficiency trade-offs under limited simulation budgets.

The soft spot is the representativeness of the traces and how well RL handles skill selection in the actual design space. If the 479 traces leave gaps in the dispersion-loss regimes or if evaluation queries overlap too much with training distributions, the reported gains could partly reflect coverage rather than the paradigm itself. High-dimensional multi-objective spaces are known to trip up RL exploration, and the abstract does not yet show error bars, ablation on trace diversity, or clear evidence that skills evolve productively on out-of-distribution cases.

This is for people already working on agentic or RL methods for photonic or electromagnetic design. The dataset could be picked up independently. The central claim is plausible but hinges on assumptions about trace coverage and RL reliability that need more scrutiny.

Send it to peer review. The formulation and data are new enough that referees can test the generalization questions directly.

Referee Report

2 major / 1 minor

Summary. The paper formulates photonic crystal fiber (PCF) inverse design as a memory-policy learning problem and introduces SkillPCF, an agentic framework combining a physics-guided memory skill bank (built from 479 expert interaction traces comprising 2,507 spans), reinforcement-learned skill selection, and simulator-grounded skill evolution. A dataset of 553 memory-dependent evaluation queries is constructed covering dispersion engineering, loss optimization, and multi-objective design. Experiments across LLM backbones and classical baselines report that SkillPCF yields improved design-quality versus efficiency trade-offs under practical simulation budgets.

Significance. If validated, the memory-skill paradigm could meaningfully advance agentic methods for expensive physics simulations by enabling reuse of design knowledge across trials, moving beyond surrogate or one-shot approaches in PCF design. The explicit construction of an expert-trace dataset and closed-loop evolution mechanism are concrete strengths that could support follow-on work in related inverse-design domains.

major comments (2)

[Dataset construction (implied in abstract and experiments)] The central claim that SkillPCF achieves stronger trade-offs rests on the 479 expert traces yielding reusable skills that RL can reliably select and evolve. However, no quantitative coverage analysis (e.g., span of dispersion, loss, and multi-objective regimes) or bias assessment of the traces is described, leaving open the possibility that evaluation gains reflect distributional overlap with the 553 queries rather than generalization.
[RL skill selection and evolution (implied in method and experiments)] The assertion that reinforcement learning reliably performs skill selection and evolution in the high-dimensional, multi-objective PCF parameter space lacks supporting ablations or exploration metrics. Combinatorial selection in such spaces is prone to local optima; without evidence that the RL component overcomes this under limited simulation budgets, the efficiency gains cannot be attributed to the memory-policy paradigm.
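
A first-pass version of the coverage analysis requested in the first major comment could look like the sketch below. It is a deliberately crude, one-dimensional illustration: the function name, the idea of comparing a single scalar target per trace and per query, and the sample values are all assumptions, and a real analysis would need to cover the joint dispersion-loss space.

```python
def coverage_fraction(trace_values, query_values):
    """Fraction of query targets that fall inside the range spanned by the traces."""
    if not trace_values or not query_values:
        raise ValueError("both inputs must be non-empty")
    lo, hi = min(trace_values), max(trace_values)
    inside = sum(lo <= v <= hi for v in query_values)
    return inside / len(query_values)


# Made-up scalar targets (for example, a dispersion value per design task).
trace_targets = [1.0, 2.0, 5.0]
query_targets = [0.0, 1.5, 3.0, 6.0]
print(coverage_fraction(trace_targets, query_targets))  # 0.5
```

A fraction near 1.0 would suggest the evaluation queries mostly interpolate within the trace distribution, which is the overlap scenario the referee raises; a low fraction would indicate that the queries probe regimes the traces never visited.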

minor comments (1)

The abstract states results 'across multiple LLM backbones and classical baselines' without naming the specific models or baselines; adding these (and any hyperparameter details) would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, providing clarifications and committing to revisions where the manuscript can be strengthened without misrepresenting our contributions.

Referee: The central claim that SkillPCF achieves stronger trade-offs rests on the 479 expert traces yielding reusable skills that RL can reliably select and evolve. However, no quantitative coverage analysis (e.g., span of dispersion, loss, and multi-objective regimes) or bias assessment of the traces is described, leaving open the possibility that evaluation gains reflect distributional overlap with the 553 queries rather than generalization.

Authors: We acknowledge that the original manuscript does not include a quantitative coverage analysis or explicit bias assessment of the 479 expert traces. The traces were collected from expert interactions targeting the three regimes (dispersion engineering, loss optimization, multi-objective design) and the 553 queries were constructed as memory-dependent and held-out. To directly address the concern regarding potential overlap versus generalization, we will add a new subsection with coverage statistics (parameter spans, objective distributions) and a bias assessment in the revised manuscript. revision: yes
Referee: The assertion that reinforcement learning reliably performs skill selection and evolution in the high-dimensional, multi-objective PCF parameter space lacks supporting ablations or exploration metrics. Combinatorial selection in such spaces is prone to local optima; without evidence that the RL component overcomes this under limited simulation budgets, the efficiency gains cannot be attributed to the memory-policy paradigm.

Authors: The reported experiments compare SkillPCF against classical baselines and LLM variants without the full memory-policy components, showing consistent improvements in design-quality versus efficiency trade-offs. However, we agree that dedicated ablations isolating the RL skill selection module and exploration metrics (e.g., skill usage entropy, convergence behavior) are absent. We will incorporate these ablations and metrics in the revision to provide direct evidence that the RL component contributes to overcoming local optima under the given budgets. revision: yes
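
The skill usage entropy that the authors propose as an exploration metric can be read as the Shannon entropy of the empirical skill-selection distribution. The sketch below assumes that reading; the function name and the sample selections are hypothetical, and the revised manuscript may define the metric differently.

```python
import math
from collections import Counter


def skill_usage_entropy(selections):
    """Shannon entropy (in bits) of how often each skill was selected."""
    counts = Counter(selections)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Two skills used equally often give 1 bit; a single skill gives 0 bits.
print(skill_usage_entropy(["tune_pitch", "tune_pitch", "resize_core", "resize_core"]))
print(skill_usage_entropy(["tune_pitch"] * 4))
```

Entropy near zero would indicate that the policy has collapsed onto one skill, which is one symptom of the local-optima problem the referee describes, while entropy near the maximum (log2 of the bank size) would indicate near-uniform, possibly undirected, selection.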

Circularity Check

0 steps flagged

No circularity; empirical framework evaluated on constructed dataset

The paper formulates PCF inverse design as a memory-policy learning problem and evaluates SkillPCF empirically on a dataset of 479 expert traces and 553 queries. No derivation chain, equations, or predictions are claimed that reduce to fitted inputs, self-definitions, or self-citation load-bearing premises. Results are presented as experimental outcomes under simulation budgets rather than tautological outputs of the method itself. The central claim rests on observed design-quality trade-offs, which are falsifiable against baselines and do not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or newly postulated entities.

pith-pipeline@v0.9.1-grok · 5680 in / 1077 out tokens · 38035 ms · 2026-06-29T07:59:59.632964+00:00 · methodology

Reference graph

Works this paper leans on

16 extracted references · 13 canonical work pages · 10 internal anchors

[1] Qwen Technical Report
Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609.

[2] Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Chhikara, P., Khant, D., Aryan, S., Singh, T., and Yadav, D. Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413.

[3] Du, Y., Huang, W., Zheng, D., Wang, Z., Montella, S., Lapata, M., Wong, K.-F., and Pan, J. Z. Rethinking memory in llm based agents: Representations, operations, and emerging topics. arXiv preprint arXiv:2505.00675.

[4] GPT-4o System Card
Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276.

[5] Memory os of ai agent
Kang, J., Ji, M., Zhao, Z., and Bai, T. Memory os of ai agent. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 25972–25981, 2025.

[6] A human-inspired reading agent with gist memory of very long contexts
Lee, K.-H., Chen, X., Furuta, H., Canny, J., and Fischer, I. A human-inspired reading agent with gist memory of very long contexts. arXiv preprint arXiv:2402.09727, 2024.

[7] MiniMax-01: Scaling Foundation Models with Lightning Attention
Li, A., Gong, B., Yang, B., Shan, B., Liu, C., Zhu, C., Zhang, C., Guo, C., Chen, D., Li, D., et al. Minimax-01: Scaling foundation models with lightning attention. arXiv preprint arXiv:2501.08313.

[8] Decoupled Weight Decay Regularization
Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

[9] Choosing how to remember: Adaptive memory structures for llm agents
Lu, M., Wu, M., Liu, F., Xu, J., Li, W., Wang, H., Hu, Z., Ding, Y., Sun, Y., Lu, J., et al. Choosing how to remember: Adaptive memory structures for llm agents. arXiv preprint arXiv:2602.14038.

[10] Proximal Policy Optimization Algorithms
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[11] LLaMA: Open and Efficient Foundation Language Models
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

[12] Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
Wang, G., Liu, J., Chen, S., and Ren, S. Optimizing low-resolution spectral demodulation for long-period fiber gratings using residual convolutional neural networks. Optics Express, 33(4):8225–8238, 2025a. Wang, G., Liu, J., Chen, S., and Ren, S. Towards scalable and accurate property prediction for photonic crystal fibers with federated learning. Optic...

[13] SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning
Xia, P., Chen, J., Wang, H., Liu, J., Zeng, K., Wang, Y., Han, S., Zhou, Y., Zhao, X., Chen, H., et al. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234.

[14] Chain-of-note: Enhancing robustness in retrieval-augmented language models
Yu, W., Zhang, H., Pan, X., Cao, P., Ma, K., Li, J., Wang, H., and Yu, D. Chain-of-note: Enhancing robustness in retrieval-augmented language models. In Proceedings of the 2024 conference on empirical methods in natural language processing, pp. 14672–14685, 2024.

[15] MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents
Zhang, H., Long, Q., Bao, J., Feng, T., Zhang, W., Yue, H., and Wang, W. Memskill: Learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474.

[16] Learning Design Skills as Memory Policies for Agentic Photonic Inverse Design, Appendix. For readability and fast lookup, we organize the appendix into six blocks with direct hyperlinks: • Appendix A: System, Data, and Evaluation Protocols • Appendix B: Additional Case Studies • Appendix C: Implementation Details and Evaluation Metrics • Appendix D: Init...

2024
