The LLM-as-Environment-Engineer framework lets the policy model redesign its own RL environments on the new MAPF-FrozenLake testbed, outperforming larger models and fixed baselines with Qwen3-4B.
CoRR , volume =
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CL 2years
2026 2representative citing papers
CoT SFT disrupts long-range routing in hybrid models via changes to W_Q and W_K; QK-Restore restores pre-SFT projections to recover NIAH performance.
citing papers explorer
-
From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning
The LLM-as-Environment-Engineer framework lets the policy model redesign its own RL environments on the new MAPF-FrozenLake testbed, outperforming larger models and fixed baselines with Qwen3-4B.
-
Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It
CoT SFT disrupts long-range routing in hybrid models via changes to W_Q and W_K; QK-Restore restores pre-SFT projections to recover NIAH performance.