Self-play with execution feedback: Improving instruction-following capabilities of large language models
10 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
  COVERT generates verifiable synthetic tool-use environments for RL through validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementary to SFT.
- Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution
  IFCodeEvolve synthesizes coding data via actor-schema co-evolution with MCTS, boosting a 32B model's performance to match proprietary SOTA on instruction following.
- Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs
  TPAW uses teams of current and historical model checkpoints that collaborate and compete, plus adaptive weightings for responses and players, to improve self-supervised LLM alignment and outperform baselines.
- Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
  Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
- MulDimIF: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models
  MulDimIF introduces a multi-dimensional constraint framework and generation pipeline that reveals sharp performance drops in LLMs as instruction complexity rises and shows targeted training gains from attention module updates.
- Process Reinforcement through Implicit Rewards
  PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
- Search-o1: Agentic Search-Enhanced Large Reasoning Models
  Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding, and QA tasks.
- Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following
  A label-free self-supervised RL method derives rewards from instructions via constraint decomposition and binary classification, yielding improvements on in-domain and out-of-domain instruction-following tasks.
- Seed1.5-VL Technical Report
  Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
- Qwen2.5 Technical Report
  Qwen2.5 LLMs scale pre-training data to 18 trillion tokens and apply multistage reinforcement learning, achieving benchmark performance competitive with models up to 5 times larger.