SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.
arxiv preprint arXiv:2404.02823 , year=
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
baseline 1polarities
baseline 1representative citing papers
ImpRIF improves LLM complex instruction following by synthesizing data from reasoning graphs and training models to reason explicitly along those graphs.
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
MAGIC-HMO is a multi-agent framework that treats Chinese short-form creative NLG as heterogeneous multi-objective optimization over personalized constraints plus explanation reliability and outperforms baselines on a baby-naming benchmark.
A label-free self-supervised RL method derives rewards from instructions via constraint decomposition and binary classification, yielding improvements on in-domain and out-of-domain instruction-following tasks.
citing papers explorer
-
SEIF: Self-Evolving Reinforcement Learning for Instruction Following
SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.
-
ImpRIF: Stronger Implicit Reasoning Leads to Better Complex Instruction Following
ImpRIF improves LLM complex instruction following by synthesizing data from reasoning graphs and training models to reason explicitly along those graphs.
-
Process Reinforcement through Implicit Rewards
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
-
Chinese Short-Form Creative Content Generation via Explanation-Oriented Multi-Objective Optimization
MAGIC-HMO is a multi-agent framework that treats Chinese short-form creative NLG as heterogeneous multi-objective optimization over personalized constraints plus explanation reliability and outperforms baselines on a baby-naming benchmark.
-
Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following
A label-free self-supervised RL method derives rewards from instructions via constraint decomposition and binary classification, yielding improvements on in-domain and out-of-domain instruction-following tasks.