Yuvion LLM applies adversarially aware training and introduces the YLRE benchmark set, claiming superior safety robustness over larger models on multiple tasks.
Air: A systematic analysis of annotations, instructions, and response pairs in preference dataset
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
representative citing papers
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
citing papers explorer
-
A Survey of Reinforcement Learning for Large Reasoning Models
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.