SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
Unnatural instructions: Tuning language models with (almost) no human labor
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
A literature review concludes that pursuing consensus in data annotation creates biased AI by dismissing subjective disagreements and enforcing geographic hegemony, and proposes mapping diversity instead.
citing papers explorer
-
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
The Consensus Trap: Dissecting Subjectivity and the "Ground Truth" Illusion in Data Annotation
A literature review concludes that pursuing consensus in data annotation creates biased AI by dismissing subjective disagreements and enforcing geographic hegemony, and proposes mapping diversity instead.