SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.
arXiv preprint arXiv:2404.02078 , year=
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform them and generalize to new domains.
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
SPEX delivers 1.2-3x speedup on ToT algorithms via speculative path selection, dynamic budget allocation, and adaptive early termination, reaching up to 4.1x when combined with token-level speculative decoding.
PaTaRM converts pairwise preference data into pointwise reward signals via a novel PAR mechanism and task-adaptive rubrics, reporting 8.7% gains on RewardBench/RMBench and 13.6% relative RLHF improvement.
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
Data-centric filtering yields an 80K preference dataset and reward models that lead RewardBench while boosting other top entries.
citing papers explorer
-
Training Software Engineering Agents and Verifiers with SWE-Gym
SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.
-
CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems
CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform them and generalize to new domains.
-
Video-R1: Reinforcing Video Reasoning in MLLMs
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
-
Holder Policy Optimisation
HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
-
Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration
SPEX delivers 1.2-3x speedup on ToT algorithms via speculative path selection, dynamic budget allocation, and adaptive early termination, reaching up to 4.1x when combined with token-level speculative decoding.
-
PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling
PaTaRM converts pairwise preference data into pointwise reward signals via a novel PAR mechanism and task-adaptive rubrics, reporting 8.7% gains on RewardBench/RMBench and 13.6% relative RLHF improvement.
-
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
-
Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
Data-centric filtering yields an 80K preference dataset and reward models that lead RewardBench while boosting other top entries.