QwQ-32B: Embracing the power of reinforcement learning, March 2025

Qwen Team · 2025

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Using large language models for embodied planning introduces systematic safety risks

cs.AI · 2026-04-20 · unverdicted · novelty 7.0

LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.

GRIT: Teaching MLLMs to Think with Images

cs.CV · 2025-05-21 · unverdicted · novelty 7.0

GRIT introduces a grounded reasoning paradigm for MLLMs where reasoning chains interleave text and bounding boxes, trained via GRPO-GR reinforcement learning on as few as 20 examples without annotations.

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

cs.LG · 2025-04-29 · accept · novelty 7.0

One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.

Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models

cs.CL · 2025-03-27 · unverdicted · novelty 7.0

OlymMATH is a 350-problem Olympiad math benchmark combining bilingual natural-language evaluation with Lean 4 formal verification to test LLM reasoning.

From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

LogiHard hardens reasoning benchmarks by transforming 0-order selection into 2-order judgment, causing 31-56% accuracy drops in 12 frontier LLMs and a 47% drop on zero-shot MMLU, revealing a combinatorial reasoning gap rather than knowledge deficits.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

cs.LG · 2026-04-14 · unverdicted · novelty 6.0 · 2 refs

Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

Visual Reasoning Agent: Robust Vision Systems in Remote Sensing via Inference-Time Scaling

cs.CV · 2025-09-19 · unverdicted · novelty 5.0

VRA is a training-free agentic framework that orchestrates off-the-shelf LVLMs with a reasoning model via iterative verification and refinement, raising accuracy on remote sensing VQA from 52.8% to 78.8% and delivering up to 40.67% gains on hard question types.

MAC: Masked Agent Collaboration Boosts Large Language Model Medical Decision-Making

cs.AI · 2025-07-25 · unverdicted · novelty 5.0

MAC framework selects Pareto-optimal LLM agents and masks low cross-consistency outputs for adaptive collaboration in medical decision-making.

Perovskite-R1: a domain-specialized large language model for intelligent discovery of precursor additives and experimental design

cs.LG · 2025-07-22 · unverdicted · novelty 4.0

A fine-tuned LLM called Perovskite-R1, built from curated perovskite literature and material libraries, proposes precursor additives and designs with some experimental validation showing improved stability and performance.

Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary

cs.AI · 2025-06-01 · unverdicted · novelty 4.0

Agents should invoke external tools only when epistemically necessary, per the introduced Theory of Agent framework that frames tool use as a decision under uncertainty.

Skywork Open Reasoner 1 Technical Report

cs.LG · 2025-05-28 · conditional · novelty 4.0

Skywork-OR1 uses RL on distilled CoT models to lift math and coding benchmark accuracy by 13-15 points while open-sourcing everything.

citing papers explorer

Showing 11 of 11 citing papers.

Using large language models for embodied planning introduces systematic safety risks
cs.AI · 2026-04-20 · unverdicted · none · ref 21
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.

GRIT: Teaching MLLMs to Think with Images
cs.CV · 2025-05-21 · unverdicted · none · ref 4
GRIT introduces a grounded reasoning paradigm for MLLMs where reasoning chains interleave text and bounding boxes, trained via GRPO-GR reinforcement learning on as few as 20 examples without annotations.

Reinforcement Learning for Reasoning in Large Language Models with One Training Example
cs.LG · 2025-04-29 · accept · none · ref 62
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.

Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
cs.CL · 2025-03-27 · unverdicted · none · ref 23
OlymMATH is a 350-problem Olympiad math benchmark combining bilingual natural-language evaluation with Lean 4 formal verification to test LLM reasoning.

From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs
cs.CL · 2026-05-08 · unverdicted · none · ref 29
LogiHard hardens reasoning benchmarks by transforming 0-order selection into 2-order judgment, causing 31-56% accuracy drops in 12 frontier LLMs and a 47% drop on zero-shot MMLU, revealing a combinatorial reasoning gap rather than knowledge deficits.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG · 2026-04-14 · unverdicted · none · ref 65 · 2 links
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

Visual Reasoning Agent: Robust Vision Systems in Remote Sensing via Inference-Time Scaling
cs.CV · 2025-09-19 · unverdicted · none · ref 12
VRA is a training-free agentic framework that orchestrates off-the-shelf LVLMs with a reasoning model via iterative verification and refinement, raising accuracy on remote sensing VQA from 52.8% to 78.8% and delivering up to 40.67% gains on hard question types.

MAC: Masked Agent Collaboration Boosts Large Language Model Medical Decision-Making
cs.AI · 2025-07-25 · unverdicted · none · ref 41
MAC framework selects Pareto-optimal LLM agents and masks low cross-consistency outputs for adaptive collaboration in medical decision-making.

Perovskite-R1: a domain-specialized large language model for intelligent discovery of precursor additives and experimental design
cs.LG · 2025-07-22 · unverdicted · none · ref 34
A fine-tuned LLM called Perovskite-R1, built from curated perovskite literature and material libraries, proposes precursor additives and designs with some experimental validation showing improved stability and performance.

Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary
cs.AI · 2025-06-01 · unverdicted · none · ref 81
Agents should invoke external tools only when epistemically necessary, per the introduced Theory of Agent framework that frames tool use as a decision under uncertainty.

Skywork Open Reasoner 1 Technical Report
cs.LG · 2025-05-28 · conditional · none · ref 25
Skywork-OR1 uses RL on distilled CoT models to lift math and coding benchmark accuracy by 13-15 points while open-sourcing everything.
