DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
OpenCodeInstruct: A large-scale instruction tuning dataset for code LLMs.arXiv preprint
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
dataset 2polarities
use dataset 2representative citing papers
SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.
FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.
STITCH trains superior agentic coding and reasoning LLMs by using fewer high-quality trajectories filtered to keep only critical decision tokens, delivering up to 63% relative gains on SWE-bench Verified.
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
citing papers explorer
-
DMax: Aggressive Parallel Decoding for dLLMs
DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
-
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.
-
Robust Policy Optimization to Prevent Catastrophic Forgetting
FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.
-
Yet Even Less Is Even Better For Agentic, Reasoning, and Coding LLMs
STITCH trains superior agentic coding and reasoning LLMs by using fewer high-quality trajectories filtered to keep only critical decision tokens, delivering up to 63% relative gains on SWE-bench Verified.
-
NVIDIA Nemotron 3: Efficient and Open Intelligence
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.