MLFMs combine masking with continuous flows to scale flow-based language models to reasoning and instruction-following tasks on GSM8K and MT-Bench.
OpenCodeInstruct: A large-scale instruction tuning dataset for code LLMs. arXiv preprint
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: dataset (2)
citation-polarity summary: use dataset (2)
representative citing papers
CodeBlock partitions code responses into syntactically coherent blocks, scores them with generalized cross-entropy and data-flow signals, and applies sparse supervision to achieve higher pass@1 than full SFT using 1.9% of tokens on six benchmarks.
PrivCode++ introduces the first differentially private (DP) code generation method that protects both prompts and code via latent-conditioned two-stage training, claiming higher utility and stronger privacy than prior baselines.
DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in diffusion LLMs (dLLMs), raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
Self-CTRL uses RL to align LM self-explanations with behavior, boosting bias correlation to R²=0.64 and refusal prediction accuracy to 92% while cutting harm failures to 0.5%.
Grammar-constrained decoding (GCD) enables a new jailbreak (CodeSpear) that elicits malicious code from LLMs; it is countered by CodeShield, which trains models to output harmless honeypot code under GCD while preserving refusals.
LLMs frequently reverse their stated coding preferences when shown actual code instead of descriptions, show positional bias, and produce more polarized ratings than human experts on complexity, commenting, modularity, and readability.
SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.
FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.
STITCH trains superior agentic coding and reasoning LLMs by using fewer high-quality trajectories filtered to keep only critical decision tokens, delivering up to 63% relative gains on SWE-bench Verified.
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
An empirical study of five LLMs finds that pretrained-to-aligned paths yield larger gains over baseline than finetuned-to-aligned paths, though absolute accuracy remains lower for pretrained starts.
citing papers explorer
Subjective Code Preferences in Experts and Large Language Models
LLMs frequently reverse their stated coding preferences when shown actual code instead of descriptions, show positional bias, and produce more polarized ratings than human experts on complexity, commenting, modularity, and readability.