SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.
Openmathinstruct-1: A 1.8 million math instruction tuning dataset
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
dataset 1polarities
use dataset 1representative citing papers
Step-DPO performs preference optimization on individual reasoning steps rather than complete answers, producing nearly 3% accuracy gains on MATH for 70B+ parameter models with 10K preference pairs.
LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.
KRPO uses a Kalman filter to estimate latent prompt-level reward baselines from per-group rewards in GRPO, yielding better reward curves and accuracy on math reasoning benchmarks.
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
citing papers explorer
-
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.
-
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
Step-DPO performs preference optimization on individual reasoning steps rather than complete answers, producing nearly 3% accuracy gains on MATH for 70B+ parameter models with 10K preference pairs.
-
LACE: Lattice Attention for Cross-thread Exploration
LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.
-
Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning
KRPO uses a Kalman filter to estimate latent prompt-level reward baselines from per-group rewards in GRPO, yielding better reward curves and accuracy on math reasoning benchmarks.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.