Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.
hub Canonical reference
CoRR , volume =
Canonical reference. 89% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
KAT detects persistent low-KL agreement traps in on-policy distillation via a dynamic threshold to filter weak supervision, improving avg@k by 2.66% and pass@k by 3.43% on four math benchmarks while shortening rollouts by 59.73%.
S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
RiVER applies calibrated ranking rewards from execution scores to train LLMs on score-based tasks without ground-truth, producing gains on both heuristic contests and exact-solution coding benchmarks.
CRPO extends group relative policy optimization with stage-dependent uncertainty modeling and reports a 10.4 percentage point weighted F1 gain over RL baselines across 8 mental health datasets.
Soft-prompt tuning with 10 vectors improves format compliance on LLM benchmarks and provides a low-cost proxy for comparing base models.
GPRL carries a k-dimensional skew-symmetric preference structure into policy updates with per-dimension advantages and a drift monitor, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llama-3-8B-Instruct while outperforming SimPO and SPPO on other benchmarks.
Forgetting in LLM continual post-training is a geometry conflict between task-induced covariance structures and the evolving model state, controlled by gating Wasserstein barycenter merging on measured conflict.
CLR-voyance reformulates inpatient reasoning as POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.
Perturbing prefixes to semantic neighbors during training creates a hierarchical noise model that improves language model predictions on token sequences outside the training corpus support.
Non-model gains via inference, systems, and assets can drive AI capabilities independently of base models, requiring governance beyond model-level evaluation and mitigation.
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
ETS performs training-free RL alignment for language models by energy-guided test-time scaling with Monte Carlo energy estimation and importance sampling acceleration.
SPaCe uses semantic clustering to shrink training sets and a multi-armed bandit to adaptively select samples, matching or beating baselines on reasoning benchmarks with up to 100x fewer examples.
A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.
EfficientRollout applies self-speculative decoding with quantized drafter induction and system-aware acceptance policies to cut RL rollout latency up to 19.6% while preserving final model quality.
PriFT uses token reweighting signals from a frozen pretrained model to stabilize SFT and achieve better results than standard SFT baselines on reasoning tasks.
Thinking-RFT improves Theory of Mind accuracy by 6% over SFT on shortcut-free datasets, with 10% gains on higher-order reasoning and better generalization to new domains.
IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
AI data firms view human expertise as an extractable, low-cost resource to feed AI systems while treating institutional expertise as something needing liberation or reform to fit this model.
CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-critical settings.
SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.
citing papers explorer
No citing papers match the current filters.