Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.
hub Canonical reference
Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744
Canonical reference. 88% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
FORGE benchmark shows domain-specific knowledge, not visual grounding, is the main bottleneck for MLLMs in manufacturing, with SFT on a 3B model delivering up to 90.8% relative accuracy improvement on held-out scenarios.
GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.
OpenAaaS is a hierarchical agent-as-a-service system that enables secure multi-agent collaboration for materials informatics by moving code to data rather than data to code.
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.
A shallow dense Transformer achieves uniform epsilon-approximation of alpha-Holder functions with O(epsilon^{-d/alpha}) parameters and near-minimax generalization error O(n^{-2alpha/(2alpha+d)} log n).
SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.
Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks when representations are correct.
ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.
Presents Instant3D for rapid text/image-to-3D generation via multi-view diffusion plus feed-forward reconstruction, and FastMap for 10x faster structure-from-motion with comparable accuracy.
MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.
JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.
Generative AI should be evaluated through computational hermeneutics using iterative, human-inclusive benchmarks that measure cultural context rather than isolated model outputs.
NoisyGRPO is an RL framework that perturbs visual inputs with Gaussian noise for exploration and computes trajectory advantages via Bayesian posterior fusion of noise prior and reward likelihood to improve multimodal CoT generalization.
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.
citing papers explorer
-
Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation
Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.
-
FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
FORGE benchmark shows domain-specific knowledge, not visual grounding, is the main bottleneck for MLLMs in manufacturing, with SFT on a 3B model delivering up to 90.8% relative accuracy improvement on held-out scenarios.
-
Group-in-Group Policy Optimization for LLM Agent Training
GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.
-
OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research
OpenAaaS is a hierarchical agent-as-a-service system that enables secure multi-agent collaboration for materials informatics by moving code to data rather than data to code.
-
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.
-
Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity
A shallow dense Transformer achieves uniform epsilon-approximation of alpha-Holder functions with O(epsilon^{-d/alpha}) parameters and near-minimax generalization error O(n^{-2alpha/(2alpha+d)} log n).
-
SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models
SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.
-
Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks when representations are correct.
-
Grounded Reinforcement Learning for Visual Reasoning
ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.
-
Efficient 3D Content Reconstruction and Generation
Presents Instant3D for rapid text/image-to-3D generation via multi-view diffusion plus feed-forward reconstruction, and FastMap for 10x faster structure-from-motion with comparable accuracy.
-
MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models
MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.
-
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.
-
Computational Hermeneutics: Evaluating generative AI as a cultural technology
Generative AI should be evaluated through computational hermeneutics using iterative, human-inclusive benchmarks that measure cultural context rather than isolated model outputs.
-
NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation
NoisyGRPO is an RL framework that perturbs visual inputs with Gaussian noise for exploration and computes trajectory advantages via Bayesian posterior fusion of noise prior and reward likelihood to improve multimodal CoT generalization.
-
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.