Dystruct formulates flexible-length generation in diffusion language models as a dynamic structural inference problem solved via Bayesian integration of local uncertainty and structural signals.
hub Tool reference
Training verifiers to solve math word problems
Tool reference. 100% of classified Pith citations use this work as a method, library, or software dependency, not as a substantive claim.
hub tools
citation-role summary
citation-polarity summary
roles
dataset 7polarities
use dataset 7representative citing papers
A single LoRA adapter placed at the gradient-energy-dominant shallow FFN module outperforms distributed LoRA across instruction, math, code, and conversation tasks.
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
Learning-Zone Energy is a new online data selection framework for RL post-training that retains 40% of data per step yet matches or exceeds full-data baselines on math tasks with 36% lower FLOPs.
CTF4Nuclear proposes a common task framework for benchmarking ML methods on nuclear engineering datasets using 12 metrics and a new sparse-measurement system monitoring paradigm.
Verifier-backed committee search boosts a weak reasoning model from 67% to 76.4% on SWE-bench Verified, matching stronger models by using local soundness signals to select among proposals.
EMO pretrains MoEs using document boundaries to induce semantic expert specialization, enabling modular subset deployment with minimal accuracy loss unlike standard MoEs.
Template collapse is a distinct failure mode in agentic RL invisible to entropy; mutual information proxies diagnose it better and SNR-aware filtering using reward variance improves input-dependent reasoning and task performance across planning, math, navigation, and code tasks.
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
Gradual fine-tuning that removes explicit CoT steps lets GPT-2 Small reach 99% accuracy on 9x9 multiplication and Mistral 7B exceed 50% on GSM8K with no intermediate outputs.
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
Squeeze Evolve is a multi-model orchestration framework that improves efficiency and performance in verifier-free evolutionary inference, cutting costs up to 3x while matching verifier-based methods on several benchmarks.
Flux Attention uses a context-aware Layer Router to dynamically assign full or sparse attention to each LLM layer, achieving up to 2.8x prefill and 2.0x decode speedups with competitive performance on long-context and reasoning tasks.
STELLA aligns ESM3 bimodal sequence-structure encodings with Llama-3.1-8B text modeling to claim state-of-the-art results on protein functional description prediction and enzyme-catalyzed reaction prediction.
Rule-based RL on 5K logic puzzles induces advanced reasoning in a 7B model that transfers to AIME and AMC.
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.
Agentic RAG embeds agents with reflection, planning, tool use, and collaboration into retrieval pipelines to overcome static RAG limitations, and the survey offers a taxonomy by agent count, control, autonomy, and knowledge representation plus applications and open challenges.
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.
citing papers explorer
-
Dystruct: Dynamically Structured Diffusion Language Model Decoding via Bayesian Inference
Dystruct formulates flexible-length generation in diffusion language models as a dynamic structural inference problem solved via Bayesian integration of local uncertainty and structural signals.
-
Rethinking Adapter Placement: A Dominant Adaptation Module Perspective
A single LoRA adapter placed at the gradient-energy-dominant shallow FFN module outperforms distributed LoRA across instruction, math, code, and conversation tasks.
-
GAIA: a benchmark for General AI Assistants
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
-
Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
Learning-Zone Energy is a new online data selection framework for RL post-training that retains 40% of data per step yet matches or exceeds full-data baselines on math tasks with 36% lower FLOPs.
-
CTF4Nuclear: Common Task Framework for Nuclear Fission and Fusion Models
CTF4Nuclear proposes a common task framework for benchmarking ML methods on nuclear engineering datasets using 12 metrics and a new sparse-measurement system monitoring paradigm.
-
Agentic Systems as Boosting Weak Reasoning Models
Verifier-backed committee search boosts a weak reasoning model from 67% to 76.4% on SWE-bench Verified, matching stronger models by using local soundness signals to select among proposals.
-
EMO: Pretraining Mixture of Experts for Emergent Modularity
EMO pretrains MoEs using document boundaries to induce semantic expert specialization, enabling modular subset deployment with minimal accuracy loss unlike standard MoEs.
-
RAGEN-2: Reasoning Collapse in Agentic RL
Template collapse is a distinct failure mode in agentic RL invisible to entropy; mutual information proxies diagnose it better and SNR-aware filtering using reward variance improves input-dependent reasoning and task performance across planning, math, navigation, and code tasks.
-
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
-
From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step
Gradual fine-tuning that removes explicit CoT steps lets GPT-2 Small reach 99% accuracy on 9x9 multiplication and Mistral 7B exceed 50% on GSM8K with no intermediate outputs.
-
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
-
Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution
Squeeze Evolve is a multi-model orchestration framework that improves efficiency and performance in verifier-free evolutionary inference, cutting costs up to 3x while matching verifier-based methods on several benchmarks.
-
Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
Flux Attention uses a context-aware Layer Router to dynamically assign full or sparse attention to each LLM layer, achieving up to 2.8x prefill and 2.0x decode speedups with competitive performance on long-context and reasoning tasks.
-
STELLA: A Multimodal LLM for Protein Functional Annotation via Unified Sequence-Structure Encoding
STELLA aligns ESM3 bimodal sequence-structure encodings with Llama-3.1-8B text modeling to claim state-of-the-art results on protein functional description prediction and enzyme-catalyzed reaction prediction.
-
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
Rule-based RL on 5K logic puzzles induces advanced reasoning in a 7B model that transfers to AIME and AMC.
-
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.
-
Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG
Agentic RAG embeds agents with reflection, planning, tool use, and collaboration into retrieval pipelines to overcome static RAG limitations, and the survey offers a taxonomy by agent count, control, autonomy, and knowledge representation plus applications and open challenges.
-
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.
- Language Modeling with Hyperspherical Flows