Measuring Mathematical Problem Solving With the MATH Dataset

Akul Arora, Collin Burns, Dan Hendrycks, Eric Tang, Saurav Kadavath, Steven Basart · 2021 · cs.LG · arXiv 2103.03874

Baseline reference: 54% of classified citations from Pith papers use this work as a benchmark or comparison.

367 Pith papers cite it.

abstract

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

citation-role summary

dataset 47 · background 32 · method 1

citation-polarity summary

use dataset 43 · background 32 · unclear 4 · use method 1
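
The 54% baseline figure quoted on this page appears to match the polarity counts: 43 "use dataset" citations out of 80 classified citations. The page does not state how the share is computed, so the sketch below assumes it is the "use dataset" count divided by the sum of all polarity counts. The names `polarity_counts` and `share` are illustrative and do not come from Pith.

```python
# Counts copied from the citation-polarity summary above.
polarity_counts = {
    "use dataset": 43,
    "background": 32,
    "unclear": 4,
    "use method": 1,
}


def share(counts, label):
    """Return the fraction of classified citations carrying `label`.

    Assumption: the denominator is the sum of all polarity counts.
    """
    total = sum(counts.values())
    return counts[label] / total


total_classified = sum(polarity_counts.values())
baseline_share = share(polarity_counts, "use dataset")

# Report the share as a whole-number percentage, as the page does.
print(f"{baseline_share:.0%} of {total_classified} classified citations")
```

Under that assumption the share rounds to 54%. The citation-role counts (47, 32, 1) also sum to 80, so both summaries cover the same number of classified citations.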

authors

Akul Arora, Collin Burns, Dan Hendrycks, Eric Tang, Saurav Kadavath, Steven Basart

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions

cs.CL · 2026-04-08 · unverdicted · novelty 8.0

A nine-dimension algebraic complexity framework shows that LLMs suffer a scale-invariant working memory bottleneck, collapsing at 20-30 parallel branches regardless of parameter count from 8B to 235B.

PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

q-fin.CP · 2026-04-03 · conditional · novelty 8.0

Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

cs.AI · 2026-04-02 · unverdicted · novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

cs.AI · 2026-03-30 · conditional · novelty 8.0

SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Training Software Engineering Agents and Verifiers with SWE-Gym

cs.SE · 2024-12-30 · conditional · novelty 8.0

SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

cs.AI · 2021-08-31 · accept · novelty 8.0

MiniF2F is a new cross-system benchmark containing 488 Olympiad-level mathematics problems formalized in Metamath, Lean, Isabelle, and HOL Light, together with baseline results from a GPT-3-based prover.

When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

cs.AI · 2026-06-23 · unverdicted · novelty 7.0

TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

D³ introduces a dynamic directional graph-constrained framework that models sample interactions via loss dependencies to derive an optimized training sequence for LLMs.

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.

Compositional Generalization in Autoregressive Models via Logit Composition

cs.LG · 2026-05-27 · unverdicted · novelty 7.0

Logit composition of autoregressive models is projective under factorized conditionals, preserved under smooth reparameterizations, and maintains length generalization when assumptions hold uniformly.

RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

ATLAS traces RLVR data to 20 atomic sources, most datasets are variants, and DAPO++ curated with SCA improves RLVR performance while Q predicts training effectiveness.

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

The paper introduces a multi-turn interactive benchmark using 474 executable games to evaluate LLMs on evidence acquisition, belief updating, contextual robustness, and metacognitive adaptation, revealing large performance gaps and sensitivity to perturbations.

ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

ARBITER models reasoning trajectory basins in test-time sampling and uses model-internal signals to correct majority-vote failures, recovering part of the oracle gap on math benchmarks.

CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

CurveRL derives a quantile-coordinate reweighting rule from a utility functional on pass rates and shows it outperforms GRPO on reasoning benchmarks.

X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

cs.LG · 2026-05-20 · conditional · novelty 7.0

X-Token proposes projection-guided P-KL and H-KL losses to fix uncommon-token suppression and over-conservative matching in logit-based cross-tokenizer distillation, yielding gains over GOLD on Llama-3.2-1B.

CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

cs.CL · 2026-05-19 · unverdicted · novelty 7.0

CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.

Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

TTRL gains are reinterpreted as mostly sharpening rather than learning, with an identified extinction window causing net corruption; TTRL-Guard mitigates via FRS, MPS, and RCSU for improved pass@1.

Learning How to Cube

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

A neuro-symbolic post-training pipeline lets a 4B transformer learn cubing heuristics that reach pass@5 of 53 on 100 SAT competition instances, matching the strongest symbolic baseline.

citing papers explorer

Showing 50 of 367 citing papers.

Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
cs.CL · 2026-04-11 · unverdicted · none · ref 30 · internal anchor
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.

StaRPO: Stability-Augmented Reinforcement Policy Optimization
cs.AI · 2026-04-10 · unverdicted · none · ref 14 · internal anchor
StaRPO improves LLM reasoning by adding autocorrelation function and path efficiency stability metrics to RL policy optimization, yielding higher accuracy and fewer logic errors on reasoning benchmarks.

Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?
cs.CL · 2026-04-09 · unverdicted · none · ref 7 · internal anchor
Larger differences in generator capability between chosen and rejected reasoning traces improve out-of-domain performance, while filtering pairs by sample-level quality deltas enables more data-efficient training.

The Role of Emotional Stimuli and Intensity in Shaping Large Language Model Behavior
cs.LG · 2026-04-07 · unverdicted · none · ref 5 · internal anchor
Positive emotional prompts improve LLM accuracy and reduce toxicity but increase sycophantic agreement, while negative emotions show the reverse pattern.

Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis
cs.AI · 2026-04-06 · unverdicted · none · ref 5 · internal anchor
A hypothesis-driven pipeline generates targeted hard math problems that drop Llama-3.3-70B-Instruct accuracy from 77% on MATH to as low as 45%.

JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
cs.CL · 2026-04-03 · unverdicted · none · ref 49 · internal anchor
JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.

SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation
cs.AI · 2026-03-23 · unverdicted · none · ref 69 · internal anchor
SOLAR introduces a self-optimizing agent using meta-learning on model weights and RL-driven strategy discovery for lifelong adaptation in LLMs, claiming superior performance on reasoning tasks across domains.

Attention Residuals
cs.CL · 2026-03-16 · unverdicted · none · ref 14 · internal anchor
Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.

LLM Reasoning with Process Rewards for Outcome-Guided Steps
cs.LG · 2026-02-08 · unverdicted · none · ref 26 · internal anchor
PROGRS uses outcome-conditioned centering on PRM scores to safely integrate process rewards into GRPO for improved Pass@1 on math benchmarks.

MiMo-V2-Flash Technical Report
cs.CL · 2026-01-06 · unverdicted · none · ref 23 · internal anchor
MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.

Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
cs.CL · 2025-12-29 · unverdicted · none · ref 24 · internal anchor
LLM-PeerReview ensembles LLMs by scoring responses with LLM-as-Judge and selecting the best via averaging or truth inference, beating Smoothie-Global by 6.9-7.3 points on four datasets.

NVIDIA Nemotron 3: Efficient and Open Intelligence
cs.CL · 2025-12-24 · unverdicted · none · ref 53 · internal anchor
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training
cs.LG · 2025-12-03 · unverdicted · none · ref 11 · internal anchor
DVPO learns token-level value distributions and uses asymmetric risk regularization to contract lower tails while expanding upper tails, outperforming PPO and GRPO under noisy supervision in dialogue, math, and QA tasks.

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
cs.CV · 2025-11-06 · unverdicted · none · ref 19 · internal anchor
Video generation models demonstrate competitive multimodal reasoning on a new benchmark, matching or exceeding VLMs on visual puzzles and achieving 92% on MATH and 69.2% on MMMU.

GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
cs.LG · 2025-10-27 · unverdicted · none · ref 10 · internal anchor
GIFT matches the optimal policy of GRPO using an endogenous prompt-dependent KL coefficient derived via z-score standardization of implicit rewards.

A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning
cs.LG · 2025-10-21 · unverdicted · none · ref 9 · 2 links · internal anchor
SePT alternates self-generation of responses at controlled temperatures with training on the latest model outputs, yielding gains over a strong no-training baseline on six math reasoning benchmarks.

EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget
cs.CL · 2025-10-07 · unverdicted · none · ref 7 · internal anchor
EEPO uses sample-then-forget rollouts with adaptive unlearning to boost exploration in RLVR, delivering relative gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base over GRPO across five reasoning benchmarks.

BoHA: Blockwise Hadamard Product Adaptation for Parameter-Efficient Fine-Tuning
cs.LG · 2025-09-25 · unverdicted · none · ref 17 · internal anchor
BoHA partitions frozen weights into a b by b grid and applies independent low-rank Hadamard factors per block, outperforming LoRA on matched-budget single-task averages while retaining 57.66% first-stage accuracy in a commonsense-to-arithmetic continual-learning test on Llama-3.2-3B.

Early Stopping Chain-of-thoughts in Large Language Models
cs.CL · 2025-09-17 · conditional · none · ref 6 · internal anchor
ES-CoT shortens LLM chain-of-thought generation by tracking runs of identical step answers after linguistic markers, cutting tokens 16% on average while keeping accuracy comparable to full CoT across six datasets and three models.

Self-Aligned Reward: Towards Effective and Efficient Reasoners
cs.LG · 2025-09-05 · unverdicted · none · ref 18 · internal anchor
Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.

Proximal Supervised Fine-Tuning
cs.LG · 2025-08-25 · unverdicted · none · ref 12 · internal anchor
PSFT modifies supervised fine-tuning by incorporating trust-region ideas from RL to constrain policy changes, yielding better out-of-domain generalization in math and human-value tasks without entropy collapse.

Kimi K2: Open Agentic Intelligence
cs.LG · 2025-07-28 · unverdicted · none · ref 25 · internal anchor
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

A Scalable Multi-LLM Collaboration System with Retrieval-based Selection and Exploration-Exploitation-Driven Enhancement
cs.CL · 2025-07-14 · unverdicted · none · ref 24 · internal anchor
SMCS coordinates 15 open-source LLMs via retrieval-based prior selection and exploration-exploitation posterior enhancement, outperforming GPT-4.1 by 5.36% and GPT-o3-mini by 5.28% on eight benchmarks.

Not All Tokens Matter: Towards Efficient LLM Reasoning via Token Significance in Reinforcement Learning
cs.LG · 2025-06-09 · unverdicted · none · ref 4 · internal anchor
Proposes token-significance and dynamic length rewards in RL to reduce LLM response length while preserving or improving reasoning correctness across benchmarks.

Training and Evaluating Language Models with Template-based Data Generation
cs.CL · 2024-11-27 · unverdicted · none · ref 6 · internal anchor
TDG uses GPT-4 to generate meta-templates that synthesize over 7 million verifiable grade school math problems for training and aligning LLMs on reasoning tasks.

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
cs.SE · 2024-01-25 · unverdicted · none · ref 10 · internal anchor
DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.

Mixtral of Experts
cs.LG · 2024-01-08 · unverdicted · none · ref 17 · internal anchor
Mixtral 8x7B is a sparse MoE LLM activating 2 of 8 experts per layer that matches or exceeds Llama 2 70B and GPT-3.5 on benchmarks while using only 13B active parameters.

Mistral 7B
cs.CL · 2023-10-10 · accept · none · ref 13 · internal anchor
Mistral 7B is a 7B-parameter LLM that outperforms Llama 2 13B across benchmarks via grouped-query attention and sliding-window attention while remaining efficient.

PaLM 2 Technical Report
cs.CL · 2023-05-17 · unverdicted · none · ref 177 · internal anchor
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

Galactica: A Large Language Model for Science
cs.CL · 2022-11-16 · unverdicted · none · ref 173 · internal anchor
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

BV-Blend: Uncertainty-Weighted Historical Baselines for Stable Critic-Free RL with Verifiable Rewards
cs.AI · 2026-06-27 · unverdicted · none · ref 60 · internal anchor
BV-Blend blends prompt-local and semantic-cluster historical reward statistics via SEM-derived weights to stabilize critic-free RL advantage estimation.

HyperDFlash: Hyper-Connection-Aligned Block Speculative Decoding with Gated Residual Reduction
cs.LG · 2026-06-25 · unverdicted · none · ref 13 · internal anchor
HyperDFlash improves speculative decoding for hyper-connection LLMs via pre-collapse residual conditioning and a lightweight gated reducer from the target hc_head, outperforming MTP and DFlash in draft acceptance and speedup.

Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation
cs.CL · 2026-05-30 · unverdicted · none · ref 14 · internal anchor
DASD dynamically selects tokens in self-distillation to keep logical corrections while suppressing stylistic noise, improving robustness on math, code, and commonsense benchmarks.

MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution
cs.LG · 2026-05-29 · unverdicted · none · ref 18 · internal anchor
MetaEvo is a two-stage framework using preference optimization for principle abstraction followed by modular reuse to enable continual improvement of LLM agents on reasoning tasks.

Foundation-Preserving Adaptation via Generalized Rayleigh-Quotient Optimization
cs.LG · 2026-05-28 · unverdicted · none · ref 20 · internal anchor
FoLoRA applies generalized Rayleigh-quotient optimization to LoRA updates so that directions are gated by downstream utility divided by a pretraining-proxy forgetting penalty.

DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes
cs.AI · 2026-05-27 · unverdicted · none · ref 10 · internal anchor
DenoiseRL optimizes recovery from noisy prefixes in weak-model reasoning failures to improve performance and self-correction on math and general reasoning benchmarks without external supervision.

PowLU: An Activation Function for Stable Pre-Training of LLMs
cs.CL · 2026-05-25 · unverdicted · none · ref 9 · internal anchor
PowLU replaces SwiGLU with a rational-power activation to reduce outlier amplification and numerical instability during large-scale LLM pre-training while matching performance.

Rethinking Wireless Communications through Formal Mathematical AI Reasoning
eess.SP · 2026-04-28 · unverdicted · none · ref 24 · internal anchor
Proposes a three-layer framework using formal AI reasoning for verification, derivation, and discovery in wireless communications theory.

EMS: Multi-Agent Voting via Efficient Majority-then-Stopping
cs.AI · 2026-04-03 · unverdicted · none · ref 27 · internal anchor
EMS reduces the average number of agents invoked for majority voting by 32% via reliability-aware prioritization and early stopping on six benchmarks.

Ministral 3
cs.CL · 2026-01-13 · unverdicted · none · ref 9 · internal anchor
Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.

Self-Consistency Is Losing Its Edge: Diminishing Returns and Rising Costs in Modern LLMs
cs.AI · 2025-11-02 · unverdicted · none · ref 2 · internal anchor
Empirical evaluation on Gemini 2.5 models shows self-consistency yields only 0.4% gain on HotpotQA and 1.6% on MATH-500 across 20 samples while token costs scale linearly, with performance plateauing or declining at higher counts.

Sharpness-Guided Group Relative Policy Optimization via Probability Shaping
cs.LG · 2025-10-29 · unverdicted · none · ref 12 · internal anchor
GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.

When AI Evaluates Its Own Work: Validating Learner-Initiated, AI-Generated Physics Practice Problems
physics.ed-ph · 2025-08-05 · conditional · none · ref 59 · internal anchor
An exploratory study finds that a curated core of automated checks on learner-initiated AI-generated physics problems aligns with expert judgments and student preferences.

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
cs.AI · 2025-04-28 · accept · none · ref 91 · internal anchor
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

Learning to Reason at the Frontier of Learnability
cs.LG · 2025-02-17 · unverdicted · none · ref 10 · internal anchor
A curriculum sampling questions with high variance in success rate improves reinforcement learning performance for LLM reasoning tasks.

Qwen2.5-Coder Technical Report
cs.CL · 2024-09-18 · unverdicted · none · ref 21 · internal anchor
Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.

SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration
physics.chem-ph · 2024-09-03 · unverdicted · none · ref 79 · internal anchor
SmileyLlama is an LLM transformed via SFT and DPO to generate valid novel drug-like molecules with user-specified properties and optimized 3D conformations for high binding affinity.

Yi: Open Foundation Models by 01.AI
cs.CL · 2024-03-07 · unverdicted · none · ref 30 · internal anchor
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
cs.CL · 2024-01-05 · unverdicted · none · ref 127 · internal anchor
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

Baichuan 2: Open Large-scale Language Models
cs.CL · 2023-09-19 · unverdicted · none · ref 30 · internal anchor
Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.
