super hub Baseline reference

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Hyung Won Chung, Mirac Suzgun, Nathan Scales, Sebastian Gehrmann, Yi Tay · 2022 · cs.CL · arXiv 2210.09261

Baseline reference. 55% of citing Pith papers use this work as a benchmark or comparison.

104 Pith papers citing it

Baseline 55% of classified citations

open full Pith review browse 104 citing papers more from Hyung Won Chung arXiv PDF

abstract

BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models? In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the task for which prior language model evaluations did not outperform the average human-rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of language models, which is better captured via CoT prompting. As further analysis, we explore the interaction between CoT and model scale on BBH, finding that CoT enables emergent task performance on several BBH tasks with otherwise flat scaling curves.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 12 background 9 method 1

citation-polarity summary

use dataset 12 background 10

claims ledger

abstract BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models? In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we

authors

Hyung Won Chung Mirac Suzgun Nathanael Sch\"arli Nathan Scales Sebastian Gehrmann Yi Tay

co-cited works

representative citing papers

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

PAL: Program-aided Language Models

cs.CL · 2022-11-18 · conditional · novelty 8.0

PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.

Large Language Model Selection with Limited Annotations

cs.CL · 2026-05-24 · unverdicted · novelty 7.0

SELECT-LLM is the first active model selection framework for LLMs that uses expected information gain from pairwise output similarities to minimize required annotations, reporting up to 84.78% cost reduction across 23 datasets and 156 models.

Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

ReElicit uses LLMs to elicit adaptive feature embeddings for Gaussian process Bayesian optimization of system prompts under aggregate-only feedback, outperforming baselines across ten tasks with a 30-evaluation budget.

Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

cs.AI · 2026-05-07 · conditional · novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

Logic-Regularized Verifier Elicits Reasoning from LLMs

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.

CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

cs.CV · 2026-05-07 · conditional · novelty 7.0

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

Verifiable Counterfactual Supervision for Process Reward Models

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

Presents verifiable counterfactual process supervision that generates annotated trajectories via template-aware error injection on symbolic chains, improving Best-of-8 reranking on logical reasoning benchmarks with preliminary math transfer.

Large Language Models Exhibit Normative Conformity

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

Large language models exhibit normative conformity in addition to informational conformity, and subtle social context can direct which group they conform to.

Weak-Link Optimization for Multi-Agent Reasoning and Collaboration

cs.AI · 2026-04-17 · unverdicted · novelty 7.0

WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.

MARS: Enabling Autoregressive Models Multi-Token Generation

cs.CL · 2026-04-08 · unverdicted · novelty 7.0

MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.

PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses

cs.CL · 2026-03-11 · unverdicted · novelty 7.0

PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.

Scaling Latent Reasoning via Looped Language Models

cs.CL · 2025-10-29 · unverdicted · novelty 7.0

Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

cs.CL · 2025-02-04 · unverdicted · novelty 7.0

KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

cs.LG · 2024-01-19 · conditional · novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers

cs.CL · 2023-09-15 · unverdicted · novelty 7.0

EvoPrompt uses LLMs to run evolutionary operators on populations of prompts, outperforming human-engineered prompts by up to 25% on BIG-Bench Hard tasks across 31 datasets.

Large Language Models as Optimizers

cs.LG · 2023-09-07 · unverdicted · novelty 7.0

Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-designed baselines.

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

cs.CL · 2022-11-22 · unverdicted · novelty 7.0

PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

What LLMs explain is not what they believe: Evaluating explanation sufficiency under models' own input beliefs

cs.LG · 2026-06-26 · unverdicted · novelty 6.0

Proposes SCSuff metric for evaluating LLM explanation sufficiency via model-generated alternative inputs, showing explanations are typically insufficient and predictable from hidden states.

The Self-Correction Illusion: LLMs Correct Others but Not Themselves

cs.AI · 2026-06-04 · conditional · novelty 6.0

Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-template artifact rather than a cognitive deficit.

citing papers explorer

Showing 50 of 104 citing papers.

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models cs.CV · 2025-04-14 · conditional · none · ref 113 · internal anchor
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention cs.CL · 2025-02-16 · unverdicted · none · ref 73 · internal anchor
NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.
Process Reinforcement through Implicit Rewards cs.LG · 2025-02-03 · conditional · none · ref 131 · internal anchor
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 225 · internal anchor
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization cs.CL · 2024-11-15 · conditional · none · ref 89 · internal anchor
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models cs.AI · 2024-08-01 · conditional · none · ref 271 · internal anchor
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence cs.SE · 2024-06-17 · unverdicted · none · ref 23 · internal anchor
An open-source MoE code model matches GPT-4 Turbo on coding and math benchmarks while expanding to 338 languages and 128K context length.
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark cs.CL · 2024-06-03 · conditional · none · ref 33 · internal anchor
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 125 · internal anchor
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models cs.CL · 2024-02-05 · unverdicted · none · ref 43 · internal anchor
DeepSeekMath 7B reaches 51.7% on MATH via continued pretraining on curated web math data and Group Relative Policy Optimization.
Gemini: A Family of Highly Capable Multimodal Models cs.CL · 2023-12-19 · conditional · none · ref 100 · internal anchor
Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning cs.CL · 2023-09-11 · conditional · none · ref 41 · internal anchor
MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
Simple synthetic data reduces sycophancy in large language models cs.CL · 2023-08-07 · unverdicted · none · ref 40 · internal anchor
Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.
Orca: Progressive Learning from Complex Explanation Traces of GPT-4 cs.CL · 2023-06-05 · conditional · none · ref 11 · internal anchor
A 13B model called Orca learns detailed reasoning from GPT-4 explanation traces and reaches parity with ChatGPT on Big-Bench Hard while outperforming other 13B models.
Teaching Large Language Models to Self-Debug cs.CL · 2023-04-11 · unverdicted · none · ref 126 · internal anchor
Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
BloombergGPT: A Large Language Model for Finance cs.LG · 2023-03-30 · conditional · none · ref 111 · internal anchor
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
ART: Automatic multi-step reasoning and tool-use for large language models cs.CL · 2023-03-16 · unverdicted · none · ref 139 · internal anchor
ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning cs.AI · 2023-01-31 · conditional · none · ref 55 · internal anchor
The Flan Collection demonstrates that task balancing, data enrichment, and mixed prompt training are critical to effective instruction tuning, yielding stronger Flan-T5 models released publicly.
Large Language Models Are Human-Level Prompt Engineers cs.LG · 2022-11-03 · unverdicted · none · ref 33 · internal anchor
APE generates instruction candidates via LLM and selects the best by zero-shot performance of a second LLM, matching or beating human prompts on 19 of 24 NLP tasks.
Emergent Abilities of Large Language Models cs.CL · 2022-06-15 · unverdicted · none · ref 81 · internal anchor
Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety cs.CL · 2026-06-26 · unverdicted · none · ref 31 · internal anchor
Yuvion LLM applies adversarially aware training and introduces the YLRE benchmark set, claiming superior safety robustness over larger models on multiple tasks.
Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling cs.LG · 2026-05-26 · unverdicted · none · ref 14 · internal anchor
Dense2MoE unifies pruning of attention modules with upcycling of MLPs into MoE experts to produce on-device LLMs that improve the latency-accuracy Pareto frontier.
GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training cs.LG · 2026-05-25 · unverdicted · none · ref 22 · internal anchor
GAC derives adaptive mixing weights for SFT-RL hybrid post-training from online gradient variance and signal disagreement estimates, improving benchmark performance over fixed schedules with under 1% overhead.
The Efficiency Gap in Byte Modeling cs.LG · 2026-05-13 · unverdicted · none · ref 42 · internal anchor
Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models cs.CL · 2026-05-11 · unverdicted · none · ref 29 · internal anchor
ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.
Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability cs.AI · 2026-05-11 · unverdicted · none · ref 27 · internal anchor
A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training cs.LG · 2026-05-09 · unverdicted · none · ref 58 · 2 links · internal anchor
Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.
Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling cs.AI · 2026-05-02 · unverdicted · none · ref 11 · internal anchor
Multi-agent debate and mixture-of-agents outperform self-consistency by 1.3 and 2.7 percentage points respectively at equal compute budgets on MMLU-Pro and BBH, with advantages that continue at higher scales while self-consistency saturates.
SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference cs.LG · 2026-04-24 · unverdicted · none · ref 32 · internal anchor
SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and dual quantization paths.
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods cs.LG · 2026-04-19 · unverdicted · none · ref 40 · internal anchor
ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs cs.LG · 2026-04-08 · unverdicted · none · ref 70 · internal anchor
Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.
Attention Residuals cs.CL · 2026-03-16 · unverdicted · none · ref 49 · internal anchor
Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm cs.CV · 2025-11-06 · unverdicted · none · ref 31 · internal anchor
Video generation models demonstrate competitive multimodal reasoning on a new benchmark, matching or exceeding VLMs on visual puzzles and achieving 92% on MATH and 69.2% on MMMU.
Kimi K2: Open Agentic Intelligence cs.LG · 2025-07-28 · unverdicted · none · ref 71 · internal anchor
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model cs.CL · 2025-02-04 · unverdicted · none · ref 110 · internal anchor
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models cs.CV · 2024-08-09 · unverdicted · none · ref 77 · internal anchor
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence cs.SE · 2024-01-25 · unverdicted · none · ref 16 · internal anchor
DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cs.CL · 2024-01-11 · unverdicted · none · ref 96 · internal anchor
DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
Mixtral of Experts cs.LG · 2024-01-08 · unverdicted · none · ref 29 · internal anchor
Mixtral 8x7B is a sparse MoE LLM activating 2 of 8 experts per layer that matches or exceeds Llama 2 70B and GPT-3.5 on benchmarks while using only 13B active parameters.
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration cs.CL · 2023-11-07 · unverdicted · none · ref 57 · internal anchor
mPLUG-Owl2 presents a modular MLLM architecture that enables modality collaboration via shared functional modules and modality-adaptive components, achieving SOTA on both text and multi-modal tasks with one generic model.
Mistral 7B cs.CL · 2023-10-10 · accept · none · ref 23 · internal anchor
Mistral 7B is a 7B-parameter LLM that outperforms Llama 2 13B across benchmarks via grouped-query attention and sliding-window attention while remaining efficient.
PaLM 2 Technical Report cs.CL · 2023-05-17 · unverdicted · none · ref 141 · internal anchor
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
Galactica: A Large Language Model for Science cs.CL · 2022-11-16 · unverdicted · none · ref 91 · 2 links · internal anchor
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead cs.LG · 2025-07-30 · unverdicted · none · ref 76 · internal anchor
Human tests should not be applied to AI to measure traits like intelligence due to calibration, validity, contamination, and prompt sensitivity issues; develop AI-specific evaluation frameworks instead.
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review cs.AI · 2025-04-28 · accept · none · ref 131 · internal anchor
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model cs.CV · 2025-02-14 · unverdicted · none · ref 239 · internal anchor
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
Yi: Open Foundation Models by 01.AI cs.CL · 2024-03-07 · unverdicted · none · ref 76 · internal anchor
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism cs.CL · 2024-01-05 · unverdicted · none · ref 110 · internal anchor
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
Baichuan 2: Open Large-scale Language Models cs.CL · 2023-09-19 · unverdicted · none · ref 65 · internal anchor
Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.
Mellum2 Technical Report cs.CL · 2026-05-29 · unverdicted · none · ref 70 · internal anchor
Mellum 2 is a 12B MoE model with 2.5B active parameters, trained on 10.6T tokens with MoE, GQA, SWA, and MTP, then post-trained into Instruct and Thinking variants, claimed competitive with 4B-14B models at 2.5B compute.

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer