hub Canonical reference

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, Junxian He · 2023 · arXiv 2305.08322

Canonical reference. 71% of citing Pith papers cite this work as background.

19 Pith papers citing it

Background 71% of classified citations

read on arXiv browse 19 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 baseline 1 method 1

citation-polarity summary

background 5 baseline 1 use method 1

representative citing papers

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

cs.CV · 2023-10-03 · accept · novelty 8.0

MathVista benchmark shows GPT-4V achieves 49.9% accuracy on visual mathematical reasoning tasks, outperforming other models but trailing humans by 10.4%.

Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

cs.AI · 2024-10-06 · unverdicted · novelty 7.0

PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

ANDES: Agent Native Data Evolving Synthesis Tool for Autonomous Instruction Alignment

cs.AI · 2026-05-31 · unverdicted · novelty 6.0

ANDES equips AI agents with an interactive data-synthesis skill using World Tree routing to reach SOTA automated alignment on PostTrainBench under compute limits.

Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

cs.CL · 2026-04-07 · conditional · novelty 6.0

Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

cs.SE · 2024-06-17 · unverdicted · novelty 6.0

An open-source MoE code model matches GPT-4 Turbo on coding and math benchmarks while expanding to 338 languages and 128K context length.

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

cs.LG · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.

Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

NPD accelerates on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to achieve 68.73% SOTA score.

Kimi K2: Open Agentic Intelligence

cs.LG · 2025-07-28 · unverdicted · novelty 5.0

Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

Hallucination of Multimodal Large Language Models: A Survey

cs.CV · 2024-04-29 · accept · novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

InternLM2 Technical Report

cs.CL · 2024-03-26 · unverdicted · novelty 5.0

InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

cs.CL · 2024-01-11 · unverdicted · novelty 5.0

DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

cs.CL · 2023-11-09 · unverdicted · novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection

cs.CL · 2026-04-22 · unverdicted · novelty 4.0

Multilingual pooling for quality classifiers outperforms monolingual baselines in rank stability and accuracy for LLM pretraining data selection across high- and low-resource languages.

Yi: Open Foundation Models by 01.AI

cs.CL · 2024-03-07 · unverdicted · novelty 4.0

Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

cs.CL · 2024-01-05 · unverdicted · novelty 4.0

DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

Baichuan 2: Open Large-scale Language Models

cs.CL · 2023-09-19 · unverdicted · novelty 4.0

Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.

A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

cs.CL · 2024-09-10 · unverdicted · novelty 2.0

Empirical practice of continual pre-training Llama-3 models with optimized additional language mixture ratios to enhance Chinese capabilities, showing gains in benchmarks and domains like math and coding.

citing papers explorer

Showing 19 of 19 citing papers.

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts cs.CV · 2023-10-03 · accept · none · ref 4
MathVista benchmark shows GPT-4V achieves 49.9% accuracy on visual mathematical reasoning tasks, outperforming other models but trailing humans by 10.4%.
Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability cs.CL · 2026-05-12 · unverdicted · none · ref 44
A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark cs.AI · 2024-10-06 · unverdicted · none · ref 17
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cs.CL · 2024-05-07 · unverdicted · none · ref 25
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
ANDES: Agent Native Data Evolving Synthesis Tool for Autonomous Instruction Alignment cs.AI · 2026-05-31 · unverdicted · none · ref 12
ANDES equips AI agents with an interactive data-synthesis skill using World Tree routing to reach SOTA automated alignment on PostTrainBench under compute limits.
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion cs.CL · 2026-04-07 · conditional · none · ref 28
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence cs.SE · 2024-06-17 · unverdicted · none · ref 10
An open-source MoE code model matches GPT-4 Turbo on coding and math benchmarks while expanding to 338 languages and 128K context length.
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training cs.LG · 2026-05-09 · unverdicted · none · ref 34 · 2 links
Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.
Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing cs.LG · 2026-05-07 · unverdicted · none · ref 9
NPD accelerates on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to achieve 68.73% SOTA score.
Kimi K2: Open Agentic Intelligence cs.LG · 2025-07-28 · unverdicted · none · ref 31
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
Hallucination of Multimodal Large Language Models: A Survey cs.CV · 2024-04-29 · accept · none · ref 72
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
InternLM2 Technical Report cs.CL · 2024-03-26 · unverdicted · none · ref 182
InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cs.CL · 2024-01-11 · unverdicted · none · ref 25
DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions cs.CL · 2023-11-09 · unverdicted · none · ref 129
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection cs.CL · 2026-04-22 · unverdicted · none · ref 48
Multilingual pooling for quality classifiers outperforms monolingual baselines in rank stability and accuracy for LLM pretraining data selection across high- and low-resource languages.
Yi: Open Foundation Models by 01.AI cs.CL · 2024-03-07 · unverdicted · none · ref 33
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism cs.CL · 2024-01-05 · unverdicted · none · ref 26
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
Baichuan 2: Open Large-scale Language Models cs.CL · 2023-09-19 · unverdicted · none · ref 33
Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.
A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio cs.CL · 2024-09-10 · unverdicted · none · ref 30
Empirical practice of continual pre-training Llama-3 models with optimized additional language mixture ratios to enhance Chinese capabilities, showing gains in benchmarks and domains like math and coding.

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer