Data and Evaluation Closed-Loop for Model Capability Enhancement

Han Xu; Jiangan Yuan; Zhixuan Li

arxiv: 2606.28471 · v1 · pith:NDEWNSAFnew · submitted 2026-06-26 · 💻 cs.AI

Zhixuan Li, Jiangan Yuan, Han Xu

Pith reviewed 2026-06-30 01:29 UTC · model grok-4.3

classification 💻 cs.AI

keywords: capability slice, closed loop, model capability, evaluation taxonomy, data taxonomy, targeted data intervention, benchmark diagnosis, LLM pre-training

The pith

Capability slices turn benchmark failures into targeted, testable data interventions via a closed loop of taxonomies and mapping rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that linking evaluation results to data choices in LLM training can be made systematic rather than intuitive by defining capability slices as groups of samples that share background condition, task type, solving operation, and output constraint. A sympathetic reader would care because current practice compresses evaluation into noisy scores and leaves the engineer to guess which data change will fix an observed weakness. The authors supply an evaluation taxonomy, a non-instruction data taxonomy, and explicit mapping rules that together form a closed loop. They demonstrate the loop on two opposing cases: one correctly rules the data out as the cause of a benchmark drop, and the other correctly rules data in by selecting samples that address specific failing operations. The unmodified loop produces auditable, experimentally validated diagnoses in both directions.

Core claim

The capability slice is a group of evaluation samples sharing background condition, task type, solving operation, and output constraint. Built around this unit, an evaluation taxonomy, a non-instruction data taxonomy, and mapping rules form a closed loop turning a benchmark-level failure into a targeted, testable data intervention. The loop was tested on two case studies pulling in opposite directions. First, it rules the data out: continued pre-training drives BBH down (-46.82 percent), but diagnosis traces this to a single masked EOS loss rather than weakened reasoning; restoring it recovers BBH to 66.44 without changing the data. Second, it rules the data in: a persistent math-reasoning weakness is decomposed by solving operation into specific failing combinations, and a weakness-targeted sampling procedure built from it lifts AIME2025/AIME2026 Pass@128 from 6.67/0.00 to 26.67 each.

What carries the argument

The capability slice, a group of evaluation samples that share background condition, task type, solving operation, and output constraint, which localizes a single weakness precisely enough for mapping to data while remaining stable under aggregation.
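
To make the slice definition concrete, the following is a minimal Python sketch of grouping per-sample results by the four shared attributes and reporting accuracy per slice. The class name `SliceKey`, the function `slice_accuracy`, and all label values are hypothetical illustrations, not the paper's taxonomy or code. The sketch also assumes one solving operation per sample, whereas the paper's figures analyze operation pairs and triples as well.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceKey:
    """The four shared attributes that define a capability slice."""
    background_condition: str
    task_type: str
    solving_operation: str
    output_constraint: str


def slice_accuracy(samples):
    """Group (key, correct) pairs by slice and return {key: (accuracy, n)}."""
    totals = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_samples]
    for key, correct in samples:
        totals[key][0] += int(correct)
        totals[key][1] += 1
    return {key: (c / n, n) for key, (c, n) in totals.items()}


# Hypothetical labels, chosen only to illustrate the grouping.
mcq = SliceKey("none", "mcq_single", "option_elimination", "format_rigidity=3")
free = SliceKey("none", "open_answer", "arithmetic_computation", "format_rigidity=1")
results = [(mcq, False), (mcq, False), (mcq, True), (free, True), (free, True)]
report = slice_accuracy(results)
```

Reporting the sample count next to each accuracy matters for the stability requirement: a slice with very few samples behaves like the "single sample, too noisy" case the paper warns about.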

If this is right

A drop in BBH after continued pre-training can be diagnosed as caused by masked EOS loss rather than by the training data itself.
Restoring the masked loss recovers BBH performance above the original checkpoint without any data change.
Decomposing a math-reasoning weakness by solving operation identifies specific failing combinations that can be targeted by sampling.
The same loop can correctly rule data in or out depending on the actual cause of the observed failure.
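
The first consequence above depends on how a masked loss position behaves. The sketch below shows the general mechanism only, under stated assumptions: the token id `EOS = 0`, the helper names, and the probabilities are hypothetical, and the paper's actual training setup is not reproduced here. When the end-of-sequence position is excluded from the loss mask, a model that assigns low probability to stopping incurs no penalty for it.

```python
import math

EOS = 0  # hypothetical token id for the end-of-sequence token


def build_mask(token_ids, mask_eos):
    """Return a 0/1 loss mask; with mask_eos=True the EOS position is dropped."""
    return [0 if (mask_eos and t == EOS) else 1 for t in token_ids]


def masked_nll(token_probs, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1."""
    kept = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    return sum(kept) / len(kept)


tokens = [5, 9, 7, EOS]
# Probability the model assigns to each target token; stopping is unlikely.
probs = [0.9, 0.8, 0.9, 0.01]

loss_eos_masked = masked_nll(probs, build_mask(tokens, mask_eos=True))
loss_eos_counted = masked_nll(probs, build_mask(tokens, mask_eos=False))
```

With the EOS position masked, the loss looks healthy even though the model almost never stops. That would be consistent with a failure that shows up on strict-output benchmarks rather than as weakened reasoning, although the link to the paper's specific diagnosis is an inference here.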

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The closed loop could be run repeatedly during training so that each evaluation round directly informs the next data selection round.
The approach might allow smaller, more focused data additions to replace large-scale random corpus growth when addressing specific weaknesses.
If the taxonomies are extended to other modalities or model families, the same evaluation-to-data mapping could apply beyond text-only LLMs.

Load-bearing premise

The capability slice must be precise enough to localize one weakness yet stable enough to survive aggregation, rather than too coarse like a benchmark name or too noisy like a single sample.

What would settle it

Apply the closed loop to a new benchmark failure, execute the recommended data intervention or non-intervention, and check whether the targeted capability changes exactly as the diagnosis predicted.

Figures

Figures reproduced from arXiv: 2606.28471 by Han Xu, Jiangan Yuan, Zhixuan Li.

**Figure 1.** Overview of the analysis toolkit. The toolkit translates noisy benchmark-level observations into structured …

**Figure 2.** Character-length quantiles of resps and filtered_resps on the audited BBH subset, for the warm-start and base checkpoints. Accompanying text: This characterization confirms that, by construction, BBH is a strict-output benchmark: 87.04% of its instances carry the highest format-rigidity level (format_rigidity = 3), all instances carry the highest exactness requirement (exactness_requirement = 3), and 93.67% are scored by exa…

**Figure 3.** Character-length quantiles of resps and filtered_resps on the audited BBH subset, for the warm-start, base, and exp checkpoints. Accompanying text: … on the subsets most affected by the parser-sensitive failure mode identified earlier: accuracy on mcq_single examples rises from 21.18% to 66.28%, accuracy on closed_set examples rises from 25.86% to 69.09%, and accuracy on the dominant high-rigidity subset (format_rigidity = 3) …

**Figure 4.** Operation-level accuracy change on the standard-suite mathematical evaluation set …

**Figure 5.** Operation-level accuracy change over the top-10 highest-scoring operation pairs in …

**Figure 6.** Operation-level accuracy change over the top-10 highest-scoring operation triples in …

**Figure 7.** Operation-level accuracy change on MATH500 under the zero-shot Pass@128 protocol, over the same top-7 …

**Figure 8.** Operation-level accuracy change on MATH500 under Pass@128, over the same top-10 operation pairs as in …

**Figure 9.** Operation-level accuracy change on MATH500 under Pass@128, over the same top-10 operation triples as in …

**Figure 10.** Pairwise Composition Effect of arithmetic_computation

**Figure 11.** Pairwise Composition Effect of concept_alignment. Plot text from the adjacent panel: axis pair(u; oi); panel title multi_hop_composition | acc=40.24%.

**Figure 12.** Pairwise Composition Effect of multi_hop_composition

**Figure 13.** Pairwise Composition Effect of equation_formulation. Plot text from the adjacent panel: axis pair(u; oi); panel title symbolic_transformation | acc=38.75%.

**Figure 14.** Pairwise Composition Effect of symbolic_transformation

**Figure 15.** Pairwise Composition Effect of fact_recall. Plot text from the adjacent panel: axis pair(u; oi); panel title constraint_tracking | acc=24.39%.

**Figure 16.** Pairwise Composition Effect of constraint_tracking

**Figure 17.** Pairwise Composition Effect of counting. Plot text from the adjacent panel: axis pair(u; oi); panel title boundary_case_reasoning | acc=25.74%.

**Figure 18.** Pairwise Composition Effect of boundary_case_reasoning

**Figure 19.** Pairwise Composition Effect of compare_or_rank. Plot text from the adjacent panel: axis C(u; oi, oj) over operation pairs.

**Figure 20.** Non-additive Composition Effect of arithmetic_computation

**Figure 21.** Non-additive Composition Effect of concept_alignment. Plot text from the adjacent panel: axis C(u; oi, oj) over operation pairs.

**Figure 22.** Non-additive Composition Effect of multi_hop_composition

**Figure 23.** Non-additive Composition Effect of equation_formulation. Plot text from the adjacent panel: axis C(u; oi, oj) over operation pairs.

**Figure 24.** Non-additive Composition Effect of symbolic_transformation

**Figure 25.** Non-additive Composition Effect of fact_recall. Plot text from the adjacent panel: axis C(u; oi, oj) over operation pairs.

**Figure 26.** Non-additive Composition Effect of constraint_tracking

**Figure 27.** Non-additive Composition Effect of counting. Plot text from the adjacent panel: axis C(u; oi, oj) over operation pairs.

**Figure 28.** Non-additive Composition Effect of boundary_case_reasoning

**Figure 29.** Non-additive Composition Effect of compare_or_rank

Original abstract

Model capability is the central variable in LLM pre-training, yet is never observed directly: data shapes it prospectively, while evaluation reveals it only retrospectively, compressing samples, prompts, decoding, and scoring rules into one noisy score. Practical optimization runs this backward: a failure is observed first, and the engineer must infer the corpus fix. The two sides speak incompatible vocabularies (benchmark names and per-sample correctness versus data sources, domains, and quality labels), so this inference is usually intuition, not method. We close this gap with the capability slice: a group of evaluation samples sharing background condition, task type, solving operation, and output constraint, precise enough to localize a single weakness yet stable enough to survive aggregation, unlike a benchmark name (too coarse) or a single sample (too noisy). Built around this unit, an evaluation taxonomy, a non-instruction data taxonomy, and mapping rules form a closed loop turning a benchmark-level failure into a targeted, testable data intervention. We test this loop on two case studies pulling in opposite directions. First, the loop rules the data out: continued pre-training drives BBH down (-46.82%), but diagnosis traces this to a single masked <EOS> loss rather than weakened reasoning; restoring it recovers BBH to 66.44, above the original checkpoint, without changing the data. Second, the loop rules the data in: a persistent math-reasoning weakness is decomposed by solving operation into specific failing combinations, and a weakness-targeted sampling procedure built from it lifts AIME2025/AIME2026 Pass@128 from 6.67/0.00 to 26.67 each. The same unmodified loop reaches opposite, correct verdicts in both cases, showing the evaluation-to-data inference can be routine, auditable, and experimentally validated rather than intuitive.
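
The abstract does not spell out the weakness-targeted sampling procedure. As a rough illustration only, the Python sketch below draws candidate documents without replacement using the Efraimidis-Spirakis weighted reservoir scheme (the reference list includes "Weighted random sampling with a reservoir"), with weights set to one minus a per-operation accuracy. The weighting rule, the function name `weighted_sample`, the accuracies, and the document tags are all assumptions made for this example.

```python
import heapq
import random


def weighted_sample(items, weights, k, rng):
    """Efraimidis-Spirakis A-Res: keep the k items with the largest u**(1/w)."""
    heap = []  # min-heap of (key, index); the smallest key is evicted first
    for idx, w in enumerate(weights):
        if w <= 0:
            continue  # zero-weight items are never selected
        key = rng.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, idx))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, idx))
    return [items[idx] for _, idx in heap]


# Hypothetical per-operation accuracies and candidate documents.
op_accuracy = {"constraint_tracking": 0.25, "fact_recall": 0.90}
docs = [
    ("doc_a", "constraint_tracking"),
    ("doc_b", "fact_recall"),
    ("doc_c", "constraint_tracking"),
    ("doc_d", "fact_recall"),
]
# Assumed weighting rule: weaker operations get proportionally more weight.
weights = [1.0 - op_accuracy[op] for _, op in docs]
picked = weighted_sample(docs, weights, k=2, rng=random.Random(0))
```

A single pass with a size-k heap keeps memory bounded, which is the usual reason to prefer a reservoir scheme when the candidate pool is a large corpus stream.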

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A structured eval-to-data loop with working case studies, but needs checks on slice stability.

The punchline for you is that this paper gives a workable method for going from an evaluation failure to a targeted data intervention without relying on intuition. They introduce capability slices as the key unit and build taxonomies and rules around it.

The paper does a good job showing the loop in action with two case studies that pull in opposite directions. One rules the data out after a pre-training run tanks BBH, tracing it to an EOS loss instead, and recovers the score. The other rules data in for math reasoning, breaking it down by solving operations and using targeted sampling to raise AIME2025 and 2026 pass rates from 6.67 and 0 to 26.67. That's solid evidence that the mapping can be validated experimentally.

What's new is the explicit combination of the slice definition, dual taxonomies, and rules to make the process auditable. Prior work is referenced, but none of it is presented as offering this structured technique.

The main concern is whether the slices are stable. The paper asserts they localize weaknesses precisely yet survive aggregation, but there are no numbers on agreement between annotators or how much the diagnosis changes with different groupings. The case studies succeed, but that doesn't prove the method is independent of the human choices made here.

This paper is for researchers and engineers working on LLM capability improvement through data. It deserves serious referee time because the framework addresses a real bottleneck with concrete examples, though additional checks on slice robustness would help.

Referee Report

2 major / 1 minor

Summary. The paper claims that a 'capability slice' (grouping evaluation samples by shared background condition, task type, solving operation, and output constraint) combined with an evaluation taxonomy, non-instruction data taxonomy, and mapping rules forms a closed loop that converts benchmark-level LLM failures into targeted, testable data interventions. This is demonstrated in two case studies reaching opposite but correct verdicts: one rules data out as the cause of a BBH drop (attributing it to masked EOS loss instead) and recovers performance without data changes; the other rules data in to improve AIME math reasoning via weakness-targeted sampling, lifting Pass@128 scores from 6.67/0.00 to 26.67.
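
For readers unfamiliar with the metric, the sketch below shows one common way a Pass@k score is computed: the unbiased estimator 1 - C(n-c, k)/C(n, k) for n samples with c correct, averaged over problems. Whether the paper uses this estimator or simply draws exactly 128 samples per problem is not stated here; when n equals k the two coincide, and a problem counts as solved if any sample is correct. The function names and the example counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, reported on a 0-100 scale."""
    scores = [pass_at_k(n, c, k) for c in correct_counts]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical run: three problems, 128 samples each, one problem solved.
score = benchmark_pass_at_k([0, 3, 0], n=128, k=128)
```

Because Pass@128 with 128 samples is an any-correct criterion, each problem contributes either 0 or 1, so benchmark scores on a small problem set move in coarse steps.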

Significance. If the central claim holds, the framework provides a systematic alternative to intuitive diagnosis of pre-training issues, with the experimental validation across opposing outcomes serving as a notable strength. The approach could support more auditable capability enhancement if the slice definitions prove stable and reproducible.

major comments (2)

[Abstract] The claim that capability slices are 'precise enough to localize a single weakness yet stable enough to survive aggregation' (unlike benchmark names or single samples) is load-bearing for the assertion that the loop is 'routine, auditable, and experimentally validated rather than intuitive,' yet no quantitative checks are reported, such as inter-annotator agreement on slice boundaries or sensitivity of diagnoses to small re-groupings of samples.
[Case studies] (as summarized in the Abstract) Neither case study reports data exclusion criteria, statistical controls, or full methods for defining and applying the capability slices and mapping rules, which limits assessment of whether the diagnoses are reproducible or dependent on the particular choices made in these examples.
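
The inter-annotator agreement check requested in the first major comment could take the form of Cohen's kappa over slice labels assigned independently by two annotators. The sketch below is one standard way to compute it, not a procedure from the paper; the function name and the example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of samples on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical solving-operation labels from two annotators.
ann_1 = ["counting", "counting", "fact_recall", "fact_recall"]
ann_2 = ["counting", "fact_recall", "fact_recall", "fact_recall"]
kappa = cohens_kappa(ann_1, ann_2)
```

Kappa corrects raw agreement for chance, which matters when a few operation labels dominate the evaluation set.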

minor comments (1)

[Abstract] The numerical results (e.g., BBH drop of -46.82% and recovery to 66.44) should be explicitly cross-referenced to the corresponding tables or figures in the main text for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments and the recommendation for major revision. We address each major comment point by point below, and agree to incorporate revisions to enhance the reproducibility and quantitative validation of the proposed framework.

Referee: [Abstract] The claim that capability slices are 'precise enough to localize a single weakness yet stable enough to survive aggregation' (unlike benchmark names or single samples) is load-bearing for the assertion that the loop is 'routine, auditable, and experimentally validated rather than intuitive,' yet no quantitative checks are reported, such as inter-annotator agreement on slice boundaries or sensitivity of diagnoses to small re-groupings of samples.

Authors: We agree that quantitative checks on slice stability would strengthen the central claim. The manuscript defines the slices via explicit taxonomies and rules, and the case studies apply them consistently to reach correct but opposing conclusions. However, no inter-annotator agreement or sensitivity analysis is currently reported. In the revised manuscript, we will add a dedicated subsection under the methodology that reports inter-annotator agreement on a sample of slices and sensitivity tests to small re-groupings.

Revision: yes

Referee: [Case studies] (as summarized in the Abstract) Neither case study reports data exclusion criteria, statistical controls, or full methods for defining and applying the capability slices and mapping rules, which limits assessment of whether the diagnoses are reproducible or dependent on the particular choices made in these examples.

Authors: The full definitions of the evaluation taxonomy, non-instruction data taxonomy, and mapping rules are provided in Sections 3-5 of the manuscript, along with the specific application to each case study. Data exclusion criteria are noted in the experimental details for BBH (e.g., filtering for certain task types) and AIME. That said, we acknowledge that the presentation could be more explicit regarding statistical controls and step-by-step application procedures. We will revise the case study sections to include expanded methods descriptions, explicit data exclusion criteria, and any statistical controls used.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via explicit definitions and empirical tests

The paper defines the capability slice explicitly as a group sharing background condition, task type, solving operation, and output constraint, then builds an evaluation taxonomy, non-instruction data taxonomy, and mapping rules around this unit to form the closed loop. The central claim—that the unmodified loop produces routine, auditable, opposite-correct verdicts—is supported by two independent case studies with concrete experimental outcomes (BBH recovery to 66.44 without data change; AIME Pass@128 lift to 26.67). No equations, fitted parameters, or predictions reduce by construction to inputs; no self-citations are invoked as load-bearing uniqueness theorems; the validation rests on external empirical results rather than tautology or renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

The framework rests on the introduction of three new conceptual entities and one core domain assumption about slice properties; no numeric free parameters are described in the abstract.

axioms (1)

domain assumption: Capability slices can be defined to be both precise enough to localize a single weakness and stable enough to survive aggregation.
This property is required for the slice to serve as the load-bearing unit of the closed loop.

invented entities (3)

capability slice (no independent evidence)
purpose: Group of evaluation samples sharing background condition, task type, solving operation, and output constraint.
Central new unit introduced to bridge evaluation and data.

evaluation taxonomy (no independent evidence)
purpose: Organize evaluation samples into capability slices.
Component of the closed-loop framework.

non-instruction data taxonomy (no independent evidence)
purpose: Categorize training data sources and quality for mapping.
Component of the closed-loop framework.

pith-pipeline@v0.9.1-grok · 5876 in / 1399 out tokens · 50577 ms · 2026-06-30T01:29:44.375342+00:00 · methodology

Reference graph

Works this paper leans on

31 extracted references · 11 canonical work pages · 8 internal anchors

[1] Real-time segmentation of on-line handwritten Arabic script. Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, 2014.
[2] Fast classification of handwritten on-line Arabic characters. Soft Computing and Pattern Recognition (SoCPaR), 2014 6th International Conference of, 2014.
[3] Estimate and Replace: A Novel Approach to Integrating Deep Neural Networks with Existing Applications. arXiv preprint arXiv:1804.09028.
[4] Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732.
[5] PIQA: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence.
[6] Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.
[7] Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.
[8] DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. Proceedings of NAACL-HLT.
[9] Weighted random sampling with a reservoir. Information Processing Letters.
[10] FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI. arXiv preprint arXiv:2411.04872.
[11] Measuring massive multitask language understanding. International Conference on Learning Representations.
[12] Measuring mathematical problem solving with the MATH dataset. Advances in Neural Information Processing Systems Datasets and Benchmarks Track.
[13] C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems.
[14] TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. Proceedings of ACL.
[15] RACE: Large-scale reading comprehension dataset from examinations. Proceedings of EMNLP.
[16] CMMLU: Measuring massive multitask language understanding in Chinese. Findings of ACL.
[17] DataComp-LM: In search of the next generation of training sets for language models. arXiv preprint arXiv:2406.11794.
[18] RegMix: Data mixture as regression for language model pre-training. arXiv preprint arXiv:2407.01492.
[19] The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv preprint arXiv:2406.17557.
[20] GPQA: A graduate-level Google-proof Q&A benchmark. First Conference on Language Modeling.
[21] Challenging BIG-Bench tasks and whether chain-of-thought can solve them. Findings of ACL.
[22] Stanford Alpaca: An instruction-following LLaMA model. 2023.
[23] Self-Instruct: Aligning language models with self-generated instructions. Proceedings of ACL.
[24] How far can camels go? Exploring the state of instruction tuning on open resources. Advances in Neural Information Processing Systems.
[25] MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. arXiv preprint arXiv:2406.01574.
[26] DoReMi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems.
[27] WizardLM: Empowering large language models to follow complex instructions. International Conference on Learning Representations.
[28] Data mixing laws: Optimizing data mixtures by predicting language modeling performance. arXiv preprint arXiv:2403.16952.
[29] HellaSwag: Can a machine really finish your sentence? Proceedings of ACL.
[30] AGIEval: A human-centric benchmark for evaluating foundation models. Findings of NAACL.
[31] Pavlos S. Efraimidis and Paul G. Spirakis. Weighted random sampling with a reservoir. 2006. doi:10.1016/j.ipl.2005.11.003.

[1] [1]

Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on , pages=

Real-time segmentation of on-line handwritten arabic script , author=. Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on , pages=. 2014 , organization=

2014

[2] [2]

Soft Computing and Pattern Recognition (SoCPaR), 2014 6th International Conference of , pages=

Fast classification of handwritten on-line Arabic characters , author=. Soft Computing and Pattern Recognition (SoCPaR), 2014 6th International Conference of , pages=. 2014 , organization=

2014

[3] [3]

Estimate and Replace: A Novel Approach to Integrating Deep Neural Networks with Existing Applications

Estimate and Replace: A Novel Approach to Integrating Deep Neural Networks with Existing Applications , author=. arXiv preprint arXiv:1804.09028 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Program Synthesis with Large Language Models

Program synthesis with large language models , author=. arXiv preprint arXiv:2108.07732 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Proceedings of the AAAI Conference on Artificial Intelligence , year=

PIQA: Reasoning about physical commonsense in natural language , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=

[6] [6]

Evaluating Large Language Models Trained on Code

Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word problems , author=. arXiv preprint arXiv:2110.14168 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Proceedings of NAACL-HLT , year=

DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs , author=. Proceedings of NAACL-HLT , year=

[9] [9]

Information Processing Letters , volume=

Weighted random sampling with a reservoir , author=. Information Processing Letters , volume=

[10] [10]

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI , author=. arXiv preprint arXiv:2411.04872 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

International Conference on Learning Representations , year=

Measuring massive multitask language understanding , author=. International Conference on Learning Representations , year=

[12] Measuring mathematical problem solving with the MATH dataset. Advances in Neural Information Processing Systems Datasets and Benchmarks Track.

[13] C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems.

[14] TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. Proceedings of ACL.

[15] RACE: Large-scale reading comprehension dataset from examinations. Proceedings of EMNLP.

[16] CMMLU: Measuring massive multitask language understanding in Chinese. Findings of ACL.

[17] DataComp-LM: In search of the next generation of training sets for language models. arXiv preprint arXiv:2406.11794.

[18] RegMix: Data mixture as regression for language model pre-training. arXiv preprint arXiv:2407.01492.

[19] The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557.

[20] GPQA: A graduate-level Google-proof Q&A benchmark. First Conference on Language Modeling.

[21] Challenging BIG-Bench tasks and whether chain-of-thought can solve them. Findings of ACL.

[22] Stanford Alpaca: An instruction-following LLaMA model. 2023.

[23] Self-Instruct: Aligning language models with self-generated instructions. Proceedings of ACL.

[24] How far can camels go? Exploring the state of instruction tuning on open resources. Advances in Neural Information Processing Systems.

[25] MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574.

[26] DoReMi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems.

[27] WizardLM: Empowering large language models to follow complex instructions. International Conference on Learning Representations.

[28] Data mixing laws: Optimizing data mixtures by predicting language modeling performance. arXiv preprint arXiv:2403.16952.

[29] HellaSwag: Can a machine really finish your sentence? Proceedings of ACL.

[30] AGIEval: A human-centric benchmark for evaluating foundation models. Findings of NAACL.

[31] Pavlos S. Efraimidis and Paul G. Spirakis. Weighted random sampling with a reservoir. 2006. doi:10.1016/j.ipl.2005.11.003