Epiphany-Aware KV Cache Eviction Without the Attention Matrix

Steven Kolawole; Virginia Smith

arxiv: 2606.26472 · v1 · pith:OUWKPRPMnew · submitted 2026-06-25 · 💻 cs.LG · cs.CL

Epiphany-Aware KV Cache Eviction Without the Attention Matrix

Steven Kolawole , Virginia Smith This is my paper

Pith reviewed 2026-06-26 05:45 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords KV cache evictionepiphany scorelong-context reasoningattention-free scoringFlashAttention compatibilityMATH-500 evaluationAIME-2024internal representation change

0 comments

The pith

Tokens for KV cache eviction can be ranked by the change in a model's internal representation during the forward pass, without ever computing attention weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that an epiphany score, computed as the shift in internal activations from one forward pass to the next, serves as a practical substitute for attention weights when deciding which tokens to drop from the KV cache. Because the score comes directly from the model's existing state, the method avoids materializing the full attention matrix and therefore works unchanged inside fused kernels such as FlashAttention. On MATH-500 with a 4096-token cache the resulting EpiKV policy reaches 72 percent accuracy, matching the best attention-based baselines while supporting contexts up to sixteen times longer. A lag-normalized variant further improves results on AIME-2024 at 8192 tokens. The approach requires no training and no custom kernels.

Core claim

EpiKV replaces attention-based token ranking with the epiphany score, defined as the change in the model's internal representation read directly from the forward pass. After subtracting a causal rolling z-score to remove positional trends and focusing on upper-mid layers, tokens with the lowest scores are evicted. This yields a cache-eviction policy that matches ThinKV (71 percent) and exceeds H2O (67 percent) at 72 percent on MATH-500 with a 4096-token budget, reaches 37 percent on AIME-2024 at 8192 tokens, and extends feasible context length by a factor of 16 relative to attention-matrix methods, all without training or custom kernels.

What carries the argument

The epiphany score: the change in the model's internal representation during the forward pass, used as a direct proxy for token importance that bypasses the attention matrix entirely.

If this is right

EpiKV integrates directly into existing FlashAttention inference pipelines without any code changes or extra kernels.
The method extends the usable context length for reasoning models by roughly sixteen times compared with attention-matrix eviction policies.
A lag-normalized version of the same score reaches 37 percent on AIME-2024 at 8192 tokens while running up to 2.8 times faster than the best attention baseline.
No additional training data or classifier is needed to achieve parity with attention-based methods on MATH-500.
The policy works at cache sizes where attention-based scoring becomes infeasible because the attention matrix cannot be materialized.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same internal-state delta could be monitored at additional layers or checkpoints to refine eviction decisions on even longer traces.
Because the signal is independent of attention, it may combine with hardware-specific kernel optimizations that currently cannot expose attention weights.
If the epiphany score proves stable across model families, it offers a path to cache management that does not require architecture-specific attention implementations.
The approach suggests testing whether similar activation-change signals can guide other memory-management decisions such as prompt compression or layer-wise pruning.

Load-bearing premise

Changes observed in the model's internal activations during a single forward pass reliably mark which tokens matter most for the eventual answer.

What would settle it

A controlled experiment in which EpiKV is run on the same long reasoning traces as ThinKV or H2O but produces measurably lower final accuracy once the cache size drops below the tested 4096-token limit.

Figures

Figures reproduced from arXiv: 2606.26472 by Steven Kolawole, Virginia Smith.

**Figure 2.** Figure 2: How the importance score is computed at one decode step. A single forward pass over the model’s [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Accuracy vs. cache budget on MATH-500 (left, [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Accuracy vs. wall-clock time per problem [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Per-layer Spearman ρ between rolling-64 hidden-state change and counterfactual importance. Band A (7–13) is positive and Band B (18–25) negative across both datasets and back-ends. is located by searching for the </think> token sequence, falling back to the last \boxed{ and then to the final 64 tokens. For each window the tokens are replaced with the padding id, the full modified context up to the boundary… view at source ↗

**Figure 6.** Figure 6: Maximum concurrent requests on an 80 GB GPU vs. context length. Without eviction, capacity falls as [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

read the original abstract

As reasoning models emit chains of thought tens of thousands of tokens long, KV cache increasingly becomes a deployment bottleneck. Existing cache eviction methods rank tokens by attention weight, which is a noisy importance proxy in long reasoning traces, and prohibits the use of fused kernels in production inference by forcing the model to materialize the attention matrix. In this work, we instead score tokens with a metric we term the epiphany score: the change in the model's internal representation, read directly from the forward pass with no attention matrix and negligible extra state. Our resulting cache eviction method, EpiKV, requires no training, classifier, or custom kernel, and can be used directly in FlashAttention inference stacks unchanged -- scaling to a 16x longer feasible context than attention-based scoring. upper-mid layers negatively) and remove a positional trend with a causal rolling z-score. At a 4096-token cache EpiKV reaches 72% on MATH-500, matching the strongest attention-based baseline (ThinKV 71%, H2O 67%); a lag-normalized KV variant reaches 37% on AIME-2024 at 8192 tokens against the best of them (33%), at up to 2.8x the speed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EpiKV matches attention-based eviction numbers on math tasks with a hidden-state change score, but the independence from attention still needs direct tests to hold up.

read the letter

The paper's core move is scoring tokens for KV eviction by the change in upper-mid layer representations during the forward pass, after a causal rolling z-score, instead of attention weights. This lets it run inside FlashAttention without materializing the matrix, and the reported results hit 72% on MATH-500 at 4096 tokens, matching ThinKV, plus a lag-normalized version at 37% on AIME-2024 at 8192 tokens.

What stands out is the simplicity: no training, no classifier, no custom kernel. The method reads the delta straight from the existing forward pass and claims 16x longer feasible context on the same hardware. That combination is useful for anyone shipping long-chain reasoning models today.

The soft spot is exactly the one the stress-test flags. Performance parity alone does not prove the epiphany score is a distinct signal. Without an ablation that swaps the scoring function on identical traces, or a correlation check between epiphany ranks and attention ranks, it is possible the two metrics overlap enough that the eviction dynamics, not the new proxy, drive the results. The abstract gives no such comparison.

The rest of the setup looks clean: free parameters are minimal, the score is computed on the fly, and the citation pattern does not lean on self-referential loops. Still, the experimental section would need tighter controls on baseline implementations and statistical reporting before the independence claim lands.

This is for people who maintain inference stacks for reasoning models and need longer contexts without kernel changes. A reader working on cache policies would find the idea worth testing even if the current evidence is only suggestive.

Send it to peer review. The practical angle is real and the method is lightweight enough that referees can check the missing ablations without much extra work.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes EpiKV, a KV-cache eviction method for long CoT reasoning models that replaces attention-weight scoring with an 'epiphany score' computed from the change in hidden-state representations in upper-mid layers after a causal rolling z-score. The method requires no training, no classifier, and no custom kernel, allowing direct use inside FlashAttention stacks. Reported results show EpiKV reaching 72% on MATH-500 at a 4096-token cache (matching ThinKV at 71% and exceeding H2O at 67%) and a lag-normalized variant reaching 37% on AIME-2024 at 8192 tokens (vs. 33% for the best baseline), while supporting up to 16x longer feasible contexts than attention-based eviction.

Significance. If the epiphany score proves to be a reliable, attention-independent importance signal, the approach would remove a practical barrier to fused-kernel inference for long-context reasoning models and could enable substantially longer CoT traces under fixed memory budgets. The absence of training or custom kernels is a concrete deployment advantage.

major comments (2)

[Experiments / Results] The central claim that the epiphany score supplies an attention-independent importance proxy rests on performance parity alone. No ablation, correlation, or rank-agreement analysis is presented that directly compares epiphany scores against attention weights on identical traces, nor is a controlled swap (attention vs. epiphany under the same eviction logic) reported; without such evidence it remains possible that eviction dynamics rather than the new signal drive the observed results.
[Experiments] The experimental section states performance numbers (72% MATH-500, 37% AIME) but supplies no details on random seeds, number of runs, statistical tests, or exact baseline re-implementations, making it impossible to assess whether the reported parity is robust or sensitive to implementation choices.

minor comments (2)

[Method] The abstract and method description would benefit from an explicit equation or pseudocode block defining the epiphany score (including the precise layer range and z-score window) so readers can reproduce the metric without ambiguity.
[Figures / Tables] Figure captions and table footnotes should clarify whether the 16x context scaling claim is measured under identical hardware or is a theoretical projection based on memory footprint.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to address these points. Below we respond to each major comment.

read point-by-point responses

Referee: [Experiments / Results] The central claim that the epiphany score supplies an attention-independent importance proxy rests on performance parity alone. No ablation, correlation, or rank-agreement analysis is presented that directly compares epiphany scores against attention weights on identical traces, nor is a controlled swap (attention vs. epiphany under the same eviction logic) reported; without such evidence it remains possible that eviction dynamics rather than the new signal drive the observed results.

Authors: We agree that direct evidence comparing the epiphany score to attention weights would strengthen the claim of an attention-independent proxy. Performance parity alone leaves open the possibility that eviction mechanics contribute to the results. In the revised manuscript we will add (i) Spearman rank correlation and top-k overlap statistics between epiphany scores and attention weights computed on the same traces, and (ii) a controlled ablation that applies identical eviction logic with either attention weights or epiphany scores. These additions will isolate the contribution of the new signal. revision: yes
Referee: [Experiments] The experimental section states performance numbers (72% MATH-500, 37% AIME) but supplies no details on random seeds, number of runs, statistical tests, or exact baseline re-implementations, making it impossible to assess whether the reported parity is robust or sensitive to implementation choices.

Authors: We acknowledge the omission of reproducibility details. The revised version will report the number of independent runs, random seeds, any statistical significance tests performed, and precise re-implementation notes for all baselines (including hyper-parameters and code references). revision: yes

Circularity Check

0 steps flagged

No circularity: epiphany score defined directly from forward-pass hidden-state deltas

full rationale

The paper introduces the epiphany score as an explicit definition—the change in internal representations from upper-mid layers after causal rolling z-score—computed solely from the forward pass without reference to attention weights, fitted parameters, or any self-cited prior result. No equation or claim reduces a prediction to its own inputs by construction, no uniqueness theorem is invoked, and no ansatz is smuggled via citation. Empirical performance numbers (72% MATH-500 at 4096 tokens, etc.) are presented as direct comparisons rather than derived quantities. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The central claim rests on the assumption that representation change is a valid importance signal; no free parameters, standard mathematical axioms, or new invented entities beyond the named metric are introduced in the abstract.

invented entities (1)

epiphany score no independent evidence
purpose: Token importance metric based on change in internal representation
Defined in the abstract as the change in the model's internal representation read from the forward pass.

pith-pipeline@v0.9.1-grok · 5744 in / 1021 out tokens · 19867 ms · 2026-06-26T05:45:06.279707+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 1 linked inside Pith

[1]

Gonzalez and Hao Zhang and Ion Stoica , title =

Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica , title =. Proceedings of the 29th Symposium on Operating Systems Principles (SOSP) , pages =
[2]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972
[3]

Publications Manual , year = "1983", publisher =

1983
[4]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =

1981
[5]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
[6]

Dan Gusfield , title =. 1997

1997
[7]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015
[8]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
[9]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Zhang, Zhenyu and Sheng, Ying and Zhou, Tianyi and Chen, Tianlong and Zheng, Lianmin and Cai, Ruisi and Song, Zhao and Tian, Yuandong and R. Advances in Neural Information Processing Systems (NeurIPS) , year =
[10]

International Conference on Learning Representations (ICLR) , volume=

Efficient Streaming Language Models with Attention Sinks , author =. International Conference on Learning Representations (ICLR) , volume=
[11]

2026 , note =

Ramachandran, Akshat and Neseem, Marina and Sakr, Charbel and Venkatesan, Rangharajan and Khailany, Brucek and Krishna, Tushar , booktitle =. 2026 , note =

2026
[12]

2025 , address =

Hu, Junhao and Huang, Wenrui and Wang, Weidong and Li, Zhenwen and Hu, Tiancheng and Liu, Zhixia and Chen, Xusheng and Xie, Tao and Shan, Yizhou , booktitle =. 2025 , address =

2025
[13]

Liang, Manlai and Zhang, JiaMing and Li, Xiong and Li, Jinlong , journal =
[14]

Advances in Neural Information Processing Systems (NeurIPS) , volume=

Dao, Tri and Fu, Dan and Ermon, Stefano and Rudra, Atri and R. Advances in Neural Information Processing Systems (NeurIPS) , volume=
[15]

Dao, Tri , booktitle =
[16]

Li, Yuhong and Huang, Yingbing and Yang, Bowen and Venkitesh, Bharat and Locatelli, Acyr and Ye, Hanchen and Cai, Tianle and Lewis, Patrick and Chen, Deming , booktitle =
[17]

Locating and Editing Factual Associations in

Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , booktitle =. Locating and Editing Factual Associations in
[18]

International Conference on Learning Representations (ICLR) , year =

Mass-Editing Memory in a Transformer , author =. International Conference on Learning Representations (ICLR) , year =
[19]

2025 , publisher=

Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and others , journal=. 2025 , publisher=

2025
[20]

Liu, Xiang and Tang, Zhenheng and Dong, Peijie and Li, Zeyu and Liu, Yue and Li, Bo and Hu, Xuming and Chu, Xiaowen , booktitle =
[21]

Cai, Zefan and Zhang, Yichi and Gao, Bofei and Liu, Yuliang and Li, Yucheng and Liu, Tianyu and Lu, Keming and Xiong, Wayne and Dong, Yue and Hu, Junjie and others , booktitle=
[22]

Goel, Raghavv and Park, Junyoung and Gagrani, Mukul and Jones, Dalton and Morse, Matthew and Langston, Harper and Lee, Mingu and Lott, Chris , journal =
[23]

Gu, Yifeng and Jiang, Zicong and Jin, Jianxiu and Guo, Kailing and Zhang, Ziyang and Xu, Xiangmin , journal =
[24]

Su, Yi and Tian, Zhenxu and Qiao, Dan and Zhou, Yuechi and Li, Juntao and Zhang, Min , journal =
[25]

Hooper, Coleman and Kim, Sehoon and Mohammadzadeh, Hiva and Mahoney, Michael W and Shao, Yakun S and Keutzer, Kurt and Gholami, Amir , booktitle =
[26]

Tang, Jiaming and Zhao, Yilong and Zhu, Kan and Xiao, Guangxuan and Kasikci, Baris and Han, Song , booktitle=
[27]

Guangda Liu and Chengwei Li and Zhenyu Ning and Jing Lin and Yiwu Yao and Danning Ke and Minyi Guo and Jieru Zhao , booktitle=. Free
[28]

2024 , organization=

Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang , booktitle=. 2024 , organization=

2024
[29]

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages =

Transformer Feed-Forward Layers Are Key-Value Memories , author =. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages =

2021
[30]

Edward , journal =

Kariyappa, Sanjay and Suh, G. Edward , journal =
[31]

2025 , address =

Sharma, Akshat and Ding, Hangliang and Li, Jianping and Dani, Neel and Zhang, Minjia , booktitle =. 2025 , address =

2025
[32]

Measuring Mathematical Problem Solving with the

Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , booktitle =. Measuring Mathematical Problem Solving with the
[33]

International Conference on Learning Representations , volume=

Let's Verify Step by Step , author=. International Conference on Learning Representations , volume=
[34]

arXiv preprint arXiv:2110.14168 , year =

Training Verifiers to Solve Math Word Problems , author =. arXiv preprint arXiv:2110.14168 , year =

Pith/arXiv arXiv

[1] [1]

Gonzalez and Hao Zhang and Ion Stoica , title =

Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica , title =. Proceedings of the 29th Symposium on Operating Systems Principles (SOSP) , pages =

[2] [2]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972

[3] [3]

Publications Manual , year = "1983", publisher =

1983

[4] [4]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =

1981

[5] [5]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

[6] [6]

Dan Gusfield , title =. 1997

1997

[7] [7]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015

[8] [8]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

[9] [9]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Zhang, Zhenyu and Sheng, Ying and Zhou, Tianyi and Chen, Tianlong and Zheng, Lianmin and Cai, Ruisi and Song, Zhao and Tian, Yuandong and R. Advances in Neural Information Processing Systems (NeurIPS) , year =

[10] [10]

International Conference on Learning Representations (ICLR) , volume=

Efficient Streaming Language Models with Attention Sinks , author =. International Conference on Learning Representations (ICLR) , volume=

[11] [11]

2026 , note =

Ramachandran, Akshat and Neseem, Marina and Sakr, Charbel and Venkatesan, Rangharajan and Khailany, Brucek and Krishna, Tushar , booktitle =. 2026 , note =

2026

[12] [12]

2025 , address =

Hu, Junhao and Huang, Wenrui and Wang, Weidong and Li, Zhenwen and Hu, Tiancheng and Liu, Zhixia and Chen, Xusheng and Xie, Tao and Shan, Yizhou , booktitle =. 2025 , address =

2025

[13] [13]

Liang, Manlai and Zhang, JiaMing and Li, Xiong and Li, Jinlong , journal =

[14] [14]

Advances in Neural Information Processing Systems (NeurIPS) , volume=

Dao, Tri and Fu, Dan and Ermon, Stefano and Rudra, Atri and R. Advances in Neural Information Processing Systems (NeurIPS) , volume=

[15] [15]

Dao, Tri , booktitle =

[16] [16]

Li, Yuhong and Huang, Yingbing and Yang, Bowen and Venkitesh, Bharat and Locatelli, Acyr and Ye, Hanchen and Cai, Tianle and Lewis, Patrick and Chen, Deming , booktitle =

[17] [17]

Locating and Editing Factual Associations in

Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , booktitle =. Locating and Editing Factual Associations in

[18] [18]

International Conference on Learning Representations (ICLR) , year =

Mass-Editing Memory in a Transformer , author =. International Conference on Learning Representations (ICLR) , year =

[19] [19]

2025 , publisher=

Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and others , journal=. 2025 , publisher=

2025

[20] [20]

Liu, Xiang and Tang, Zhenheng and Dong, Peijie and Li, Zeyu and Liu, Yue and Li, Bo and Hu, Xuming and Chu, Xiaowen , booktitle =

[21] [21]

Cai, Zefan and Zhang, Yichi and Gao, Bofei and Liu, Yuliang and Li, Yucheng and Liu, Tianyu and Lu, Keming and Xiong, Wayne and Dong, Yue and Hu, Junjie and others , booktitle=

[22] [22]

Goel, Raghavv and Park, Junyoung and Gagrani, Mukul and Jones, Dalton and Morse, Matthew and Langston, Harper and Lee, Mingu and Lott, Chris , journal =

[23] [23]

Gu, Yifeng and Jiang, Zicong and Jin, Jianxiu and Guo, Kailing and Zhang, Ziyang and Xu, Xiangmin , journal =

[24] [24]

Su, Yi and Tian, Zhenxu and Qiao, Dan and Zhou, Yuechi and Li, Juntao and Zhang, Min , journal =

[25] [25]

Hooper, Coleman and Kim, Sehoon and Mohammadzadeh, Hiva and Mahoney, Michael W and Shao, Yakun S and Keutzer, Kurt and Gholami, Amir , booktitle =

[26] [26]

Tang, Jiaming and Zhao, Yilong and Zhu, Kan and Xiao, Guangxuan and Kasikci, Baris and Han, Song , booktitle=

[27] [27]

Guangda Liu and Chengwei Li and Zhenyu Ning and Jing Lin and Yiwu Yao and Danning Ke and Minyi Guo and Jieru Zhao , booktitle=. Free

[28] [28]

2024 , organization=

Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang , booktitle=. 2024 , organization=

2024

[29] [29]

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages =

Transformer Feed-Forward Layers Are Key-Value Memories , author =. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages =

2021

[30] [30]

Edward , journal =

Kariyappa, Sanjay and Suh, G. Edward , journal =

[31] [31]

2025 , address =

Sharma, Akshat and Ding, Hangliang and Li, Jianping and Dani, Neel and Zhang, Minjia , booktitle =. 2025 , address =

2025

[32] [32]

Measuring Mathematical Problem Solving with the

Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , booktitle =. Measuring Mathematical Problem Solving with the

[33] [33]

International Conference on Learning Representations , volume=

Let's Verify Step by Step , author=. International Conference on Learning Representations , volume=

[34] [34]

arXiv preprint arXiv:2110.14168 , year =

Training Verifiers to Solve Math Word Problems , author =. arXiv preprint arXiv:2110.14168 , year =

Pith/arXiv arXiv