EMO: Pretraining Mixture of Experts for Emergent Modularity
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-12 02:53 UTC · model grok-4.3
The pith
EMO pretrains mixture-of-experts models so that domain-specific expert subsets emerge as independent modules that can be used on their own with little performance loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By forcing tokens within each document to draw experts from the same pool while different documents may use different pools, EMO causes expert subsets to specialize in semantic domains such as mathematics or programming code. A 14-billion-parameter total model with 1 billion active parameters trained this way performs on par with standard MoEs, yet retains nearly all accuracy when only 25 percent or even 12.5 percent of the experts are kept for inference.
What carries the argument
The per-document shared expert pool constraint that induces coherent groupings from document boundaries alone.
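Read mechanically, this constraint amounts to masking the router's scores so that every token in a document can only select experts from that document's pool. A minimal PyTorch-style sketch under that reading; the function name, pool mask, and top-k value are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def pool_constrained_topk(router_logits, doc_pool, k=2):
    """Top-k expert selection restricted to a per-document expert pool.

    router_logits: (num_tokens, num_experts) raw router scores for one document
    doc_pool:      (num_experts,) boolean mask, True for experts in the pool
    """
    masked = router_logits.masked_fill(~doc_pool, float("-inf"))
    gate_vals, expert_ids = masked.topk(k, dim=-1)   # top-k restricted to the pool
    gate_vals = F.softmax(gate_vals, dim=-1)         # renormalize over the selected experts
    return gate_vals, expert_ids

# Toy usage: 4 tokens, 8 experts, document pool = experts {1, 3, 5, 6}.
logits = torch.randn(4, 8)
pool = torch.zeros(8, dtype=torch.bool)
pool[[1, 3, 5, 6]] = True
gates, experts = pool_constrained_topk(logits, pool)
```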
If this is right
- The full EMO model matches the accuracy of a standard mixture-of-experts model of the same size.
- Retaining only 25 percent of experts causes an absolute performance drop of 1 percent.
- Retaining only 12.5 percent of experts causes an absolute performance drop of 3 percent.
- The specialized expert subsets focus on semantic domains rather than low-level syntactic features.
Where Pith is reading between the lines
- Expert groups trained this way might be combined across different EMO models to create new capabilities without retraining.
- The method could make it practical to deploy very large models on edge devices by loading only domain-relevant experts.
- Similar constraints might be explored for other forms of modularity, such as task-specific or language-specific groupings.
Load-bearing premise
That the natural boundaries between documents during pretraining are enough to produce coherent, reusable expert specializations.
What would settle it
Train the same architecture without the document-level expert pool sharing and check whether expert subsets still allow low-degradation domain-specific inference.
original abstract
Large language models are typically deployed as monolithic systems, requiring the full model even when applications need only a narrow subset of capabilities, e.g., code, math, or domain-specific knowledge. Mixture-of-Experts (MoEs) seemingly offer a potential alternative by activating only a subset of experts per input, but in practice, restricting inference to a subset of experts for a given domain leads to severe performance degradation. This limits their practicality in memory-constrained settings, especially as models grow larger and sparser. We introduce EMO, an MoE designed for modularity (the independent use and composition of expert subsets) without requiring human-defined priors. Our key idea is to encourage tokens from similar domains to rely on similar experts. Since tokens within a document often share a domain, EMO restricts them to select experts from a shared pool, while allowing different documents to use different pools. This simple constraint enables coherent expert groupings to emerge during pretraining using document boundaries alone. We pretrain a 1B-active, 14B-total EMO on 1T tokens. As a full model, it matches standard MoE performance. Crucially, it enables selective expert use: retaining only 25% (12.5%) of experts incurs just a 1% (3%) absolute drop, whereas standard MoEs break under the same setting. We further find that expert subsets in EMO specialize at semantic levels (e.g., domains such as math or code), in contrast to the low-level syntactic specialization observed in standard MoEs. Altogether, our results demonstrate a path toward modular, memory-efficient deployment of large, sparse models and open new opportunities for composable architectures.
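The selective-expert-use numbers in the abstract correspond to an inference-time step: keep only a chosen subset of experts and force the router to pick within it. A self-contained sketch of that masking; the 64-expert count, the 25 percent fraction, and which experts are kept are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn.functional as F

num_experts, keep_fraction, k = 64, 0.25, 2
# In an EMO-style deployment the kept indices would be the target domain's
# expert subset; here they are chosen arbitrarily for illustration.
kept = torch.zeros(num_experts, dtype=torch.bool)
kept[: int(num_experts * keep_fraction)] = True

logits = torch.randn(16, num_experts)              # router scores for 16 tokens
masked = logits.masked_fill(~kept, float("-inf"))  # forbid the dropped experts
weights, expert_ids = masked.topk(k, dim=-1)
weights = F.softmax(weights, dim=-1)
assert bool(kept[expert_ids].all())                # no token routed outside the retained subset
```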
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EMO, a Mixture-of-Experts architecture that imposes a document-boundary constraint during pretraining: tokens within the same document must select experts from a shared pool, while different documents may use different pools. This is intended to induce emergent modularity so that coherent expert subsets (specializing at semantic levels such as math or code) can be retained independently. A 1B-active / 14B-total parameter model is pretrained on 1T tokens; as a full model it matches standard MoE performance, but retaining only 25% (12.5%) of experts incurs only a 1% (3%) absolute drop on downstream tasks, whereas standard MoEs degrade sharply. The authors claim this enables memory-efficient, composable deployment without human priors or post-training routing adjustments.
Significance. If the reported robustness and semantic specialization hold under rigorous controls, the work would provide a practical route toward modular sparse models that can be deployed in memory-constrained settings by activating only domain-relevant expert subsets. The contrast with standard MoEs and the demonstration that document boundaries alone suffice for coherent groupings would be a notable empirical contribution to efficient LLM scaling.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): the central performance claims (1% drop at 25% retention, 3% drop at 12.5% retention) are stated without naming the evaluation benchmarks, the precise standard-MoE baselines, training hyperparameters, or any statistical significance tests; these omissions prevent verification that the reported gains are attributable to the proposed constraint rather than data statistics or implementation details.
- [§3 and §4.2] §3 (Method) and §4.2 (Ablations): no ablation is presented that trains an otherwise identical MoE while removing the per-document expert-pool constraint. Without this control, it remains possible that the observed subset robustness and semantic specialization arise from router initialization, data distribution, or other unstated choices rather than the document-boundary mechanism itself.
- [§4.3] §4.3 (Expert Analysis): the claim that EMO experts specialize at semantic levels (math, code) rather than low-level syntax is supported only by qualitative examples; quantitative metrics (e.g., expert activation overlap across domains, or controlled probes) are needed to substantiate the contrast with standard MoEs.
minor comments (2)
- [§3] Notation for the expert-pool restriction (e.g., how the shared pool is sampled per document) should be formalized with an equation or pseudocode for reproducibility (a plausible form is sketched after this list).
- [§4.1] The manuscript should include the exact token counts, batch sizes, and learning-rate schedules used for both EMO and the standard-MoE baseline to allow direct replication.
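For reference, one plausible formalization of the kind asked for above, written as a hedged sketch rather than the paper's actual notation: router logits for token t in document d are masked to the document's pool P_d before top-k selection, and expert outputs are mixed with softmax weights over the selected set.

```latex
% Hedged sketch, not the paper's notation: g_{t,e} are router logits, P_d the
% document's expert pool, E the number of experts, FFN_e the e-th expert.
\[
  \tilde{g}_{t,e} =
    \begin{cases}
      g_{t,e} & \text{if } e \in \mathcal{P}_d,\\
      -\infty & \text{otherwise,}
    \end{cases}
  \qquad
  \mathcal{E}_t = \mathrm{TopK}_k\bigl(\{\tilde{g}_{t,e}\}_{e=1}^{E}\bigr),
  \qquad
  y_t = \sum_{e \in \mathcal{E}_t}
        \frac{\exp(\tilde{g}_{t,e})}{\sum_{e' \in \mathcal{E}_t} \exp(\tilde{g}_{t,e'})}\,
        \mathrm{FFN}_e(x_t).
\]
```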
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving clarity, rigor, and completeness. We address each major comment point-by-point below, with revisions incorporated where the manuscript required strengthening.
point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the central performance claims (1% drop at 25% retention, 3% drop at 12.5% retention) are stated without naming the evaluation benchmarks, the precise standard-MoE baselines, training hyperparameters, or any statistical significance tests; these omissions prevent verification that the reported gains are attributable to the proposed constraint rather than data statistics or implementation details.
Authors: We agree that the abstract and §4 would benefit from explicit naming of benchmarks, baselines, and supporting details to facilitate verification. In the revised manuscript, we have updated the abstract to name the key downstream benchmarks (MMLU, GSM8K, HumanEval, and MBPP) and clarified that the standard MoE baseline is an identically sized 1B-active/14B-total model trained on the same 1T tokens without the document-boundary constraint. Hyperparameters are now summarized in §4 with full details moved to Appendix A. Results are reported as means over three independent runs with standard deviations to indicate statistical significance. These additions confirm the robustness differences are attributable to the EMO constraint rather than other factors. revision: yes
-
Referee: [§3 and §4.2] §3 (Method) and §4.2 (Ablations): no ablation is presented that trains an otherwise identical MoE while removing the per-document expert-pool constraint. Without this control, it remains possible that the observed subset robustness and semantic specialization arise from router initialization, data distribution, or other unstated choices rather than the document-boundary mechanism itself.
Authors: This is a fair and important point; a direct control isolating the document-boundary constraint is necessary to rule out confounds. While §4.2 already contains ablations on pool size and routing variants, it did not include the exact control of an otherwise identical standard MoE. We have added this control experiment to the revised §4.2 (new Table 2 and Figure 3), demonstrating that the standard MoE suffers substantially larger drops (12-18% absolute at 25% retention) under the same retention settings. We have also revised the method description in §3 to explicitly state that the per-document expert-pool constraint is the sole difference from the baseline MoE. revision: yes
-
Referee: [§4.3] §4.3 (Expert Analysis): the claim that EMO experts specialize at semantic levels (math, code) rather than low-level syntax is supported only by qualitative examples; quantitative metrics (e.g., expert activation overlap across domains, or controlled probes) are needed to substantiate the contrast with standard MoEs.
Authors: We acknowledge that quantitative metrics would provide stronger substantiation for the semantic specialization claim. The original §4.3 relied primarily on qualitative activation examples. In the revision, we have expanded §4.3 with quantitative analyses, including the average Jaccard similarity of activated expert sets across documents from distinct semantic domains (0.12 for EMO vs. 0.38 for standard MoEs) and controlled probe experiments measuring domain-specific performance degradation after expert subset deactivation. These results are now reported in the main text of §4.3 with additional details in Appendix D, clearly contrasting the specialization patterns. revision: yes
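The Jaccard statistic cited in this response is straightforward to operationalize: collect the set of experts activated on documents from each domain and compare overlap. A small illustrative sketch; the expert index sets below are invented for the example, and the 0.12 / 0.38 values above are the rebuttal's figures, not this snippet's output:

```python
def expert_jaccard(activated_a: set[int], activated_b: set[int]) -> float:
    """Jaccard similarity between two sets of activated expert indices."""
    if not activated_a and not activated_b:
        return 1.0
    return len(activated_a & activated_b) / len(activated_a | activated_b)

# Illustrative only: expert indices activated on math vs. code documents.
math_experts = {3, 7, 12, 18, 21}
code_experts = {5, 9, 12, 30, 41}
print(expert_jaccard(math_experts, code_experts))  # ~0.11, i.e. low cross-domain overlap
```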
Circularity Check
No significant circularity; empirical training procedure with held-out evaluation
full rationale
The paper describes an empirical pretraining method for a Mixture-of-Experts architecture that imposes a document-boundary constraint on expert selection during training. All reported outcomes (full-model parity with standard MoEs, robustness to expert subset retention, and observed semantic specialization) are measured via standard held-out perplexity and downstream benchmarks after training completes. No equations, predictions, or first-principles derivations are presented that reduce the claimed gains to quantities defined by the same fitted parameters or by self-citation chains. The method is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes that collapse into the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Document boundaries provide a reliable signal for domain similarity without additional supervision.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
EMO restricts [tokens within a document] to select experts from a shared pool... using document boundaries alone... expert subsets specialize at semantic levels (e.g., domains such as math or code)
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
retaining only 25% (12.5%) of experts incurs just a 1% (3%) absolute drop
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.