MultiHashFormer: Hash-based Generative Language Models
Pith reviewed 2026-06-29 04:09 UTC · model grok-4.3
The pith
MultiHashFormer represents each token as a unique sequence of hash IDs from multiple functions to enable autoregressive hash-based language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Each token is represented as a unique hash signature consisting of a short sequence of discrete hash IDs generated by multiple independent hash functions. The Hash Encoder compresses this signature into a single latent vector fed to the Transformer decoder. The Hash Decoder then produces the hash signature of the next token, which is resolved back to the corresponding text token.
What carries the argument
Multi-hash signature: a sequence of hash IDs from independent functions that uniquely identifies a token, allowing compression to one embedding vector and reconstruction by the decoder.
If this is right
- MultiHashFormer models at 100M, 1B and 3B parameters outperform standard Transformer LMs across multiple benchmarks.
- Vocabulary expansion to include additional languages occurs with no change to the overall parameter count.
- Embedding parameters no longer scale linearly with vocabulary size in generative settings.
Where Pith is reading between the lines
- The constant-parameter multilingual property could support models that dynamically incorporate new languages or domain-specific terms after initial training.
- Error accumulation in the hash-decoding step might become measurable only at sequence lengths far beyond typical benchmarks.
- The same multi-hash approach might combine with existing compression methods such as low-rank adapters to further reduce memory.
Load-bearing premise
That mapping a predicted hash signature back to a unique token remains reliable during autoregressive generation without ambiguity or systematic errors.
What would settle it
Generation runs in which the decoder produces hash signatures that map to the wrong token or collide with multiple tokens at a rate high enough to degrade output quality compared with standard models.
Figures
read the original abstract
Language models (LMs) represent tokens using embedding matrices that scale linearly with the vocabulary size. To constrain the parameter footprint, prior work proposes hashing many tokens into a single vector within encoder-only models. While this offers parameter efficiency, many-to-one collisions prevent its use in causal LMs. In this paper, we propose MultiHashFormer, a new framework that allows hash-based autoregression. Each token is represented as a unique hash signature, a short sequence of discrete hash IDs, generated by multiple independent hash functions. A Hash Encoder compresses this signature into a single latent vector for processing by a Transformer decoder. Then, a Hash Decoder generates the hash signature of the next token, which is then mapped back to text. We evaluate our approach at the 100M, 1B and 3B parameter scales, demonstrating that MultiHashFormer consistently outperforms standard Transformer LMs across multiple benchmarks. Furthermore, we show that our model handles multilingual vocabulary expansion with a constant parameter footprint without any modifications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MultiHashFormer, a generative LM architecture in which each token is represented by a unique multi-hash signature (short sequence of discrete hash IDs from independent hash functions). A Hash Encoder compresses the signature to a latent vector processed by a Transformer decoder; a Hash Decoder then autoregressively predicts the next signature, which is mapped back to a token. The paper reports that this yields consistent outperformance versus standard Transformers at the 100M, 1B and 3B scales on multiple benchmarks, while also enabling multilingual vocabulary expansion at constant parameter count.
Significance. If the reported gains are reproducible and the hash-inversion step proves reliable at scale, the framework would address a long-standing barrier to parameter-efficient hashing in causal LMs and could materially reduce the embedding-matrix cost of large or multilingual vocabularies. The work is distinguished by its explicit focus on autoregressive generation rather than encoder-only settings.
major comments (2)
- [Abstract] Abstract: the central empirical claim that MultiHashFormer 'consistently outperforms standard Transformer LMs across multiple benchmarks' at 100M–3B scales is presented without any metrics, baselines, ablation results, or error bars, so it is impossible to determine whether the reported advantage is load-bearing or an artifact of post-hoc choices.
- [Abstract] Abstract: the claim of 'constant parameter footprint' multilingual expansion rests on the unexamined assumption that the Hash Decoder can produce signatures that invert to unique tokens without ambiguity or systematic error during autoregressive generation; no quantitative bound on signature-prediction error rate, collision rate on OOV hash combinations, or recovery accuracy is supplied, leaving the inversion step as the least-secured link in the argument.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. We address each point below and will revise the manuscript to improve clarity and substantiation of the claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central empirical claim that MultiHashFormer 'consistently outperforms standard Transformer LMs across multiple benchmarks' at 100M–3B scales is presented without any metrics, baselines, ablation results, or error bars, so it is impossible to determine whether the reported advantage is load-bearing or an artifact of post-hoc choices.
Authors: We agree that the abstract would be strengthened by including specific metrics. The full manuscript reports detailed results including perplexity reductions and benchmark scores (e.g., on WikiText, C4, and multilingual tasks) with comparisons to standard Transformers at 100M, 1B, and 3B scales, plus ablations. In revision we will add concise quantitative highlights and note the presence of error bars from multiple runs to the abstract. revision: yes
-
Referee: [Abstract] Abstract: the claim of 'constant parameter footprint' multilingual expansion rests on the unexamined assumption that the Hash Decoder can produce signatures that invert to unique tokens without ambiguity or systematic error during autoregressive generation; no quantitative bound on signature-prediction error rate, collision rate on OOV hash combinations, or recovery accuracy is supplied, leaving the inversion step as the least-secured link in the argument.
Authors: The constant-parameter claim follows directly from replacing the embedding matrix with fixed-size hash functions and a decoder that outputs hash IDs rather than expanding vocabulary embeddings. The manuscript provides empirical evidence of successful multilingual expansion on held-out languages without parameter growth. We acknowledge that explicit quantitative bounds on per-step signature error, OOV collision rates, and end-to-end recovery accuracy are not tabulated in the current version; we will add these analyses (computed from the existing training runs) to the revision. revision: yes
Circularity Check
No circularity: empirical architecture proposal with no derivation chain
full rationale
The paper introduces MultiHashFormer as a new framework for hash-based autoregressive LMs, with each token as a multi-hash signature processed by Hash Encoder/Decoder. All central claims (outperformance at 100M-3B scales, constant-footprint multilingual expansion) are presented as results of empirical evaluation on benchmarks. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided text. The inversion step from predicted hash signature to token is an architectural assumption, not a derivation that reduces to its own inputs by construction. This is a standard non-circular empirical paper.
Axiom & Free-Parameter Ledger
invented entities (2)
-
Hash Encoder
no independent evidence
-
Hash Decoder
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Marah I Abdin, Jyoti Aneja, Harkirat S. Behl, S \' e bastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, and 8 others. 2024. https://doi.org/10.48550/ARXIV.2412....
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.08905 2024
-
[2]
Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, and Peter Clark. 2021. https://arxiv.org/abs/2102.03315 Think you have solved direct-answer question answering? try arc-da, the direct-answer AI2 reasoning challenge . CoRR, abs/2102.03315
arXiv 2021
-
[3]
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. https://doi.org/10.1609/AAAI.V34I05.6239 PIQA: reasoning about physical commonsense in natural language . In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The...
-
[4]
Trenton Bricken and Cengiz Pehlevan. 2021. https://proceedings.neurips.cc/paper/2021/hash/8171ac2c5544a5cb54ac0f38bf477af4-Abstract.html Attention approximates sparse distributed memory . In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pag...
2021
-
[5]
Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, and Furu Wei. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.148 Instruction pre-training: Language models are supervised multitask learners . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2529--2550, Miami, Florida, USA. Association for ...
-
[6]
Clark, Dan Garrette, Iulia Turc, and John Wieting
Jonathan H. Clark, Dan Garrette, Iulia Turc, and John Wieting. 2022. https://doi.org/10.1162/tacl_a_00448 Canine: Pre-training an efficient tokenization-free encoder for language representation . Transactions of the Association for Computational Linguistics, 10:73--91
-
[7]
Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. 2004. https://doi.org/10.1145/997817.997857 Locality-sensitive hashing scheme based on p-stable distributions . In Proceedings of the 20th ACM Symposium on Computational Geometry, Brooklyn, New York, USA, June 8-11, 2004 , pages 253--262. ACM
-
[8]
Bj \"o rn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, and Samuel Weinbach. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1217 T - FREE : Subword tokenizer-free generative LLM s via sparse representations for memory-efficient embeddings . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pag...
-
[9]
Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee F. Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, and 48 others. 2025. https://doi.org/10.48550/ARXIV.2512.13961 Olm...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2512.13961 2025
-
[10]
Kuzman Ganchev and Mark Dredze. 2008. https://aclanthology.org/W08-0804/ Small statistical models by random feature mixing . In Proceedings of the ACL -08: HLT Workshop on Mobile Language Processing , pages 19--20, Columbus, Ohio. Association for Computational Linguistics
2008
-
[11]
Christoph Goller and Andreas K \" u chler. 1996. https://doi.org/10.1109/ICNN.1996.548916 Learning task-dependent distributed representations by backpropagation through structure . In Proceedings of International Conference on Neural Networks (ICNN'96), Washington, DC, USA, June 3-6, 1996, pages 347--352. IEEE
-
[12]
Xingzhuo Guo, Junwei Pan, Ximei Wang, Baixu Chen, Jie Jiang, and Mingsheng Long. 2024. https://proceedings.mlr.press/v235/guo24e.html On the embedding collapse when scaling up recommendation models . In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024 , Proceedings of Machine Learning Research, pages 1...
2024
-
[13]
Wenhan Han, Yifan Zhang, Zhixun Chen, Binbin Li, Haobin Lin, Bingni Zhang, Taifeng Wang, Mykola Pechenizkiy, Meng Fang, and Yin Zheng. 2025. https://doi.org/10.48550/ARXIV.2506.19468 Mubench: Assessment of multilingual capabilities of large language models across 61 languages . CoRR, abs/2506.19468
-
[14]
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. https://openreview.net/forum?id=d7KBjmI3GmQ Measuring massive multitask language understanding . In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 . OpenReview.net
2021
-
[15]
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, and 3 others. 2022. http://papers.nips.cc...
2022
-
[16]
Sukjun Hwang, Brandon Wang, and Albert Gu. 2025. https://doi.org/10.48550/ARXIV.2507.07955 Dynamic chunking for end-to-end hierarchical sequence modeling . CoRR, abs/2507.07955
-
[17]
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L \' e lio Renard Lavaud, Marie - Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth \' e e Lacroix, and William El Sayed. 2023. https://doi.org/...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.06825 2023
-
[18]
Taku Kudo and John Richardson. 2018. https://doi.org/10.18653/v1/D18-2012 S entence P iece: A simple and language independent subword tokenizer and detokenizer for neural text processing . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66--71, Brussels, Belgium. Association for Compu...
work page internal anchor Pith review doi:10.18653/v1/d18-2012 2018
-
[19]
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. https://doi.org/10.18653/v1/D17-1082 RACE : Large-scale R e A ding comprehension dataset from examinations . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785--794, Copenhagen, Denmark. Association for Computational Linguistics
-
[20]
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. https://openreview.net/forum?id=H1eA7AEtvS ALBERT: A lite BERT for self-supervised learning of language representations . In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020 . OpenReview.net
2020
-
[21]
Houyi Li, Wenzhen Zheng, Jingcheng Hu, Qiufeng Wang, Hanshan Zhang, Zili Wang, Shijie Xuyang, Yuantao Fan, Shuigeng Zhou, Xiangyu Zhang, and Daxin Jiang. 2025. https://doi.org/10.48550/ARXIV.2503.04715 Predictable scale: Part I - optimal hyperparameter scaling law in large language model pretraining . CoRR, abs/2503.04715
-
[22]
Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. https://doi.org/10.18653/v1/2022.acl-long.229 T ruthful QA : Measuring how models mimic human falsehoods . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214--3252, Dublin, Ireland. Association for Computational Linguistics
-
[23]
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. https://doi.org/10.18653/v1/D18-1260 Can a suit of armor conduct electricity? a new dataset for open book question answering . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381--2391, Brussels, Belgium. Association for Computational Li...
-
[24]
Benjamin Minixhofer, Tyler Murray, Tomasz Limisiewicz, Anna Korhonen, Luke Zettlemoyer, Noah A. Smith, Edoardo M. Ponti, Luca Soldaini, and Valentin Hofmann. 2025. https://doi.org/10.48550/ARXIV.2512.15586 Bolmo: Byteifying the next generation of language models . CoRR, abs/2512.15586
-
[25]
Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. https://doi.org/10.18653/v1/N16-1098 A corpus and cloze evaluation for deeper understanding of commonsense stories . In Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computation...
-
[26]
Nandini Mundra, Aditya Nanda Kishore Khandavally, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, and Mitesh M Khapra. 2024. https://doi.org/10.18653/v1/2024.conll-1.8 An empirical comparison of vocabulary expansion and initialization approaches for language models . In Proceedings of the 28th Conference on Computational Natural Language Learning, pages...
-
[27]
Itay Nakash, Nitay Calderon, Eyal Ben - David, Elad Hoffer, and Roi Reichart. 2025. https://doi.org/10.48550/ARXIV.2503.19693 Adaptivocab: Enhancing LLM efficiency in focused domains through lightweight vocabulary adaptation . CoRR, abs/2503.19693
-
[28]
Abraham Toluwase Owodunni and Sachin Kumar. 2025. https://doi.org/10.48550/ARXIV.2509.11414 Continually adding new languages to multilingual language models . CoRR, abs/2509.11414
-
[29]
Artidoro Pagnoni, Ramakanth Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason E Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, and Srini Iyer. 2025. https://doi.org/10.18653/v1/2025.acl-long.453 Byte latent transformer: Patches scale better than tokens . In Proceedings of the 63rd Annual ...
-
[30]
Denis Paperno, Germ \'a n Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fern \'a ndez. 2016. https://doi.org/10.18653/v1/P16-1144 The LAMBADA dataset: Word prediction requiring a broad discourse context . In Proceedings of the 54th Annual Meeting of the Association for Computati...
-
[31]
Raffel, Leandro von Werra, and Thomas Wolf
Guilherme Penedo, Hynek Kydl \' cek, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin A. Raffel, Leandro von Werra, and Thomas Wolf. 2024. http://papers.nips.cc/paper\_files/paper/2024/hash/370df50ccfdf8bde18f8f9c2d9151bda-Abstract-Datasets\_and\_Benchmarks\_Track.html The fineweb datasets: Decanting the web for the finest text data at scale . In...
2024
-
[32]
Guilherme Penedo, Hynek Kydl \' cek, Vinko Sabolcec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro von Werra, and Thomas Wolf. 2025. https://doi.org/10.48550/ARXIV.2506.20920 Fineweb2: One pipeline to scale them all - adapting pre-training data processing to every language . CoRR, abs/2506.20920
-
[33]
Mohammad Taher Pilehvar, Dimitri Kartsaklis, Victor Prokhorov, and Nigel Collier. 2018. https://doi.org/10.18653/v1/D18-1169 Card-660: C ambridge rare word dataset - a reliable benchmark for infrequent word representation models . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1391--1401, Brussels, Belgium...
-
[34]
Prafull Prakash, Saurabh Kumar Shashidhar, Wenlong Zhao, Subendhu Rongali, Haidar Khan, and Michael Kayser. 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.423 Compressing transformer-based semantic parsing models using compositional code embeddings . In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4711--4717, Online....
-
[35]
Ofir Press and Lior Wolf. 2017. https://aclanthology.org/E17-2025/ Using the output embedding to improve language models . In Proceedings of the 15th Conference of the E uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , pages 157--163, Valencia, Spain. Association for Computational Linguistics
2017
-
[36]
Fran c ois Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, and Thomas Demeester. 2024. https://doi.org/10.48550/ARXIV.2408.04303 Trans-tokenization and cross-lingual vocabulary transfers: Language adaptation of llms for low-resource NLP . CoRR, abs/2408.04303
-
[37]
Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. http://www.aaai.org/ocs/index.php/SSS/SSS11/paper/view/2418 Choice of plausible alternatives: An evaluation of commonsense causal reasoning . In Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA...
2011
-
[38]
Chinnadhurai Sankar, Sujith Ravi, and Zornitsa Kozareva. 2021. https://doi.org/10.18653/v1/2021.eacl-main.246 P ro F ormer: Towards on-device LSH projection based transformers . In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2823--2828, Online. Association for Computationa...
-
[39]
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. https://doi.org/10.18653/v1/D19-1454 Social IQ a: Commonsense reasoning about social interactions . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),...
-
[40]
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. https://doi.org/10.18653/v1/P16-1162 Neural machine translation of rare words with subword units . In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715--1725, Berlin, Germany. Association for Computational Linguistics
-
[41]
Hajime Senuma. 2025. https://doi.org/10.21105/JOSS.06124 mmh3: A python extension for murmurhash3 . J. Open Source Softw., 10(105):6124
-
[42]
Raphael Shu and Hideki Nakayama. 2018. https://openreview.net/forum?id=BJRZzFlRb Compressing word embeddings via deep compositional code learning . In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings . OpenReview.net
2018
-
[43]
Skarda and Walter J
Christine A. Skarda and Walter J. Freeman. 1987. How brains make chaos in order to make sense of the world. Behavioral and Brain Sciences, 10(2):161--173
1987
-
[44]
Ng, and Christopher D
Richard Socher, Cliff Chiung - Yu Lin, Andrew Y. Ng, and Christopher D. Manning. 2011. https://icml.cc/2011/papers/125\_icmlpaper.pdf Parsing natural scenes and natural language with recursive neural networks . In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011 , pages 129-...
2011
-
[45]
Dan Svenstrup, Jonas Meinertz Hansen, and Ole Winther. 2017. https://proceedings.neurips.cc/paper/2017/hash/f0f6ba4b5e0000340312d33c212c3ae8-Abstract.html Hash embeddings for efficient word representations . In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Be...
2017
-
[46]
Qwen Team. 2025. https://doi.org/10.48550/ARXIV.2505.09388 Qwen3 technical report . CoRR, abs/2505.09388
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388 2025
-
[47]
Atula Tejaswi, Nilesh Gupta, and Eunsol Choi. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.614 Exploring design choices for building language-specific LLM s . In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10485--10500, Miami, Florida, USA. Association for Computational Linguistics
-
[48]
Ichiro Tsuda. 2001. https://doi.org/10.1017/S0140525X01000097 Toward an interpretation of dynamic neural activity in terms of chaotic dynamical systems . Behavioral and Brain Sciences, 24(5):793–810
-
[49]
Gomez, Lukasz Kaiser, and Illia Polosukhin
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html Attention is all you need . In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Proces...
2017
-
[50]
Mathurin Videau, Badr Youbi Idrissi, Alessandro Ferreira Leite, Marc Schoenauer, Olivier Teytaud, and David Lopez - Paz. 2025. https://doi.org/10.48550/ARXIV.2506.14761 From bytes to ideas: Language modeling with autoregressive u-nets . CoRR, abs/2506.14761
-
[51]
Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. https://doi.org/10.18653/v1/W17-4413 Crowdsourcing multiple choice science questions . In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94--106, Copenhagen, Denmark. Association for Computational Linguistics
-
[52]
Huiyin Xue and Nikolaos Aletras. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.536 H ash F ormers: Towards vocabulary-independent pre-trained transformers . In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7862--7874, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics
-
[53]
Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. 2022. https://doi.org/10.1162/tacl_a_00461 B y T 5: Towards a token-free future with pre-trained byte-to-byte models . Transactions of the Association for Computational Linguistics, 10:291--306
-
[54]
Atsuki Yamaguchi, Terufumi Morishita, Aline Villavicencio, and Nikolaos Aletras. 2025. https://openreview.net/forum?id=6IdoIKowfe Adapting chat language models using only target unlabeled language data . Trans. Mach. Learn. Res., 2025
2025
-
[55]
Atsuki Yamaguchi, Aline Villavicencio, and Nikolaos Aletras. 2026. https://doi.org/10.1162/COLI.a.581 How can we effectively expand the vocabulary of LLM s with 0.01 GB of target language text? Computational Linguistics, 52(1):295--330
-
[56]
Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018. https://openreview.net/forum?id=HkwZSG-CZ Breaking the softmax bottleneck: A high-rank RNN language model . In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings . OpenReview.net
2018
-
[57]
Yunzhi Yao, Shaohan Huang, Wenhui Wang, Li Dong, and Furu Wei. 2021. https://doi.org/10.18653/v1/2021.findings-acl.40 Adapt-and-distill: Developing small, fast and effective pretrained language models for domains . In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 460--470, Online. Association for Computational Linguistics
-
[58]
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. https://doi.org/10.18653/v1/P19-1472 H ella S wag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791--4800, Florence, Italy. Association for Computational Linguistics
-
[59]
Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. https://arxiv.org/abs/1810.12885 Record: Bridging the gap between human and machine commonsense reading comprehension . CoRR, abs/1810.12885
Pith/arXiv arXiv 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.